distributed query processing +

Upload: prashant-deep

Post on 08-Apr-2018

  • 8/7/2019 Distributed Query Processing +

    1/19

    Al-Quds University

    Distributed Query Processing

    Database

    For: Dr. Badee Sartawi

    Name: Hiba Jafar

    Date: 30-5-2003

    Table of contents:

    Sec.1 Introduction to (DQP).
    Sec.2 The Objectives.
    Sec.3 General area of research.
    Sec.4 History.
    Sec.5 What is (DQP).
    Sec.6 Why to use (DQP), and the issues that still make distributed data processing complex.
    Sec.7 Where and how (DQP) works.
    Sec.8 Query processing.
    Sec.9 Architecture of the system.
    Sec.10 Some advantages of (DQP).
    Sec.11 Running example.
    Sec.12 Summary.
    Sec.13 References.

    Abstract:

    This paper presents the textbook architecture for distributed query processing and a
    series of techniques that are particularly useful for distributed database systems, and
    shows how query processing works in these systems.

    A very large body of work exists in the general area of database systems. This work
    can be roughly classified into work on architectures and techniques for transaction
    processing (i.e., quickly processing small update operations), work on query processing
    (i.e., mostly read operations that explore large amounts of data), and work on data
    models, languages, and user interfaces for advanced applications. This paper focuses
    primarily on query processing. A discussion of transaction processing and of alternative
    data models is beyond its scope, and it does not attempt to cover all query processing
    techniques in use today.

    1. INTRODUCTION

    A distributed database (DDB) is a collection of multiple, logically interrelated databases
    distributed over a computer network. Distributing databases over a network achieves
    the advantages of performance, reliability, availability, and modularity that are inherent in
    distributed systems.

    As with traditional centralized databases, distributed database systems (DDBSs) must
    provide an efficient user interface that hides all of the underlying data distribution details of
    the DDB from the users. A relational query lets the user describe the data that is
    required without having to know where the data is physically located.

    The retrieval of data from different sites in a DDB is referred to as distributed query

    processing.

    Oracle distributed database systems employ a distributed processing architecture. Thus, an
    Oracle database server acts as a client when it requests data that another Oracle database
    server manages. For example, the following query accesses data from the local database as
    well as from the remote sales database. The first table (EMP) is located at site1 and the
    second table (DEPT) at site2:

    SELECT ename, dname

    FROM scott.emp e, [email protected]_auto.com d

    WHERE e.deptno = d.deptno

    A distributed query is thus one that selects data from databases located at multiple sites in a
    network, while distributed processing performs computations on multiple CPUs to achieve a
    single result. Any SQL data manipulation statement that references tables at sites other than
    the site to which the application program is submitted for compilation (i.e., the query site) is
    called a distributed query and must be processed as such.
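
    The idea of one declarative query spanning two sites can be sketched in Python. This is
    only a minimal illustration under stated assumptions: two in-memory SQLite databases stand
    in for site1 and site2, and the sample rows are invented. It does not reproduce Oracle's
    database link mechanism.

```python
import sqlite3

# Assumption: two in-memory SQLite databases play the roles of the two sites.
# A real DDBS would reach the second site over a network link.
conn = sqlite3.connect(":memory:")                    # stands in for site1
conn.execute("ATTACH DATABASE ':memory:' AS site2")   # stands in for site2

conn.execute("CREATE TABLE main.emp (ename TEXT, deptno INTEGER)")
conn.execute("CREATE TABLE site2.dept (deptno INTEGER, dname TEXT)")
conn.executemany("INSERT INTO main.emp VALUES (?, ?)",
                 [("SMITH", 20), ("ALLEN", 30), ("KING", 10)])
conn.executemany("INSERT INTO site2.dept VALUES (?, ?)",
                 [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES")])

# One query joins a local table with a table stored in the other database;
# the user names the data wanted, not how to fetch it.
rows = conn.execute(
    "SELECT e.ename, d.dname "
    "FROM main.emp e JOIN site2.dept d ON e.deptno = d.deptno "
    "ORDER BY e.ename"
).fetchall()
print(rows)
```

    The query text itself does not say how rows move between the two databases; that decision
    is left to the system, which is the point of distribution transparency.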

    Query processing is much more difficult in a distributed environment than in a centralized
    environment because:

    A large number of parameters affect the performance of distributed queries.

    Relations involved in a distributed query may be fragmented and/or replicated.

    With many sites to access, query response time may become very high.

    The performance of a DDBS depends critically on the ability of
    the query optimization algorithm to derive efficient query processing strategies. DDBMS
    query optimization algorithms attempt to reduce the quantity of data transferred.
    Minimizing the quantity of data transferred is a desirable optimization criterion because
    transporting more data across telecommunications networks takes more time and effort.
    Distributed query optimization faces several problems that relate to the cost model, the larger
    set of queries, the tradeoff between optimization cost and execution cost, and the
    optimization/reoptimization interval.
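
    To make the data-transfer criterion concrete, the following sketch compares the number of
    bytes shipped when a two-relation join runs at either of the two sites. The Relation class,
    the sizes, and the restriction to two candidate strategies are assumptions made for
    illustration; this is not the cost model of any particular DDBMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    site: str       # site that stores the relation
    rows: int
    row_bytes: int

    @property
    def size(self) -> int:
        return self.rows * self.row_bytes


def transfer_costs(left, right, result_bytes, query_site):
    """Bytes shipped over the network when the join runs at each candidate site."""
    costs = {}
    for join_site in sorted({left.site, right.site}):
        # Every relation not stored at the join site must be shipped there.
        shipped = sum(r.size for r in (left, right) if r.site != join_site)
        # The result must travel back if the join did not run at the query site.
        if join_site != query_site:
            shipped += result_bytes
        costs[join_site] = shipped
    return costs


# Hypothetical sizes: a large EMP at site1 and a small DEPT at site2.
emp = Relation("EMP", "site1", rows=10_000, row_bytes=100)
dept = Relation("DEPT", "site2", rows=50, row_bytes=40)
costs = transfer_costs(emp, dept, result_bytes=600_000, query_site="site1")
best_site = min(costs, key=costs.get)
print(costs, best_site)
```

    With these made-up numbers, shipping the small DEPT relation to the query site moves far
    less data than shipping EMP away and the result back, so the join is placed at site1.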

    2. The Objectives:

    * The objectives of this paper are:

    To learn what distributed query processing is.

    To understand the main idea of (DQP).

    To understand how it works.

    To know the goals of (DQP).

    * The objectives of using (DQP) itself:

    The goal is to execute distributed queries as efficiently as possible in order to minimize the
    response time that users must wait for answers or the time application programs are
    delayed; to minimize the total communication cost associated with a query; and
    to improve throughput via parallel processing, sharing of data and equipment, and
    modular expansion of data management capacity. In addition, when redundant data is
    maintained, one also achieves increased data reliability and improved response time.

    3. General area of research:

    This paper shows that there are many different reasons to
    rely on distributed architectures and, correspondingly, that many different kinds of distributed
    systems exist. Sometimes it is only the software and not the hardware that is distributed. The
    purpose of this paper is to give a comprehensive overview of what query processing

    techniques are needed to implement any kind of distributed database and information system.
    It is assumed that users and application programs issue queries using a declarative query
    language such as SQL or OQL, without knowing where and in which format the data is
    stored in the distributed system.

    4. History (Background and Motivation):

    Researchers and practitioners have been interested in distributed database systems
    since the 1970s. At that time, the main focus was on supporting distributed
    data management for large corporations and organizations that kept their data at
    different offices or subsidiaries.

    Although there was a clear need and many good ideas and prototypes, such as Distributed
    Ingres, the early efforts in building distributed database systems were never
    commercially successful. In some respects, the early distributed database systems were
    ahead of their time. First, communication technology was not stable enough to ship
    megabytes of data as required for these systems. Second, large businesses somehow
    managed to survive without sophisticated distributed database technology by sending
    tapes, diskettes, or just paper to exchange data between their offices. Today, the
    situation has changed dramatically.

    Distributed data processing is both feasible and needed. Almost all major database

    system vendors offer products to support distributed data processing (e.g., IBM,

    Informix, Microsoft, Oracle, Sybase), and large database application systems have a

    distributed architecture (e.g., business application systems such as Baan IV, Oracle

    Finance, Peoplesoft 7.5, and SAP R/3). Distributed data processing is feasible because

    of recent technological advances (e.g., hardware, software protocols,

    standards). Distributed data processing is needed because of changing business

    requirements, which have made distributed data processing cost-effective and in

    certain situations the only viable option.

    5. What is Distributed Query Processing:

    A distributed database (DDB) consists of copies of data files (often redundant)
    distributed on a network of computers.

    Query processing (or data retrieval) is an important problem in distributed
    databases. Accessing data distributed across different computer sites requires
    transmitting data over communication links. Since communication delay is
    substantial, the database management system must devise an efficient strategy to
    coordinate data processing at local computer sites and data transmission between
    sites. The problem is compounded in a redundant database, because choosing which of the
    redundant copies to access becomes an important issue.
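
    The choice among redundant copies can be illustrated with a small sketch. It assumes, purely
    for illustration, that each copy is described by a link latency and a transfer rate, and that
    the estimated access time is the latency plus the transfer time; the function name and all
    numbers are hypothetical.

```python
def pick_replica(replicas, data_bytes):
    """Return the site whose copy has the lowest estimated access time.

    replicas maps a site name to (latency in seconds, transfer rate in bytes/second).
    """
    def estimated_seconds(site):
        latency_s, bytes_per_s = replicas[site]
        return latency_s + data_bytes / bytes_per_s

    return min(replicas, key=estimated_seconds)


# Hypothetical copies: site1 is far away but fast, site2 is close but slow.
replicas = {"site1": (0.200, 1_000_000), "site2": (0.020, 250_000)}

small_read = pick_replica(replicas, data_bytes=1_000)
large_read = pick_replica(replicas, data_bytes=10_000_000)
print(small_read, large_read)
```

    Under these assumptions the best copy depends on the query: a small read favors the nearby
    copy, while a large read favors the copy behind the faster link.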

    6. Why to use (DQP):

    Distributed data processing is becoming a reality. Businesses want to do it for many
    reasons, and they often must do it in order to stay competitive. Much of the
    infrastructure for distributed data processing is already there (e.g., modern network
    technology). Specifically, businesses are beginning to rely on distributed rather than
    centralized databases for the following reasons:

    1. Cost and scalability.

    Today, one thousand PC processors are cheaper and significantly more powerful than

    one big mainframe computer. So, it makes economic sense to replace a mainframe by

    a network of small, off-the-shelf processors. Furthermore, it is very difficult to up-size
    a mainframe computer if a company grows, while new PCs can be added to the
    network at any time in order to meet a company's new requirements. High availability
    can be achieved by mirroring (replicating) data.

    2. Integration of different software modules.

    It has become clear that no single software package can meet all the requirements of a
    company. Companies must, therefore, install several different packages, each
    potentially with its own database, and the result is a distributed database system. Even
    single software packages offered by one vendor have a distributed, component-based
    architecture so that the vendor can market and offer upgrades for every component
    individually.

    3. Integration of legacy systems.

    The integration of legacy systems is one particular example that demonstrates how
    some companies are forced to rely on distributed data processing in which their old
    legacy systems need to coexist with new modern systems.

    4. New applications.

    There are a number of new emerging applications that rely heavily on distributed
    database technology; examples are workflow management, computer-supported
    collaborative work, tele-conferencing, and electronic commerce.

    5. Market forces.

    Many companies are forced to reorganize their businesses and use state-of-the-art

    distributed information technology in order to remain competitive. As an example,

    people will probably not eat more Pizza because of the Internet, but a Pizza delivery

    service is definitely going to lose some of its market share if it does not allow people

    to order Pizza on the Web.

    However, a number of issues still make distributed data processing a complex undertaking:
    (1) distributed systems can become very large, involving thousands of
    heterogeneous sites including PCs and mainframe server machines; (2) the state of a
    distributed system changes rapidly because the load of sites varies over time and new
    sites are added to the system; (3) legacy systems need to be integrated; such legacy
    systems usually have not been designed for distributed data processing and now need
    to interact with other (modern) systems in a distributed environment.

    7. Where and How (DQP) works:

    In the field of data management tools, developments in distributed computing
    technologies have led to distributed database management systems. These systems
    should shield users from the complexities of distribution. Distributed query
    processing refers to the process that obtains an answer to a global query from
    distributed sources. A pre-existing local database can be used as one of the
    sources. This paper deals with the principles of query processing in distributed database
    systems and describes some specific architectural issues that enable the integration
    of pre-existing local databases into a distributed system.

    A distributed database is a collection of multiple, logically interrelated local
    databases distributed over a computer network. A distributed database
    management system (distributed DBMS) is a software system that permits the
    management of a distributed database and makes the distribution transparent to users.
    A distributed database together with a distributed DBMS is called a distributed
    database system. Every local database has its own exported scheme that describes
    the local data available to the system. An exported scheme should be a subset of a

    local scheme. At the global system level, exported schemes of all local databases are

    integrated into a global data scheme.

    Different authors differ in classifying distributed DBMSs. Usually the classification

    deals with three properties: autonomy, integration, and heterogeneity. Autonomy

    refers to the degree to which local DBMSs can operate independently. Integration

    describes the degree to which local databases are integrated into the global system,

    that is, whether any distribution is transparent to the user or not. Heterogeneity

    covers hardware, networking protocols, and local DBMS (including data model,

    query language, interface...) heterogeneity.

    8. QUERY PROCESSING

    We deal with queries over the relational data model. A distributed DBMS provides
    transparent access to distributed resources. There must be a module in the system
    architecture that receives a global query and manages its distributed evaluation. The whole
    process usually goes through the following steps:

    parsing the global query,

    query optimization,

    query execution.

    When parsing the global query, each global relation is replaced by an expression
    over local relations according to the global scheme. Then the query is simplified by
    eliminating redundant predicates. Finally, the query is transformed into a relational
    algebra expression. This intermediate expression of the query is called its canonical
    form. During the query optimization step, a distributed execution plan that obtains the
    answer is prepared. Although query languages are usually non-procedural, the
    execution plan gives the procedure for extracting the data. The execution plan says

    which local data are required, how to access them, and which operations must be done at
    which sites. Moreover, the execution plan should be optimized, that is, it should minimize
    the execution cost. Finally, the plan is executed in the query execution step.
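
    The three steps can be sketched as a toy pipeline in Python. The sketch makes several
    assumptions that are not part of the system described here: the global relation EMP is a
    plain union of two local relations, the only query form is a selection, and the only
    optimization is pushing that selection down to each site. All names and rows are hypothetical.

```python
# Hypothetical local data held at two sites.
SITES = {
    "site1": {"emp": [("SMITH", 20), ("KING", 10)]},
    "site2": {"emp": [("ALLEN", 30), ("FORD", 20)]},
}

# Assumed global scheme: each global relation is a union of local relations.
GLOBAL_SCHEME = {"EMP": [("site1", "emp"), ("site2", "emp")]}


def parse(global_relation, predicate):
    """Step 1: replace the global relation by an expression over local relations."""
    return {
        "op": "select",
        "predicate": predicate,
        "input": {"op": "union", "inputs": GLOBAL_SCHEME[global_relation]},
    }


def optimize(canonical):
    """Step 2: build a plan that applies the selection at each site,
    so that only matching rows have to be shipped."""
    return [
        {"site": site, "relation": relation, "predicate": canonical["predicate"]}
        for site, relation in canonical["input"]["inputs"]
    ]


def execute(plan):
    """Step 3: run each sub-query at its site and union the partial results."""
    result = []
    for step in plan:
        local_rows = SITES[step["site"]][step["relation"]]
        result.extend(row for row in local_rows if step["predicate"](row))
    return result


plan = optimize(parse("EMP", lambda row: row[1] == 20))
answer = execute(plan)
print(answer)
```

    The plan produced in step 2 states which local data are needed and at which site each
    operation runs, which is the role the execution plan plays in the description above.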

    9. ARCHITECTURE OF THE SYSTEM

    Every local database has its exported relational scheme that describes the local data
    accessible from the global system. The local exported scheme should be just a subset
    of the local scheme used by a local user. The exported schemes of all local databases are
    integrated into a global relational scheme. Every global relation is expressed by a
    relational algebra expression over relations from the exported schemes. This
    expression may be arbitrarily complicated, but the MINUS operator must not
    be used. The global user expresses queries over global relations only.

    There are four different functional units (processes) in the functional model of the
    system: the client, the distributed data server, the local database unit, and the partial
    query integrator. Each unit may be at a different site of the network. The client takes
    the user's query and presents the gathered partial results back to the user. After the
    client has accepted the user's query, it sends the query to the distributed data server and
    waits for the server's answer. The server's answer contains only the information about which
    sites will deliver partial results to the client. The client waits for these partial results and
    presents their union to the user. Notice that UNION is the only operation that the
    client must be able to execute. The distributed data server manages distributed
    query execution. It parses the query and generates the execution plan for extracting the
    required data from the local databases. The server informs the client about the plan and
    sends requests to the local database units and partial query integrators. Then the
    server waits and informs the client about any errors during execution.

    The local database unit and the partial query integrator are the only units that
    access the local data. They are based on the underlying local DBMSs. Each unit's
    process runs on the same site as the local DBMS and manages local database
    access and communication with other sites in the system. The local database unit
    accesses only the local data. The distributed data server's request encodes which local
    data are required and where to send them. The addressee of the acquired local
    data is either the client or a partial query integrator. In addition to the
    function of the local database unit, the partial query integrator integrates incoming
    partial results, and possibly local data, according to the distributed data server's requests.
    Again, the addressee of the results is either another partial query integrator or
    the client.

    Fig. 1 Functional model

    When any error emerges in the system during distributed execution, the
    distributed data server is notified immediately. The server then notifies the waiting
    client and aborts all related requests.
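
    The interplay of the units can be sketched in a single Python process. This is a simplified
    illustration under stated assumptions: the partial query integrator is omitted, direct method
    calls replace network messages, and the class names, function names, and rows are hypothetical.

```python
class LocalDatabaseUnit:
    """Accesses only the local data stored at its own site."""

    def __init__(self, site, rows, healthy=True):
        self.site, self.rows, self.healthy = site, rows, healthy

    def partial_result(self, predicate):
        if not self.healthy:
            raise RuntimeError(f"{self.site} is unavailable")
        return {row for row in self.rows if predicate(row)}


class DistributedDataServer:
    """Knows the local units and tells the client which sites will deliver results."""

    def __init__(self, units):
        self.units = {unit.site: unit for unit in units}

    def delivering_sites(self):
        return sorted(self.units)


def client_query(server, predicate):
    """Client: gather the partial results and present their UNION."""
    answer = set()
    try:
        for site in server.delivering_sites():
            answer |= server.units[site].partial_result(predicate)
    except RuntimeError as error:
        # Any error aborts the whole distributed execution.
        return None, str(error)
    return answer, None


rows1 = [("bolt", 5), ("nut", 2)]
rows2 = [("bolt", 5), ("gear", 9)]
server = DistributedDataServer(
    [LocalDatabaseUnit("site1", rows1), LocalDatabaseUnit("site2", rows2)])
answer, error = client_query(server, lambda row: row[1] >= 5)

broken = DistributedDataServer(
    [LocalDatabaseUnit("site1", rows1), LocalDatabaseUnit("site2", rows2, healthy=False)])
failed_answer, failed_error = client_query(broken, lambda row: row[1] >= 5)
print(answer, failed_error)
```

    In this sketch the client performs nothing but a set union, and a failure at one site makes
    the whole query fail, matching the abort behavior described above.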

    10. Some advantages of (DQP):

    The distribution of data in a network also offers advantages over the centralization of
    data at one computer. These advantages include improved throughput via parallel
    processing, sharing of data and equipment, and modular expansion of data
    management capacity. In addition, when redundant data is maintained, one also
    achieves increased data reliability and improved response time.

    Electronic market places and virtual enterprises have become very important
    applications for query processing. Building a scalable virtual business-to-business
    (B2B) market place with hundreds or thousands of participating suppliers requires
    highly flexible, distributed query processing capabilities. Architecting such an
    electronic market place as a data warehouse, by integrating all the data from all
    participating enterprises in one centralized data repository, incurs severe problems:

    Security and privacy violations: The participants of the market place have to
    relinquish control over their data and entrust sensitive information (e.g., pricing
    conditions) to the market place host.

    Coherence problems: The coherence of highly dynamic data, such as availability
    and shipping information, may be violated due to outdated materialized data in the
    market place's data warehouse.

    Schema integration problems: With the warehouse approach, all relevant data from
    all participants have to be converted a priori into the same format. Often, it would be
    easier to leave the data inside the participants' information systems (e.g., legacy
    systems) at the local sites, and apply particular local wrapper/transformer
    operations. This way, data is only converted on demand and the most recent coherent
    state of the data is returned.

    Fixed query operators: In a fully integrated (data warehouse-like) electronic market
    place, all information is converted into materialized data. This is often not desirable in
    complex applications such as electronic procurement/bidding. For example, in
    pricing offers one would like to have vastly different choices:

    fixed pricing via materialized data;

    operators which calculate the prices based on a multitude of local and global
    parameters (identity of the consumer company, availability, local plant utilization,
    subcontractor prices, etc.);

    human interaction during the processing of such complex e-procurement
    queries. In some participating enterprises the pricing could be done
    by a human via an interactive query operator.

    11. Running Example:

    We demonstrate the HyperQuery technique with a scenario of the car manufacturing

    industry. We assume a hierarchical supply chain of suppliers and sub-contractors. A

    typical e-procurement process to cover unscheduled production demands is
    to query a market place for these products and to select among the incoming offers by price,
    terms of delivery, available quantity, etc. The price of the needed products can vary
    with customer/supplier-specific sales discounts, the quantity of materials to be provided,
    duties, plant utilization, etc. Thus the price cannot be a materialized attribute as in
    traditional query processing systems. Instead, it is an individually calculated,
    dynamically changing attribute, and a hyperlink to the supplier is included, through which
    the price is computed on demand.

    In traditional distributed query processing systems such a query can only be executed
    if a global schema exists or all local databases are replicated at the market place.
    In an environment where hundreds of suppliers participate in a market
    place, one global query that integrates the sub-queries for all participants would be
    too complex and error-prone; for example, if one supplier's host is down, the whole query
    execution would fail. Following our approach, the suppliers register their
    products at the market place in which they want to participate, and specify the
    sub-plans by which the price information can be computed at their sites. This calculation can
    be arbitrarily complex and may involve their subcontractors, too. The allocation schema given
    by the data at the market place is exploited for execution.

    Figure 1 shows an SQL-like query that returns the prices and suppliers of all needed
    products. Execution stops at the latest at the time given by the expires
    attribute, and only the results gathered so far are considered. Figure 2 shows two possible
    execution traces of this query; both are supported by our evaluation technique. In the
    hierarchical execution of Figure 2(a) the resulting objects flow back to the sites
    where the original input objects came from, whereas in the broadcast execution of

    Figure 2(b) the objects do not flow all the way through intermediates back to the
    client, but are routed directly to the client that issued the query.

    select p.ProductDescription, c.Supplier, c.AdditionalData, c.Price

    from NeededProducts p, Catalog@MarketPlace c

    where p.ProductDescription = c.ProductDescription

    order by p.ProductDescription, c.Price

    expires Friday, May 18, 2001 5:00:00 PM CET

    Figure 1: Example Query of the Car Manufacturer
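
    The effect of the expires attribute and of prices computed on demand can be illustrated with
    a short Python sketch. It is not the HyperQuery implementation: the catalog layout, the
    supplier names, the price functions, and the simulated clock are all assumptions chosen to
    keep the example small and deterministic.

```python
def gather_offers(needed, catalog, expires_at, clock):
    """Collect (product, supplier, price) offers until the deadline is reached.

    catalog entries are (product, supplier, compute_price); compute_price stands
    in for the sub-plan that a supplier evaluates on demand at its own site.
    """
    offers = []
    for product, supplier, compute_price in catalog:
        if clock() >= expires_at:
            break  # deadline reached: keep only the results gathered so far
        if product in needed:
            offers.append((product, supplier, compute_price()))
    # Mirrors "order by p.ProductDescription, c.Price".
    return sorted(offers, key=lambda offer: (offer[0], offer[2]))


# Hypothetical catalog registered at the market place.
catalog = [
    ("brake", "SupplierA", lambda: 120.0),
    ("brake", "SupplierB", lambda: 95.0),
    ("tyre", "SupplierC", lambda: 60.0),
    ("brake", "SupplierD", lambda: 80.0),
]

# Simulated clock that advances by one tick per call, so the run is repeatable.
ticks = iter(range(100))
offers = gather_offers({"brake", "tyre"}, catalog, expires_at=3,
                       clock=lambda: next(ticks))
print(offers)
```

    Because the deadline is reached before the fourth catalog entry is visited, SupplierD's
    lower price never enters the result, which shows the tradeoff that a bounded execution time
    implies.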

    12. Summary:

    A distributed database (DDB) is a collection of multiple, logically interrelated databases
    distributed over a computer network. The retrieval of data from different sites in a DDB is
    referred to as distributed query processing.

    Query processing is much more difficult in a distributed environment than in a centralized
    environment because: 1) a large number of parameters affect the performance of
    distributed queries; 2) relations involved in a distributed query may be fragmented and/or
    replicated; 3) with many sites to access, query response time may become very high.

    Businesses are beginning to rely on distributed rather than centralized databases for the
    following reasons: cost and scalability, integration of different software modules,
    integration of legacy systems, new applications, and market forces.

    A distributed database management system (distributed DBMS) is
    a software system that permits the management of a distributed
    database and makes the distribution transparent to users. A distributed
    database together with a distributed DBMS is called a distributed
    database system. Different authors differ in classifying distributed
    DBMSs. Usually the classification deals with three properties: autonomy,
    integration, and heterogeneity. Distributed query processing goes
    through three steps: parsing the global query, query optimization, and
    query execution.

    Some advantages of (DQP): improved throughput via parallel
    processing, sharing of data and equipment, and modular expansion of
    data management capacity. In addition, when redundant data is

    maintained, one also achieves increased data reliability and improved

    response time.

    Electronic market places and virtual enterprises have become very important
    applications for query processing. Building a scalable virtual business-to-business
    (B2B) market place with hundreds or thousands of participating suppliers requires
    highly flexible, distributed query processing capabilities. Architecting such an
    electronic market place as a data warehouse, by integrating all the data from all
    participating enterprises in one centralized data repository, incurs severe problems:

    Security and privacy violations

    Coherence problems

    Schema integration problems

    Fixed query operators

    13. References:

    All the content of this paper is taken from papers found on the Internet about the subject.
    The following are the file names of these papers:

    p-1107-10619292.pdf .

    p422-kossmann.pdf .

    vldb2001.hperqueries.pdf .

    papadimos2003 .

    doctorweek .

    distributed_query.pdf .