Center for E-Business Technology Seoul National University Seoul, Korea Optimization of Multi-Domain Queries on the Web Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi Dipartimento di Elettronica e Informazione – Politecnico di Milano VLDB 2008 2009. 02. 19. Presented by Babar Tareen, IDS Lab., Seoul National University Based on Conference Presentation



Center for E-Business Technology

Seoul National University

Seoul, Korea

Optimization of Multi-Domain Queries on the Web

Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi

Dipartimento di Elettronica e Informazione – Politecnico di Milano

VLDB 2008

2009. 02. 19.

Presented by Babar Tareen, IDS Lab., Seoul National University

Based on Conference Presentation


Copyright 2008 by CEBT

Multi-Domain Queries

Queries that can be answered by combining knowledge from two or more domains

Example

Where can I attend an interesting database workshop close to a sunny beach?

Who are the strongest experts on service computing based upon their recent publication record and accepted European projects?

Can I spend an April weekend in a city that is served by a low-cost direct flight from Milano and offers a Mahler symphony?


Intro

General-purpose search engines (e.g. Yahoo, Google)

Very large search space, yet

Not able to index deep Web data

Domain-specific search engines (e.g. an airline’s flight search form, Amazon’s book search facility)

Typically of high quality, but

Limited to restricted domains

We lack the ability to answer multi-domain queries


Scenario: a multi-domain query

In general: “Given a query over a set of services, find the query plan that minimizes the expected execution cost according to a given metric in order to obtain the best k answers.”

• Reference query:

– “Find all database conferences in the next six months in locations where the average temperature is at least 28°C and for which a cheap travel solution including a luxury accommodation exists.”

• Answering this query requires:

– Finding interesting conferences in the desired timeframe via online services by the scientific community;

– Understanding whether the conference location is served by low-cost flights;

– Finding luxury hotels close to the conference location with available rooms; and

– Checking the expected average temperature of the location


Overall Picture


Preliminaries – (1)

Characteristics of information sources (services)

Search services: return answers in ranking order

Exact services: return indistinguishable tuples (no ranking)

Services have access patterns

– Combinations of input and output parameters corresponding to different ways of invocation


Preliminaries – (2)

Characteristics of information sources (services)

Expected result size per invocation (ERSPI):

– proliferative services (ERSPI > 1)

– selective services (0 ≤ ERSPI ≤ 1)

Chunking/paging of result sets: bulk vs. chunked services

Joins

Can be considered system services

ERSPI: determined by the selectivity of the join condition and the ERSPIs of the joined services

– Product of the ERSPI values of the services multiplied by the selectivity of the join condition
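The join rule above can be written out as a short calculation. This is a minimal sketch: the function names `join_erspi` and `classify` and the numbers are hypothetical, chosen only to illustrate the product rule and the proliferative/selective distinction.

```python
from math import prod


def join_erspi(service_erspis, join_selectivity):
    """ERSPI of a join: product of the input ERSPIs times the join selectivity."""
    if not 0.0 <= join_selectivity <= 1.0:
        raise ValueError("join selectivity must lie in [0, 1]")
    return prod(service_erspis) * join_selectivity


def classify(erspi):
    """Proliferative if ERSPI > 1, selective if 0 <= ERSPI <= 1."""
    if erspi < 0:
        raise ValueError("ERSPI cannot be negative")
    return "proliferative" if erspi > 1 else "selective"


# Hypothetical numbers: two services returning 20 and 5 tuples per call,
# joined with a condition that keeps 5% of the candidate pairs.
estimate = join_erspi([20, 5], 0.05)
print(estimate, classify(estimate))
```

Under these assumed values the join itself behaves like a proliferative service, so it can be placed in a plan using the same reasoning as for any other service.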


Preliminaries – (3)

Query plan: indicates the invocations of services and their conjunctive composition through joins

Represented as directed acyclic graphs (DAGs)

Nodes = atoms in the conjunctive query (service, join)

Arcs = precedence constraints + data flows

Joins: join strategy + number of fetches per service
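A query plan of this kind can be held in a plain predecessor map. The sketch below is illustrative only: the service names come from the reference query, but the plan shape and the helper `execution_stages` are assumptions, not the plan shown on the slides. It uses the standard-library `graphlib` module (Python 3.9+) to reject cyclic plans and to group nodes into stages whose members have no precedence constraint between them.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: each node maps to the set of nodes that must run before it.
plan = {
    "conference": set(),
    "weather": {"conference"},
    "flight": {"weather"},
    "hotel": {"weather"},
    "join_flight_hotel": {"flight", "hotel"},
}


def execution_stages(predecessors):
    """Group plan nodes into stages; nodes within one stage are independent."""
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()  # raises CycleError if the plan is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(execution_stages(plan), start=1):
    print(number, stage)
```

In this assumed plan, `flight` and `hotel` end up in the same stage, which corresponds to two unconnected nodes feeding a join.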


Directed Acyclic Graph


Preliminaries – (4)

Cost metrics: associate a cost to a plan

Sum cost metric = sum of the costs of each operator

Execution time metric = expected time from query input to result output

Request-response cost metric = special case of the sum cost metric where each invocation has a cost of 1
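The three metrics can be compared on a toy plan. This is a sketch under stated assumptions: the `Operator` fields, the numbers, and in particular the reading of execution time as the longest path through the plan (with the calls of one operator issued one after another) are illustrative choices, not definitions taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    invocations: int       # expected number of calls to this operator
    cost_per_call: float   # abstract cost charged per call
    time_per_call: float   # expected response time per call, in seconds


def sum_cost(ops):
    """Sum cost metric: add up the cost of every operator."""
    return sum(op.invocations * op.cost_per_call for op in ops)


def request_response_cost(ops):
    """Special case of the sum cost metric: every invocation costs 1."""
    return sum(op.invocations for op in ops)


def execution_time(ops, predecessors):
    """Assumed model: finish time of the slowest path from input to output."""
    by_name = {op.name: op for op in ops}
    finish = {}

    def finish_time(name):
        if name not in finish:
            op = by_name[name]
            start = max((finish_time(p) for p in predecessors.get(name, ())),
                        default=0.0)
            finish[name] = start + op.invocations * op.time_per_call
        return finish[name]

    return max(finish_time(op.name) for op in ops)


# Hypothetical operators: flight and hotel both depend on conference.
ops = [
    Operator("conference", 1, 1.0, 2.0),
    Operator("flight", 4, 2.0, 1.5),
    Operator("hotel", 4, 0.5, 0.5),
]
deps = {"flight": {"conference"}, "hotel": {"conference"}}
print(sum_cost(ops), request_response_cost(ops), execution_time(ops, deps))
```

The example shows why the metrics can rank plans differently: running `flight` and `hotel` side by side does not reduce the sum cost, but it does shorten the assumed execution time.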


Optimization Approach

Exploring a highly combinatorial solution space

1st Phase: selection of a query rewriting such that every service is called with one of its available access patterns

2nd Phase: selection of query plan

3rd Phase: assignment of the exact number of fetches to be performed over chunked services


Services, access patterns, queries

Web services and access patterns:

• The example query (in Datalog-like syntax):

Services with alternative access patterns


Query plans

Representation as DAGs

Placing a node = invoking the respective service/join

Two nodes connected by an arc = sequential execution

Two nodes without connection = parallel execution

Graphical notation (note the parallel vs. pipe join):


Join strategies for parallel joins

Nested loop: one service “dominates” the other

Merge-scan: no a-priori distinction of services
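One way to picture the two strategies is as different orders for visiting pairs of result chunks from the two services. The sketch below is an interpretation, not the definition from the paper: it assumes that nested loop holds one chunk of the dominating service while draining the other service, and that merge-scan alternates between the services so that chunk pairs are visited diagonal by diagonal. The function names are hypothetical.

```python
def nested_loop(dominant_chunks, other_chunks):
    """Assumed order: for each chunk of the dominating service, scan all
    chunks of the other service before moving on."""
    for i in range(dominant_chunks):
        for j in range(other_chunks):
            yield (i, j)


def merge_scan(chunks_a, chunks_b):
    """Assumed order: treat both services alike and visit chunk pairs along
    diagonals, so that top-ranked chunks of both are combined first."""
    for diagonal in range(chunks_a + chunks_b - 1):
        for i in range(diagonal + 1):
            j = diagonal - i
            if i < chunks_a and j < chunks_b:
                yield (i, j)


print(list(nested_loop(2, 3)))
print(list(merge_scan(2, 3)))
```

Both generators visit every chunk pair exactly once; they differ only in which pairs are seen early, which matters when execution stops after the best k answers.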


Annotated query plans

In order to estimate the number of tuples in output, we further need to know:

The number of tuples in output of each service

The number of fetches for each chunked service

The join strategy for each parallel join

The final annotation is the output of the optimization
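The annotation can be thought of as a small record per node from which output sizes are propagated. The following is a rough sketch, assuming a purely sequential plan in which every output tuple of one service triggers one invocation of the next; the class `ServiceNode`, its fields, and the numbers are hypothetical, and join strategies are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceNode:
    name: str
    erspi: float = 1.0                # used for bulk (non-chunked) services
    chunk_size: Optional[int] = None  # set only for chunked services
    fetches: int = 1                  # number of chunks fetched per invocation

    def tuples_per_invocation(self):
        """Bulk services yield ERSPI tuples; chunked ones yield whole chunks."""
        if self.chunk_size is None:
            return self.erspi
        return float(self.chunk_size * self.fetches)


def annotate_pipeline(nodes, input_tuples=1.0):
    """Return the estimated number of output tuples after each node."""
    annotation = {}
    tuples = input_tuples
    for node in nodes:
        tuples *= node.tuples_per_invocation()
        annotation[node.name] = tuples
    return annotation


# Hypothetical pipeline: a chunked conference search followed by a
# selective weather check.
pipeline = [
    ServiceNode("conference", chunk_size=10, fetches=2),
    ServiceNode("weather", erspi=0.25),
]
print(annotate_pipeline(pipeline))
```

Changing `fetches` on the chunked node changes every downstream estimate, which is why the number of fetches has to be fixed before the plan's output size can be known.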


Instrumented branch and bound

Possible service combinations:

Not feasible: City would need to be an input parameter to the query!

α1 has more input fields than α2

Access pattern selection

Heuristic: “Bound is better” = the more input fields in the access pattern, the better

Query plan selection

Heuristic: “Selective and parallel are better” = selective services in series (with increasing ERSPI) and proliferative services in parallel

Chunked service selection

Heuristic: “Greedy and square are better” = either we increment the number of fetches to chunked services individually (greedy) or together (square)
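The three heuristics can each be reduced to a few lines. This is a minimal sketch under several assumptions: access patterns are encoded as strings of `i` (input) and `o` (output), fetch assignments are tuples of integers, and all function names are hypothetical. It shows only how each heuristic proposes candidates, not the branch and bound search that evaluates them.

```python
def bound_is_better(access_patterns):
    """Prefer the access pattern with the most input ('i') fields."""
    return max(access_patterns, key=lambda pattern: pattern.count("i"))


def selective_and_parallel(erspi_by_service):
    """Selective services go in series by increasing ERSPI; proliferative
    services are returned as a group to be placed in parallel."""
    selective = sorted(
        (name for name, erspi in erspi_by_service.items() if erspi <= 1),
        key=lambda name: erspi_by_service[name],
    )
    proliferative = sorted(
        name for name, erspi in erspi_by_service.items() if erspi > 1
    )
    return selective, proliferative


def next_fetch_vectors(fetches):
    """Greedy: raise one chunked service at a time; square: raise all together."""
    greedy = [
        fetches[:i] + (count + 1,) + fetches[i + 1:]
        for i, count in enumerate(fetches)
    ]
    square = tuple(count + 1 for count in fetches)
    return greedy, square


print(bound_is_better(["ooo", "ioo", "iio"]))
print(selective_and_parallel({"weather": 0.3, "conference": 20, "flight": 5}))
print(next_fetch_vectors((1, 2)))
```

Each function only narrows or orders the choices; a cost metric is still needed to decide which candidate is kept.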


Final annotation of query plan

Execution time cost metric:

Service characterization:

Fetching factors:

Annotated query plan


Query execution

Execution environment

Service registration: signature, patterns, ERSPI, response times, chunk sizes, indication of join strategy, ...

Service orchestration: query execution

Multi-threading: to leverage parallelism

Logical caching (speed + elimination of duplicates)

No cache = each call individually repeated

One-call cache = caching of the last call to each service

Optimal cache = all calls to all services are cached
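The effect of the three cache settings can be seen by counting how many calls actually reach a service. The sketch below assumes a call is identified by a `(service, arguments)` pair; the function name, the policy labels, and the sample call sequence are hypothetical.

```python
def count_remote_calls(calls, policy):
    """Count calls that reach a service under a given cache setting.

    calls  : sequence of (service, arguments) pairs with hashable arguments
    policy : "none", "one-call", or "optimal"
    """
    if policy == "none":
        return len(calls)  # every call is repeated individually
    if policy == "one-call":
        last_call = {}     # remembers only the latest call per service
        remote = 0
        for service, arguments in calls:
            if service not in last_call or last_call[service] != arguments:
                remote += 1
                last_call[service] = arguments
        return remote
    if policy == "optimal":
        return len(set(calls))  # each distinct call happens exactly once
    raise ValueError(f"unknown cache policy: {policy}")


# Hypothetical call sequence produced while executing a plan.
calls = [
    ("weather", "Rome"),
    ("weather", "Rome"),
    ("weather", "Paris"),
    ("weather", "Rome"),
    ("hotel", "Rome"),
]
for policy in ("none", "one-call", "optimal"):
    print(policy, count_remote_calls(calls, policy))
```

The one-call cache only helps when identical calls arrive back to back, so its benefit depends on the order in which the plan issues requests.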


Number of calls under varying cache settings


Results of the optimal plan

Screenshot of the prototype query engine


Conclusion

In this work, we have

defined a formal model for the optimization of multi-domain queries over web services (conjunctive queries)

defined query plans similar to relational physical access plans

derived an optimization technique based on a classical branch and bound technique

given experimental evidence that the proposed model fits real-world settings (existing web services and wrapped ones)

Next

Generic query engine + declarative representation of query plans

User interface for the mashup of services/queries


Discussion

Very Simple Experimental Setup

No details about Semi-automatically generated Wrappers

How to decide which service to select for a specific domain?

How to map input and output parameters between different services?

If we have to pre-program the system for new domains, it is like developing a special purpose application

How effective is the system for answering Multi-Domain Queries?
