Center for E-Business Technology
Seoul National University
Seoul, Korea
Optimization of Multi-Domain Queries on the Web
Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi
Dipartimento di Elettronica e Informazione – Politecnico di Milano
VLDB 2008
2009. 02. 19.
Presented by Babar Tareen, IDS Lab., Seoul National University
Based on Conference Presentation
Copyright 2008 by CEBT
Multi-Domain Queries
Queries that can be answered by combining knowledge from two or more domains
Example
Where can I attend an interesting database workshop close to a sunny beach?
Who are the strongest experts on service computing based upon their recent publication record and accepted European projects?
Can I spend an April week-end in a city served by a low-cost direct flight from Milano offering a Mahler's symphony?
Intro
General-purpose search engines (e.g. Yahoo, Google)
Very large search space, yet
Not able to index deep Web data
Domain-specific search engines (e.g. an airline’s flight search form, Amazon’s book search facility)
Typically of high quality, but
Limited to restricted domains
We lack the ability to answer multi-domain queries
In general: “Given a query over a set of services, find the query plan that minimizes the expected execution cost according to a given metric in order to obtain the best k answers.”
Scenario: a multi-domain query
• Reference query:
– “Find all database conferences in the next six months in locations where the average temperature is at least 28 °C and for which a cheap travel solution including a luxury accommodation exists.”
• Answering this query requires:
– Finding interesting conferences in the desired timeframe via online services by the scientific community;
– Understanding whether the conference location is served by low-cost flights;
– Finding luxury hotels close to the conference location with available rooms; and
– Checking the expected average temperature of the location
Overall Picture
Preliminaries – (1)
Characteristics of information sources (services)
Search services: return answers in ranking order
Exact services: indistinguishable tuples (no ranking)
Services have access patterns
– Combination of input and output parameters corresponding to different ways of invocation
Preliminaries – (2)
Characteristics of information sources (services)
Expected result size per invocation (ERSPI):
– proliferative services (ERSPI > 1)
– selective services (0 ≤ ERSPI ≤ 1)
Chunking/paging of result sets: bulk vs. chunked services
Joins
Can be considered system services
ERSPI: depends on the selectivity of the join condition and the ERSPIs of the joined services
– Product of the ERSPI values of the services multiplied by the selectivity of the join condition
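The join ERSPI rule above can be sketched as follows. This is a minimal illustration: the function names and the numbers are assumptions made for the example, not values from the paper.

```python
from math import prod

def join_erspi(service_erspis, selectivity):
    """ERSPI of a join: product of the joined services' ERSPIs
    multiplied by the selectivity of the join condition."""
    return prod(service_erspis) * selectivity

def classify(erspi):
    """Proliferative if ERSPI > 1, selective if 0 <= ERSPI <= 1."""
    return "proliferative" if erspi > 1 else "selective"

# Hypothetical figures: a conference service returning 20 tuples per call,
# a hotel service returning 5, and a join condition with selectivity 0.01.
joined = join_erspi([20.0, 5.0], 0.01)
print(joined, classify(joined))
```

With these assumed figures the join keeps about one tuple per invocation, so it behaves as a selective node even though both joined services are proliferative.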
Preliminaries – (3)
Query plan: indicates the invocations of services and their conjunctive composition through joins
Represented as directed acyclic graphs (DAGs)
Nodes = atoms in the conjunctive query (service, join)
Arcs = precedence constraints + data flows
Joins: join strategy + number of fetches per service
Directed Acyclic Graph
Preliminaries – (4)
Cost metrics: associate a cost to a plan
Sum cost metric = sum of the costs of each operator
Execution time metric = expected time from query input to result output
Request-response cost metric = special case of sum cost metric where each invocation has a cost of 1
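The three metrics can be compared on a toy plan. The sketch below is a simplification under stated assumptions: the `Node` class, the per-call costs and times, and the call counts are all hypothetical, and execution time is approximated as the longest path through the DAG with the calls to one node issued back to back.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    calls: int             # expected number of invocations (assumed known)
    cost_per_call: float   # hypothetical cost of one invocation
    time_per_call: float   # hypothetical response time of one invocation
    preds: tuple = ()      # names of nodes that must finish first

def sum_cost(plan):
    """Sum cost metric: add up the cost of every operator."""
    return sum(n.calls * n.cost_per_call for n in plan)

def request_response_cost(plan):
    """Special case of the sum cost metric: each invocation costs 1."""
    return sum(n.calls for n in plan)

def execution_time(plan):
    """Simplified execution time: longest path from input to output."""
    by_name = {n.name: n for n in plan}
    finish = {}
    def done(name):
        if name not in finish:
            n = by_name[name]
            start = max((done(p) for p in n.preds), default=0.0)
            finish[name] = start + n.calls * n.time_per_call
        return finish[name]
    return max(done(n.name) for n in plan)

# Hypothetical plan: weather and hotel run in parallel after conference.
plan = [
    Node("conference", calls=1, cost_per_call=2.0, time_per_call=1.0),
    Node("weather", calls=3, cost_per_call=1.0, time_per_call=0.5, preds=("conference",)),
    Node("hotel", calls=3, cost_per_call=1.5, time_per_call=2.0, preds=("conference",)),
]
print(sum_cost(plan), request_response_cost(plan), execution_time(plan))
```

The parallel branches add up under the sum cost metric, but only the slower branch contributes to the execution time.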
Optimization Approach
Exploring a highly combinatorial solution space
1st Phase: selection of a given query rewriting such that every service is called with one of its available access patterns
2nd Phase: selection of a query plan
3rd Phase: assignment of the exact number of fetches to be performed over chunked services
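The feasibility condition behind the first phase can be sketched as a simple binding check: a choice of access patterns is usable only if every input field is bound either by the query itself or by the output of a service invoked earlier. The function name, field names, and patterns below are hypothetical and serve only as an illustration.

```python
def feasible_order(patterns, query_inputs):
    """patterns maps a service name to (input_fields, output_fields).
    Returns an invocation order in which every input is bound,
    or None if no such order exists."""
    bound = set(query_inputs)
    remaining = dict(patterns)
    order = []
    while remaining:
        # Services whose inputs are all bound can be invoked now.
        ready = [s for s, (ins, _) in remaining.items() if ins <= bound]
        if not ready:
            return None  # some input can never be bound
        for s in ready:
            _, outs = remaining.pop(s)
            bound |= outs
            order.append(s)
    return order

# Hypothetical access patterns for two services.
ok = {
    "conference": ({"topic"}, {"city", "date"}),
    "weather": ({"city"}, {"temperature"}),
}
bad = {
    "conference": ({"city"}, {"topic", "date"}),
    "weather": ({"city"}, {"temperature"}),
}
print(feasible_order(ok, {"topic"}))
print(feasible_order(bad, {"topic"}))
```

In the second choice no service can ever bind `city`, so the combination is rejected unless the city is supplied as an input of the query.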
Services, access patterns, queries
Web services and access patterns:
• The example query (in Datalog-like syntax):
Services with alternative access patterns
Query plans
Representation as DAGs
Placing a node = invoking the respective service/join
Two nodes connected by an arc = sequential execution
Two nodes without connection = parallel execution
Graphical notation (note the parallel vs. pipe join):
Join strategies for parallel joins
Nested loop: one service “dominates” the other
Merge-scan: no a-priori distinction of services
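One plausible reading of the two strategies is in terms of the order in which pairs of result chunks from the two services are combined. The sketch below is an interpretation, not the authors' definition: it assumes that in a nested loop the dominating service drives the outer loop, and that merge-scan walks the chunk pairs diagonally so that neither service is favored. Function names are hypothetical.

```python
def nested_loop_order(outer_chunks, inner_chunks):
    """Assumed nested loop: for each chunk of the dominating (outer)
    service, scan every chunk of the other service."""
    return [(i, j) for i in range(outer_chunks) for j in range(inner_chunks)]

def merge_scan_order(x_chunks, y_chunks):
    """Assumed merge-scan: visit chunk pairs diagonal by diagonal,
    advancing on both services at a similar pace."""
    pairs = [(i, j) for i in range(x_chunks) for j in range(y_chunks)]
    return sorted(pairs, key=lambda p: (p[0] + p[1], p[0]))

print(nested_loop_order(2, 3))
print(merge_scan_order(2, 3))
```

Under this reading, merge-scan reaches pairs of early chunks from both services first, which matters for search services whose best-ranked results arrive in the first chunks.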
Annotated query plans
In order to estimate the number of tuples in output, we further need to know:
The number of tuples in output of each service
The number of fetches for each chunked service
The join strategy for each parallel join
The final annotation is the output of the optimization
Instrumented branch and bound
Possible service combinations:
Not feasible: City would need to be an input parameter to the query!
α1 has more input fields than α2
Access pattern selection
Heuristic: “Bound is better” = the more input fields in the access pattern, the better
Query plan selection
Heuristic: “Selective and parallel are better” = selective services in series (with increasing ERSPI) and proliferative services in parallel
Chunked service selection
Heuristic: “Greedy and square are better” = either we increment the number of fetches to chunked services individually (greedy) or together (square)
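The difference between the greedy and the square increments can be sketched as follows. Everything here is an illustrative assumption: the output estimate (product of fetched tuples times a join selectivity), the tie-breaking rule, the service names, and the numbers are not taken from the paper.

```python
from math import prod

def estimated_results(fetches, chunk_sizes, selectivity):
    """Assumed estimate: tuples fetched per service, multiplied together
    and scaled by the join selectivity."""
    return prod(fetches[s] * chunk_sizes[s] for s in fetches) * selectivity

def square_step(fetches):
    """Square: increment the fetches of all chunked services together."""
    return {s: f + 1 for s, f in fetches.items()}

def greedy_step(fetches, chunk_sizes, selectivity):
    """Greedy: increment the single service that raises the estimate most
    (the first service wins ties)."""
    best_est, best = None, None
    for s in fetches:
        cand = dict(fetches)
        cand[s] += 1
        est = estimated_results(cand, chunk_sizes, selectivity)
        if best_est is None or est > best_est:
            best_est, best = est, cand
    return best

def assign_fetches(chunk_sizes, selectivity, k, strategy):
    """Increase fetches until the estimated output reaches k answers."""
    fetches = {s: 1 for s in chunk_sizes}
    while estimated_results(fetches, chunk_sizes, selectivity) < k:
        if strategy == "square":
            fetches = square_step(fetches)
        else:
            fetches = greedy_step(fetches, chunk_sizes, selectivity)
    return fetches

sizes = {"flight": 10, "hotel": 2}  # hypothetical chunk sizes
print(assign_fetches(sizes, 0.5, 60, "greedy"))
print(assign_fetches(sizes, 0.5, 60, "square"))
```

With these assumed numbers the greedy variant stops at five fetches in total while the square variant needs six, at the price of evaluating more candidate assignments per step.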
Final annotation of query plan
Execution time cost metric:
Service characterization:
Fetching factors:
Annotated query plan
Query execution
Execution environment
Service registration: signature, patterns, ERSPI, response times, chunk sizes, indication of join strategy, ...
Service orchestration: query execution
Multi-threading: to leverage parallelism
Logical caching (speed + elimination of duplicates)
No cache = each call individually repeated
One-call cache = caching of the last call to each service
Optimal cache = all calls to all services are cached
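The effect of the three cache settings on the number of service calls can be sketched with a small counter. The request sequence, service names, and function name are hypothetical; the policies follow the definitions listed above.

```python
def count_calls(requests, policy):
    """requests is a list of (service, input) pairs in invocation order.
    policy is one of "none", "one-call", or "optimal".
    Returns how many requests actually reach a service."""
    calls = 0
    last = {}     # most recent input per service (one-call cache)
    seen = set()  # every (service, input) pair so far (optimal cache)
    for service, arg in requests:
        if policy == "one-call" and service in last and last[service] == arg:
            continue  # same as the last call to this service
        if policy == "optimal" and (service, arg) in seen:
            continue  # already called at some point
        calls += 1
        last[service] = arg
        seen.add((service, arg))
    return calls

# Hypothetical sequence of invocations produced by a query plan.
requests = [
    ("weather", "Rome"), ("weather", "Rome"), ("weather", "Paris"),
    ("weather", "Rome"), ("hotel", "Rome"),
]
for policy in ("none", "one-call", "optimal"):
    print(policy, count_calls(requests, policy))
```

The one-call cache only removes immediately repeated calls, so the later request for Rome after Paris still reaches the service, whereas the optimal cache answers it locally.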
# of calls under varying cache settings
Results of the optimal plan
Screenshot of the prototype query engine
Conclusion
In this work, we have
defined a formal model for the optimization of multi-domain queries over web services (conjunctive queries)
defined query plans similar to relational physical access plans
derived an optimization technique based on a classical branch and bound technique
given experimental evidence that the proposed model fits real-world settings (existing web services and wrapped ones)
Next
Generic query engine + declarative representation of query plans
User interface for the mashup of services/queries
Discussion
Very Simple Experimental Setup
No details about semi-automatically generated wrappers
How to decide which service to select for a specific domain?
How to map input/output parameters between different services?
If we have to pre-program the system for new domains, it is like developing a special purpose application
How effective is the system for answering Multi-Domain Queries?