Page 1: [IEEE 2008 9th IEEE/ACM International Conference on Grid Computing (GRID) - Tsukuba (2008.09.29-2008.10.1)] 2008 9th IEEE/ACM International Conference on Grid Computing - Service-based

Service-based Data Integration using OGSA-DQP and OGSA-WebDB

Steven Lynden, Said Mirza Pahlevi and Isao Kojima
National Institute of Advanced Industrial Science and Technology (AIST), Japan


Abstract

OGSA-DQP is a service-based distributed query processor that is able to execute queries over data services and combine data integration with data analysis by invoking Web services. OGSA-DQP currently supports only one type of data source, relational databases wrapped using OGSA-DAI (a middleware tool that exposes XML or relational database management systems as Grid services). OGSA-WebDB is another middleware tool based on OGSA-DAI that exposes Web databases via the OGSA-DAI interface. The prevalence of XML-encoded data and Web-accessible resources means that it is desirable to extend the current functionality of OGSA-DQP to provide support for such resources. This paper presents an extension to OGSA-DQP that allows queries over relational, XML and Web databases wrapped by OGSA-DAI and OGSA-WebDB. An application is presented that illustrates how these features complement each other to provide data integration and analysis in service-based Grids. Experimental results are presented that investigate the benefit of the approach within the application.

1 Introduction

Data access and integration are important requirements of many service-based Grid applications. Work in this area exists in the form of developed specifications [1] and various middleware. OGSA-DAI [www.ogsadai.org.uk] is one such middleware component that aims to provide an implementation of data access and integration capabilities in service-based Grids. Building on top of OGSA-DAI is OGSA-DQP [www.ogsadai.org.uk/dqp], which provides Grid-based distributed query processing [9] over multiple OGSA-DAI-wrapped databases. OGSA-DQP supports distributed queries over OGSA-DAI-wrapped relational data sources that can be executed in parallel using a distributed architecture based on a set of evaluation services. This is enhanced by support for combining data access with data analysis by utilising analysis services, encapsulated by Web services, that are invoked by OGSA-DQP.

In this paper it is demonstrated how OGSA-DQP can be a more effective tool with the addition of the following two capabilities: (i) the ability to query OGSA-WebDB [http://dbgrid.org/OGSA-WebDB] wrapped data sources; (ii) the ability to query OGSA-DAI-wrapped XML data sources, construct XML results from relational data, and construct relational results from XML data. An application is presented that combines data integration and analysis and illustrates the potential benefits of OGSA-DQP: there is no need to write any data integration code, and services are orchestrated automatically by OGSA-DQP.

The remainder of this paper is structured as follows. Section 2 discusses the topic of data integration in service-based Grids and introduces OGSA-DAI, OGSA-DQP and OGSA-WebDB. Section 3 describes the extensions to OGSA-DQP that have been implemented to support Web databases and XML data. Section 4 presents the application scenario with which the effectiveness of the extensions is demonstrated, together with an analysis of the performance of the system. Related work is discussed in Section 5, and Section 6 presents some concluding remarks.

2 Data Integration in Service-based Grids

Two key challenges inherent to data integration in service-based Grids are: how to expose data resources in a consistent way that enables interoperability between different applications, organisations, infrastructures, etc.; and how to make use of the capabilities offered by service-based Grids in order to facilitate data integration. A standardisation effort has been undertaken by the DAIS Working Group of the Open Grid Forum (OGF) (http://www.ogf.org), which has defined interfaces for accessing databases in a Grid setting [1].

978-1-4244-2579-2/08/$20.00 © 2008 IEEE, 9th Grid Computing Conference


The specifications produced as a result of this effort address issues related to the design of the interfaces used to expose and access data resources, while the provision of middleware for rapid deployment and for building client applications is addressed by OGSA-DAI. OGSA-WebDB and OGSA-DQP, which build upon the capabilities offered by service-based Grid infrastructures and OGSA-DAI to facilitate data integration, are now described in more detail.

2.1 OGSA-DAI, OGSA-DQP and OGSA-WebDB

OGSA-DAI is a middleware toolkit that enables data resources (such as relational databases, XML databases and files) to be easily made available via a service-based interface. Clients interacting with an OGSA-DAI data service are able to utilise various capabilities associated with service-based Grids such as security and resource management, the interfaces to which are consistent with Grid standards. OGSA-DAI provides data retrieval and management functionality in the form of activities, which are roughly analogous to objects in programming languages. Activities exist for querying, transforming, modifying and delivering data. Clients are able to submit requests that mix and match activities together, piping the output of one activity to another and so on. Activities that support the querying of data resources using SQL, XQuery and XPath forward the query to the underlying database management system, and any data integration functionality implemented by activities is specified in a workflow-like manner rather than declaratively.

OGSA-WebDB adds Web-accessible databases to the set of data resources supported by OGSA-DAI. OGSA-WebDB is based on a mediator-wrapper architecture that is capable of compiling and executing queries over multiple Web databases. The mediator component is implemented as an extension to OGSA-DAI that allows OGSA-WebDB to co-exist with other OGSA-DAI data resources accessible through the same interface. Wrappers are provided for many popular Web databases, for example CiteSeer (http://citeseer.ist.psu.edu) and PubMed (http://www.ncbi.nlm.nih.gov/PubMed). Each wrapper provides a relational schema modeling the wrapped Web database and reconciles the differences between heterogeneous Web databases. The schemata provided by the wrappers are combined together by the mediator into a global relational schema that can be queried by clients using SQL. The mediator compiles and executes the query, potentially optimising at runtime the order of joins between data from multiple Web databases.

OGSA-DQP is a distributed query processing extension to OGSA-DAI that compiles, optimises and executes queries over multiple OGSA-DAI-wrapped relational data sources. Like OGSA-WebDB, OGSA-DQP is based on a mediator-wrapper architecture, with two salient differences: (1) the mediator component is based on a distributed architecture consisting of a set of Web services and (2) the wrappers used by the OGSA-DQP mediator are OGSA-DAI data services. The benefits of a distributed evaluation infrastructure include parallelism. OGSA-DQP performs a multi-phase query plan optimisation process consisting of logical, physical and parallel optimisation stages. OGSA-DQP implements query evaluation using a pipelined iterator model [3], which enables two kinds of parallelism: partitioned parallelism, where separate instances of an operator exist on different nodes, and pipelined parallelism, where different segments of a data set can be processed simultaneously on different nodes.
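The pipelined iterator model and partitioned parallelism mentioned above can be illustrated with a minimal Python sketch. This is not OGSA-DQP code: the operator names (table_scan, select_rows, project) and the partition helper are hypothetical, and a CRC32 hash is assumed purely to make the routing deterministic.

```python
import zlib
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def table_scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands tuples to its parent one at a time."""
    for row in rows:
        yield row


def select_rows(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter operator: pulls from its child only when its parent asks for a tuple."""
    for row in child:
        if predicate(row):
            yield row


def project(child: Iterator[Row], fields: List[str]) -> Iterator[Row]:
    """Projection operator: keeps only the named fields of each tuple."""
    for row in child:
        yield {name: row[name] for name in fields}


def partition(rows: Iterable[Row], key: str, n_instances: int) -> List[List[Row]]:
    """Partitioned parallelism: route each tuple to one operator instance by hashing its key."""
    buckets: List[List[Row]] = [[] for _ in range(n_instances)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n_instances
        buckets[index].append(row)
    return buckets
```

Because each operator is a generator, a tuple flows through the whole chain before the next one is read, which is the property that lets different pipeline segments run on different nodes at the same time.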

3 OGSA-DQP Extensions

In this section two extensions to OGSA-DQP are described. The first extension allows queries to be composed over OGSA-WebDB-wrapped data sources and the second enables the manipulation of XML data. Both extensions are implemented in a way that is minimally invasive, and a slightly modified version of the evaluation infrastructure of OGSA-DQP is used. The changes required amount to: (i) the introduction of an XML data type to the OGSA-DQP type system; (ii) a set of new operators required to retrieve data from Web databases, scan XML databases and manipulate XML data; (iii) compiler/optimiser modifications required in order to compile query execution plans which use the new operators and provide reasonable query response times.

3.1 Support for WebDB resources

The only changes required to the OGSA-DQP query evaluation infrastructure must be made to those operators that directly retrieve data from remote data sources. OGSA-DQP has two operators that perform this function: the TableScan operator and the HashLoopJoin operator. The HashLoopJoin operator is a loop join that can be parallelised for execution over multiple evaluators by partitioning input tuples using a hash function. When joining data from two input operators (left and right inputs), the operator works by building, for every left input tuple, a query that is submitted to the right input data source in order to retrieve the relevant tuples that may potentially be joined with the left input tuple.

Figure 1. OGSA-WebDB integration. This diagram illustrates the interactions between services and an example query plan that result from the submission of a query over OGSA-DAI and OGSA-WebDB services. Dotted lines denote the boundaries between query plan partitions, that is, groups of operators that are assigned together for execution on a specific evaluator. The query plan illustrates the method typically used by the OGSA-DQP optimiser now that it has been modified to support OGSA-WebDB resources. A loop join is utilised because a complete table scan of the Web database is impossible.

Query shown in Figure 1:

    select *
    from tableX, tableY, webDB
    where tableX.attribute=tableY.attribute
    and tableY.attribute=webDB.attribute

[Figure 1 diagram. Service interactions: the OGSA-DQP service, evaluator services A and B, two OGSA-DAI services (relational DBs) and an OGSA-WebDB service (web database). Parallel query plan: TABLE SCAN operators over table X and table Y feed a HASH JOIN, whose output is joined with webDB by a HASH LOOP JOIN; the plan is partitioned across evaluators A and B.]

The TableScan operator scans data from an OGSA-DAI-wrapped data source by sending an SQL query to the data source. The HashLoopJoin and TableScan operators are modified so that they are capable of querying OGSA-WebDB-wrapped data sources in addition to OGSA-DAI-wrapped sources.
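As a rough illustration of the HashLoopJoin behaviour described in this section, the following Python sketch issues one lookup against the right-hand source per left input tuple and partitions the left input by hash so that several evaluators could each run their own instance of the join. The function names and the query_right callback are assumptions made for the sketch; in OGSA-DQP the lookup would be a query sent to an OGSA-DAI or OGSA-WebDB service.

```python
import zlib
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
# Hypothetical lookup: (field name, value) -> matching tuples from the right-hand source.
RightQuery = Callable[[str, object], List[Row]]


def hash_partition(rows: Iterable[Row], key: str, n_evaluators: int) -> List[List[Row]]:
    """Split the left input so that each evaluator receives the tuples hashed to it."""
    parts: List[List[Row]] = [[] for _ in range(n_evaluators)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n_evaluators
        parts[index].append(row)
    return parts


def hash_loop_join(left_rows: Iterable[Row], left_key: str,
                   right_key: str, query_right: RightQuery) -> Iterator[Row]:
    """For every left tuple, ask the right source only for tuples that can join with it."""
    for left in left_rows:
        for right in query_right(right_key, left[left_key]):
            yield {**left, **right}
```

The right-hand source is never scanned in full: every request carries a value taken from a left tuple, which is why this operator suits sources that insist on a search term.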

In addition to the changes to the evaluation infrastructure described above, some modifications must be made to the compiler/optimiser. OGSA-DQP's optimiser is based on a greedy heuristic algorithm that orders joins based on cardinalities of relational tables (which are known, as this information is provided by the OGSA-DAI data sources) and predicate selectivities (which are estimated). Optionally, the cardinality of a Web database may be provided by the client during the schema import phase that precedes query submission; if this information is not provided, the optimiser simply assumes that the cardinality of each Web database is very large compared to the cardinalities of any relational database tables.
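The effect of the default cardinality assumption can be sketched as follows. This is a deliberately simplified stand-in, not the OGSA-DQP optimiser: it assumes that sources are ordered by estimated row count (cardinality multiplied by an estimated selectivity), and the constant VERY_LARGE and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed placeholder for "very large compared to any relational table".
VERY_LARGE = 10**12


def effective_cardinality(cardinality: Optional[int]) -> int:
    """Use the client-supplied cardinality if there is one, otherwise assume a huge source."""
    return VERY_LARGE if cardinality is None else cardinality


def greedy_join_order(sources: Dict[str, Tuple[Optional[int], float]]) -> List[str]:
    """Order sources by estimated rows (cardinality * selectivity), smallest first.

    `sources` maps a source name to (cardinality or None, estimated selectivity).
    """
    def estimated_rows(name: str) -> float:
        cardinality, selectivity = sources[name]
        return effective_cardinality(cardinality) * selectivity

    return sorted(sources, key=lambda name: (estimated_rows(name), name))
```

With no cardinality supplied for a Web database, it sorts last, so joins between relational tables are planned before the join that touches the Web database.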

Figure 1 illustrates the relationship between OGSA-DQP, OGSA-WebDB and OGSA-DAI services and the compilation of an example query. A potential problem arises from the fact that it is usually impossible to retrieve all the data from a Web database. Most Web databases provide an interface that requires at least one boolean search term to be provided when a query is submitted. As a result of this, the choice of physical plans must be limited to those plans which do not scan a Web database without applying some predicate (which is mapped by OGSA-WebDB to a search term). If no such predicates can be applied, the optimiser must use the HashLoopJoin operator when performing joins with Web databases; other joins such as HashJoin cannot be used because the relation cannot be retrieved from the Web database in its entirety. As a result, queries cannot be compiled if they require the retrieval of all the data within a Web database; for example, the query "select * from web_database" cannot be executed.
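The plan restriction described above amounts to a simple validity check, sketched here in Python. The SourceAccess record and its field names are hypothetical and only model the three cases in the text: a relational source, a Web database accessed with a predicate, and a Web database reached through a HashLoopJoin.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceAccess:
    """Hypothetical description of how a physical plan reads one data source."""
    name: str
    is_web_database: bool
    predicates: List[str] = field(default_factory=list)  # predicates pushed to the source
    operator: str = "TableScan"                          # operator that reads the source


def is_valid_access(access: SourceAccess) -> bool:
    """Reject plans that would need the entire contents of a Web database."""
    if not access.is_web_database:
        return True   # relational sources can always be scanned in full
    if access.predicates:
        return True   # a predicate can be mapped to a search term
    # Without a predicate, only a loop join can supply the search term per left tuple.
    return access.operator == "HashLoopJoin"
```

Under this check, a bare scan of a Web database is rejected, while the same source becomes acceptable once a predicate is pushed down or the access is made through a HashLoopJoin.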

3.2 Support for XML data

In order to support XML data, the OGSA-DQP query language is extended by first introducing an XML data type and secondly providing a set of functions for XML manipulation, conversion to/from XML and retrieval of XML from OGSA-DAI-wrapped native XML database collections. OGSA-DQP already supports the execution of functions in the form of Web services for data analysis, and the XML functions are only different in the sense that instead of invoking an external Web service, the evaluation infrastructure executes the XML function locally. Generally, the compiler/optimiser treats Web service invocations and XML functions the same during query compilation, which minimises the changes required within this component. The only difference in the way that Web service operations and XML functions are treated during optimisation is that Web service operations may be parallelised (scheduled to run simultaneously on different evaluation nodes), whereas there is no benefit in routing tuples to an additional evaluation node to execute an XML function in parallel, as the function is generally not a computationally expensive operation. The function signatures are based on a mixture of those added during the fifth revision of the SQL language (SQL:2003) and those used by major DBMS vendors and other systems such as the Tukwila data integration system [5]. As compliance with SQL in this domain is far from universal and there are various advantages to the non-SQL-compliant functions introduced in certain systems, the range of functions chosen here is the result of a mix and match from the aforementioned sources.

3.2.1 XML Functions

An overview is now given of some of the functions added to OGSA-DQP in order to support XML data. Here, a set of usage examples is provided for clarification purposes, where it is assumed that (i) the following two tuples are bound to the variable name name:

    id  surname
    ---------------
    1   Walker
    2   Davies

and (ii) the following two tuples are bound to the variable name course:

    studentXML
    ---------------
    <student>
      <id>1</id>
      <course>Database engineering</course>
      <course>Discrete mathematics</course>
    </student>
    <student>
      <id>2</id>
      <course>Data mining</course>
      <course>Machine learning</course>
    </student>

The following is a non-exhaustive list of XML functions supported. Section 4 shows how these functions can be used together to manipulate XML data in an OGSA-DQP query.

ExtractXMLValue (xml, string) returns collection<xml>: This function uses XPath 1.0 expressions to extract XML data. A single XML-typed value is returned for each node matching the given XPath expression, for example:

    select ExtractXMLValue(studentXML, '/student/id')
    from course

Results in two tuples consisting of a single XML field:

    <id>1</id>
    <id>2</id>

ExtractXMLStringValue (xml, string) returns collection<string>: Does the same thing as ExtractXMLValue except that each result is converted to a String value.

XMLElement (xml, string) returns xml: This function constructs one XML element node around an XML value. In addition to the XML value, the function takes a parameter specifying the name of the XML element to be created, for example:

    select XMLElement(ExtractXMLValue(studentXML,
        '/student/course'), 'CourseSelection')
    from course

Results in two tuples as follows:

    <CourseSelection>
      <course>Database engineering</course>
      <course>Discrete mathematics</course>
    </CourseSelection>
    <CourseSelection>
      <course>Data mining</course>
      <course>Machine learning</course>
    </CourseSelection>

XMLAgg (string, collection<xml>) returns xml: An aggregate function that creates an XML element with the name specified by the first parameter and nests inside this element the collection of XML values provided by the second parameter, for example:

    select XMLAgg('Courses',
        ExtractXMLValue(studentXML, '/student/course'))
    from course

Results in one tuple as follows:

    <Courses>
      <course>Database engineering</course>
      <course>Discrete mathematics</course>
      <course>Data mining</course>
      <course>Machine learning</course>
    </Courses>

XMLGen (string) returns xml: Used to construct XML results from a template structure. The template can reference relational or XML tuple fields that are inserted into an XML fragment that is emitted once per tuple, for example:

    select XMLGen('<student id="{$id}"
        name="{$surname}"/>') from name

Results in two tuples as follows:

    <student id="1" name="Walker"/>
    <student id="2" name="Davies"/>

XMLOccurs (xml XML, xPathExpr string) returns boolean: A boolean function that determines whether an XPath expression can be successfully matched with a given XML value, for example:

    select studentXML from course
    where XMLOccurs(studentXML,
        '/student/course/text()="Data mining"')

Results in one tuple as follows:

    <student>
      <id>2</id>
      <course>Data mining</course>
      <course>Machine learning</course>
    </student>

The functions listed above are those which are required in order to understand the query constructs that appear in the next section.
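To make the behaviour of these functions concrete, the sketch below emulates several of them in Python using the standard library's xml.etree.ElementTree. It is an approximation written for illustration, not the OGSA-DQP implementation: the Python function names are hypothetical, XML values are passed as strings, and ElementTree supports only a subset of XPath 1.0, so the XMLOccurs example is expressed with a predicate form that ElementTree accepts (`/student[course='Data mining']`).

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List
from xml.sax.saxutils import escape


def _match(xml_text: str, path: str) -> List[ET.Element]:
    """Evaluate an absolute-style path such as '/student/id' against one XML value."""
    wrapper = ET.Element("wrapper")   # ElementTree only accepts relative paths on elements
    wrapper.append(ET.fromstring(xml_text))
    return wrapper.findall("." + path)


def extract_xml_value(xml_text: str, path: str) -> List[str]:
    """Like ExtractXMLValue: one serialised XML value per matching node."""
    return [ET.tostring(node, encoding="unicode") for node in _match(xml_text, path)]


def extract_xml_string_value(xml_text: str, path: str) -> List[str]:
    """Like ExtractXMLStringValue: the text content of each matching node."""
    return ["".join(node.itertext()) for node in _match(xml_text, path)]


def xml_agg(name: str, values: List[str]) -> str:
    """Like XMLAgg: nest a collection of XML values inside one named element."""
    return f"<{name}>{''.join(values)}</{name}>"


def xml_gen(template: str, row: Dict[str, object]) -> str:
    """Like XMLGen: fill {$field} placeholders in a template from one tuple."""
    def fill(match: "re.Match[str]") -> str:
        return escape(str(row[match.group(1)]), {'"': "&quot;"})
    return re.sub(r"\{\$(\w+)\}", fill, template)


def xml_occurs(xml_text: str, path: str) -> bool:
    """Like XMLOccurs: true if the path matches at least one node."""
    return bool(_match(xml_text, path))
```

Applied to the two course tuples above, `extract_xml_value(studentXML, "/student/id")` yields one `<id>` value per tuple, and `xml_occurs(studentXML, "/student[course='Data mining']")` is true only for the second student.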

4 Application

This section presents an application that illustrates how the extended OGSA-DQP can meet the distributed data analysis and integration requirements of a UK e-Science project, eXSys [4], which focused on analysis of the interactions in complex systems such as social, biological and ecological networks. One such application used graph theoretic analysis algorithms based on properties shared by many biological interaction networks to identify target proteins for pharmaceutical drugs in intracellular protein interaction networks. In this application scenario a database of protein interactions is available as the result of experimental activity and the collection of published protein interaction data in various literature. From this data a protein interaction network is constructed for a specific organism and encoded as an adjacency matrix that is subsequently analysed in order to determine proteins that are of high importance to the integrity of the network based on their relative position in the network alone. Finally, proteins flagged as being important by the analysis algorithms are annotated with further data from Web-accessible Bioinformatics resources.

In essence, the idea is to identify, from a very large set of proteins, a smaller set of potentially interesting proteins that can be feasibly investigated in detail by a human expert. Firstly, data is retrieved from relational database tables to form an undirected graph structure representing protein interactions for a specific organism. In this graph each node is a protein and a link between two proteins represents an interaction between them. Next, an analysis program is executed with the graph structure created in the first step as input. Each protein is ranked according to its predicted importance to the structure of the protein interaction network. Finally, those proteins that are ranked as being relatively important to network integrity are annotated with additional information that can be found in Web database sources such as UniProt (http://www.pir.uniprot.org).

Assuming that the necessary data and analysis services are available in a way that is compatible with OGSA-DQP (the database sources are wrapped using OGSA-DAI/OGSA-WebDB and the analysis program is exposed as a Web service), it is now demonstrated how this application can be managed using the extended OGSA-DQP. The client submits a query and OGSA-DQP compiles, optimises and executes the query without the client needing to specify any details about how this is done. The scenario involves an analysis service named 'evaluate' which accepts an XML representation of an interaction network and returns a result document which ranks the significance of each protein in the range [0, 1] and encodes the results in XML. The following example illustrates how the input network is encoded as a set of links and how the results are encoded as a set of significance scores when invoking the analysis service.

Input:

    <links>
      <link to="P39958" from="P11991"/>
      <link to="P20338" from="P39958"/>
      <link to="P14922" from="P11202"/>
    </links>

Result:

    <nodes>
      <node name="P39958" value="0.071"/>
      <node name="P20338" value="0.004"/>
      <node name="P14922" value="0.03"/>
    </nodes>
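The two encodings above are simple enough to produce and consume with a few lines of code. The following Python sketch, using only the standard library, builds a links document from (nodea, nodeb) pairs and picks out the proteins whose score exceeds a threshold from a nodes document. The function names are hypothetical, and the mapping of nodea to the from attribute and nodeb to the to attribute is an assumption taken from the way Query-A fills its template.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def build_links(interactions: Iterable[Tuple[str, str]]) -> str:
    """Encode (nodea, nodeb) pairs as the <links> document accepted by the analysis service."""
    links = ET.Element("links")
    for node_a, node_b in interactions:
        ET.SubElement(links, "link", {"from": node_a, "to": node_b})
    return ET.tostring(links, encoding="unicode")


def significant_proteins(result_xml: str, threshold: float) -> List[str]:
    """Return the names of proteins whose significance score exceeds the threshold."""
    root = ET.fromstring(result_xml)
    return [
        node.get("name")
        for node in root.iter("node")
        if float(node.get("value")) > threshold
    ]
```

In the extended OGSA-DQP these two steps are expressed declaratively inside a single query rather than in client code, which is the point the application is intended to demonstrate.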

Two relational database tables exist, species and interaction, along with the UniProt Web database which is wrapped using OGSA-WebDB. This results in the global relational schema compiled by OGSA-DQP:

    Table name    Field name       Type
    ------------------------------------
    interaction   interaction_id   varchar
                  nodea            varchar
                  nodeb            varchar
    species       interaction_id   varchar
                  species          varchar
    uniprot       protein_name     varchar
                  name             varchar
                  organism         varchar
                  gene             varchar

The interaction table defines a set of interactions between two proteins and the species table maps interactions to the species in which they occur. The interaction_id field can be used to join the species and interaction tables. The nodea and nodeb fields of the interaction table identify proteins using Swiss-Prot accession numbers. The protein_name field from UniProt can be queried using Swiss-Prot accession numbers, which allows a join with tuples from the interaction database using the nodea and nodeb fields. The application is implemented by the following query, which selects protein interactions from a given species, analyses them and retrieves more information on proteins which are of a significantly high importance within a protein interaction network.

Query-A:

    select analysis.pn, uniprot.gene from uniprot,
    (
      select ExtractXMLStringValue(xml, '//@name') as pn
      from (
        select ExtractXMLValue(
          evaluate( XMLElement( XMLAgg( XMLGen(
            '<link from="{$interaction.nodea}" to="{$interaction.nodeb}" />')),
            'links' ) ), '//node'
        ) as xml
        from interaction, species
        where species.interaction_id=
              interaction.interaction_id
        and species.species='ORGANISM'
      )
      where XMLOccurs(xml, '//*[@value>0.1]')=true
    ) as analysis
    where analysis.pn=uniprot.protein_name


Figure 2. Physical query plan. The physical query plan produced by the optimiser for Query-A. Circles denote operators; for operators that have a different input cardinality to their output cardinality, the number of tuples produced by the operator is shown for each of the three organisms involved in the experiment, denoted as mm (mus musculus), hs (homo sapiens) and sc (saccharomyces cerevisiae). Projection operators are omitted to simplify the diagram.

[Figure 2 diagram: TABLE SCAN operators over species and interactions feed a HASH JOIN, followed by XMLGen, XMLAgg, XMLElement; the Web Service Operation Call; ExtractXMLValue, XMLOccurs, ExtractXMLStringValue; and a HASH LOOP JOIN with UniProt. Tuple-count annotations in the figure: {mm,hs,sc}=13145; mm=152, hs=1005, sc=11860; mm=63, hs=1020, sc=11873; mm=91, hs=229, sc=169; mm=78, hs=198, sc=165.]

Here ORGANISM is replaced by one of the organisms present in the species database, for example 'saccharomyces cerevisiae' or 'homo sapiens'. This query can be expressed as 'return the ID and gene name of all proteins belonging to the organism's protein interaction network that have a predicted significance value greater than 0.1'. This query results in the physical query execution plan illustrated in Figure 2, which is annotated with the number of tuples output by operators at each stage in the query plan. The properties of the parallel query plan produced by the optimiser depend on the resources (in particular, evaluation services) available to execute the query.

4.1 Experiment

The aims of the experiment presented here are to analyse the performance of the extensions to OGSA-DQP and to see if the application presented in this paper can benefit from OGSA-DQP's ability to parallelise the execution of a query. Experiments are performed using 2 GHz AMD Opteron machines with 6 GB of RAM running SuSE Linux, connected via a 100 Mb/s Ethernet LAN. OGSA-DAI WS-I 2.2 and the OGSA-DQP 3.2 Tech preview releases are used. Two OGSA-DAI data services are deployed, each exposing one of the relational tables (species and interaction), an OGSA-WebDB data source is deployed providing access to the UniProt Web-accessible database, and finally the protein interaction network analysis Web service is also deployed. Services, whether they be data, analysis or evaluation services, are all deployed on different hosts on the same LAN; in other words, no two services are deployed on the same node. Experiments are performed by issuing a query first with one evaluator available and subsequently making another evaluator available and repeating the experiment until 5 evaluators are used. This whole process is repeated three times using Query-A with three different organisms: "mus musculus", "homo sapiens" and "saccharomyces cerevisiae". In each case, the response time of the query is measured and the query execution plan produced by the optimiser is recorded.

4.2 Results

Figure 3 shows the query response times as the number of evaluation nodes (i.e. the level of parallelism) is increased. Response times vary significantly depending on the organism involved due to the difference in the number of proteins present in each case (refer to Figure 2). It can be seen that increasing the number of evaluators has a positive effect and reduces query response time, in particular when the number of evaluators is increased from two to three.

In order to analyse the results of the experiment, it is necessary to briefly discuss the decisions made by the optimiser when choosing a parallel execution plan for Query-A. Following the generation of a physical query plan, the optimiser identifies candidate operators for partitioned parallelism and estimates their cost in terms of execution time. Candidates for partitioned parallelism include hash table based join operators and operators that execute invoked Web services. In the physical query plan for Query-A (Figure 2), there are only two candidates, the two join operators. Web service invocations are only parallelised if there are multiple copies of the service to invoke, otherwise the optimiser assumes there is no reduction in response time to be gained by parallelising the operator. The optimiser takes the operator that is estimated to be the most costly parallelisable candidate operator and increases the level of parallelism if more evaluators are available to execute the operator. This process is repeated until there is either no benefit in parallelising the most costly candidate or there are no free evaluators left onto which operators may be parallelised. Following this, the optimiser assigns operators to evaluators with the aim of decreasing communication costs.
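The parallelisation loop described above can be sketched in Python as follows. The cost model is an assumption made for the sketch (an operator's cost is taken to divide evenly across its instances), as are the function name, the min_gain threshold and the example cost figures; the real optimiser's cost estimates are not given in the text.

```python
from typing import Dict, List


def assign_parallelism(costs: Dict[str, float], parallelisable: List[str],
                       free_evaluators: int, min_gain: float = 0.0) -> Dict[str, int]:
    """Repeatedly give one more evaluator to the most costly parallelisable operator.

    Assumed cost model: an operator with cost c run on d evaluators costs c / d.
    Stops when no evaluators are free or the best candidate gains no more than min_gain.
    """
    degree = {op: 1 for op in costs}
    while free_evaluators > 0 and parallelisable:
        # The candidate with the highest current (per-instance) cost.
        op = max(parallelisable, key=lambda name: costs[name] / degree[name])
        current = costs[op] / degree[op]
        improved = costs[op] / (degree[op] + 1)
        if current - improved <= min_gain:
            break  # no benefit in parallelising the most costly candidate
        degree[op] += 1
        free_evaluators -= 1
    return degree
```

With illustrative costs of 10 for the HashJoin and 100 for the HashLoopJoin and three spare evaluators, every spare evaluator goes to the HashLoopJoin under this model, since its per-instance cost stays above the HashJoin's cost throughout.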


Figure 3. Response times for Query-A. Graph plotting number of evaluators (x-axis, 1 to 5) against response time in seconds (y-axis, 0 to 250) for the query involving three different organisms (mus musculus, homo sapiens and saccharomyces cerevisiae).

In our experiments, the aim was to investigate the interaction between OGSA-DQP and OGSA-WebDB and therefore the optimiser’s internal parameters were weighted in favour of parallelising the join with the Web database. The optimiser therefore chooses to parallelise the join involving the Web database as much as possible in order to try to lower response time. As it is a relatively slow operation, the access time for OGSA-WebDB-wrapped data sources can dominate response time and therefore other approaches may be considered in future work to alleviate this problem. One possible approach is to enhance OGSA-DQP with the ability to invoke the wrappers used by OGSA-WebDB directly, eliminating the need to use the OGSA-WebDB mediator and therefore lowering the response time significantly.

Figure 4. Parallel query plans generated to evaluate Query-A. When more than two evaluators are available, the hash loop join operator is parallelised: separate instances of the operator exist on multiple nodes. Labels shown in the diagram: TABLE SCAN (species), TABLE SCAN (interactions), HASH JOIN, HASH LOOP JOIN, UniProt, ExtractXMLValue, XMLOccurs, ExtractXMLStringValue, XMLGen, XMLAgg, XMLElement, Web Service Operation Call; nodes are labelled Node 1 and Node 2, ..., Node N.

The decisions made by the optimiser are the same regardless of the organism used in the query: when two evaluators are made available, the optimiser assigns one join operator to each of them. Subsequently, as the number of evaluators increases, the HashLoopJoin operator is scheduled for execution on as many evaluators as possible, as illustrated in Figure 4. Therefore, all the improvement in query response time is gained by the parallelisation of the join with the OGSA-WebDB-wrapped UniProt database. Pipelined parallelism is supported by OGSA-DQP; however, this does not affect response times significantly here, as can be seen when the number of evaluators is increased to two. The most noticeable decrease in query response time (for all organisms) occurs when the number of available evaluators is increased from two to three, which can be attributed to the first partitioned parallelism of the HashLoopJoin operator. The effect of further parallelism of the operator is much less; however, the experiment has illustrated some benefit with regard to response time provided by the parallel and distributed evaluation infrastructure when executing the query. An important additional benefit also exists in that the parallelism of join operators distributes memory usage over multiple nodes. This may be of use in particular with queries that retrieve data from large Web databases and join this with other data, where executing the join on a single node may lead to running out of memory.
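The memory benefit of partitioned join parallelism can be illustrated with a small Python sketch. This is a generic hash-partitioned join, not the HashLoopJoin operator of OGSA-DQP, whose internals are not described here; the function names, the row format (dictionaries) and the single-process simulation of "nodes" are all assumptions made for illustration.

```python
from collections import defaultdict


def partition(rows, key, n_nodes):
    """Route each row to a node by hashing its join key."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts


def hash_join(build_rows, probe_rows, key):
    """Build a hash table on one input, probe it with the other.

    Returns the joined rows and the number of rows held in the hash table.
    """
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = [{**b, **p} for p in probe_rows for b in table.get(p[key], [])]
    return joined, sum(len(bucket) for bucket in table.values())


def partitioned_join(build_rows, probe_rows, key, n_nodes):
    """Simulate one join instance per node, each seeing only its partition."""
    build_parts = partition(build_rows, key, n_nodes)
    probe_parts = partition(probe_rows, key, n_nodes)
    results, table_sizes = [], []
    for b, p in zip(build_parts, probe_parts):
        joined, size = hash_join(b, p, key)
        results.extend(joined)
        table_sizes.append(size)
    return results, table_sizes


if __name__ == "__main__":
    # Hypothetical data: integer keys keep the partitioning deterministic.
    proteins = [{"id": i, "name": f"p{i}"} for i in range(8)]
    annotations = [{"id": i, "score": i * 10} for i in range(8)]
    _, sizes = partitioned_join(proteins, annotations, "id", n_nodes=4)
    print(sizes)  # rows held in the hash table on each simulated node
```

Because rows with equal keys always hash to the same partition, the union of the per-node results equals the single-node join, while each node only needs to hold its own share of the build input in memory.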

5 Related Work

Work on exposing data in Grids focuses on the development of techniques that can complement Grid characteristics, i.e. a distributed service-based architecture. Work has progressed in the development of separate specifications for exposing XML, relational and RDF data within the domain of the DAIS working group [1]. The interfaces defined by DAIS are currently different from the ones supported by OGSA-DAI (and therefore OGSA-DQP and OGSA-WebDB), but some convergence is taking place in this area and OGSA-DAI has started to release implementations that support DAIS interfaces.


ObjectGlobe [2] is similar to the work presented in this paper as it can execute distributed queries over the Internet; however, ObjectGlobe is not built upon the capabilities offered by Grid infrastructure as are OGSA-DQP and OGSA-WebDB. The same distinction applies to declarative data integration and analysis efforts such as SkyQuery [7]. GridDB [6] supports data integration orchestrated by workflows expressed as functional programs, which differs from the declarative approach of OGSA-DQP/OGSA-WebDB. GridDB-Lite (aka STORM) [8] is perhaps a compromise between these two approaches, where a variation of the SQL language is used that contains constructs that can be used to specify how a query is parallelised using a set of processors.

Major database vendors now offer Grid-themed features including support for clustered servers and distributed data resources in products such as IBM Information Integrator (http://www-306.ibm.com/software/data/integration/) and Oracle 11g (http://www.oracle.com). The work presented in this paper offers a far less expressive query language and less sophisticated optimisation strategies than these products but is based on a distributed Web service based architecture (consisting of OGSA-DQP and OGSA-WebDB services), which is built on an existing open source platform (OGSA-DAI).

6 Conclusions

It has been demonstrated how OGSA-DQP and OGSA-WebDB are complementary tools which can be used to provide capabilities for data integration in service-based Grids. With the addition of support for XML, OGSA-DQP was shown to be capable of optimising and executing a similar process to that of an existing e-Science application. The advantages of using the extended OGSA-DQP in conjunction with OGSA-WebDB in this scenario are that there is no need to construct any application specific data integration components, there is no need to specify a workflow or describe the execution of the process (the client specifies the desired result through a declarative expression), and OGSA-DQP uses the service-based Grid infrastructure to efficiently execute the query; aspects such as scheduling are handled automatically and it was shown that parallelisation using a distributed evaluation infrastructure could improve query response times. Software described in this paper (OGSA-WebDB and the extensions to OGSA-DQP) is available at http://dbgrid.org.

Acknowledgements The authors would like to thank Masahiro Kimoto for his work on developing wrappers for OGSA-WebDB. Discussions with various people associated with OGSA-DQP at the University of Manchester, in particular Norman W. Paton and Alvaro A.A. Fernandes, are acknowledged. This work was made possible by the support of the Japan Society for the Promotion of Science (JSPS).

References

[1] Mario Antonioletti, Amy Krause, Norman W. Paton, Andrew Eisenberg, Simon Laws, Susan Malaika, Jim Melton, and Dave Pearson. The WS-DAI family of specifications for web service data access and integration. SIGMOD Rec., 35(1):48–55, 2006.

[2] R. Braumandl, M. Keidl, A. Kemper, D. Kossmann, A. Kreutz, S. Seltzsam, and K. Stocker. ObjectGlobe: Ubiquitous Query Processing on the Internet. VLDB Journal, 10(1):48–71, 2001.

[3] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. In Proc. SIGMOD, pages 102–111, 1990.

[4] O. Idowu, S. Lynden, and P. Andras. e-Science Tools for the Analysis of Complex Systems. In Proc. e-Science All Hands Conference, pages 320–325, 2004. http://www.allhands.org.uk/2004/proceedings/.

[5] Z. Ives, A. Halevy, and D. Weld. Integrating Network-Bound XML Data. IEEE Data Engineering Bulletin, 24(2), 2001.

[6] D.T. Liu and M.J. Franklin. GridDB: A Data-Centric Overlay for Scientific Grids. In Proc. VLDB, pages 600–611. Morgan Kaufmann, 2004.

[7] T. Malik, A.S. Szalay, T. Budavari, and A.R. Thakar. SkyQuery: A Web Service Approach to Federate Databases. In Proc. CIDR, 2003.

[8] S. Narayanan, T.M. Kurc, and J. Saltz. Database Support for Data-Driven Scientific Applications in the Grid. Parallel Processing Letters, 13(2):245–271, 2003.

[9] J. Smith, A. Gounaris, P. Watson, N. W. Paton, A. A. A. Fernandes, and R. Sakellariou. Distributed Query Processing on the Grid. In Proc. Grid Computing 2002, pages 279–290. Springer, LNCS 2536, 2002.
