Transcript
Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

A Syntactic Rule Based Approach to Web Service Composition

Ken PuUniversity of [email protected]

Vagelis HristidisFlorida International University

[email protected]

Nick KoudasUniversity of Toronto

[email protected]

Abstract

This paper studies a problem of web service composi-tion from a syntactic approach. In contrast with other ap-proaches on enriched semantic description such as state-transition description of web services, our focus is in thecase when only the input-output type information from theWSDL specifications is available.

The web service composition problem is formally for-mulated as deriving a given desired type from a collec-tion of available types and web services using a prescribedset of rules with costs. We show that solving the minimalcost composition is NP-complete in general, and presenta practical solution based on dynamic programming. Ex-periements using a mixture of synthetic and real data setsshow that our approach is viable and produces good results.

1 Introduction

Several efforts, including the Web Service ConversationLanguage (WSCL), the Business Process Execution Lan-guage for Web Services (BPELWS), and the integration ofweb service calls on XQuery [10] address the issue of webservice composition. A natural problem is to automate thecomposition of web services. Towards this goal, a large cor-pus of work has focused on composing semantically rich(semantically annotated) web services. For instance, in[2, 5, 1] web services are modeled as finite-state machinesthat reflect their functionality, and their compositions aremodeled as compositions of state machines. Such a seman-tically rich description of the web services allows discoveryof high-quality compositions [2] as well as verification ofexisting ones [5]. However, WSDL, which is the currentlywidely adopted standard for description of web services, of-fers limited syntactic description of the services. With onlysuch limited information on the input-output types of theweb services, is it possible to automatically produce com-positions to perform specific tasks?

In this paper, we propose a syntactic approach to webservice composition, given only their WSDL descriptions.In particular, we view the web services as black boxes ca-pable of transforming XML fragments of the input type toXML fragments of the output type, and hence the prob-

lem of web service composition becomes a type-derivationproblem as follows: Given a collection T0 of base types, de-scribed in XML Schema, a collection S of web services, anda target type t, derive t from T0 using the services in S.

The target type is derived by applying a set of derivation-rules, or simply rules. These rules, which we will presentin detail, describe the ways in which new types are derivedfrom existing ones by application of web services. The tar-get type is derived from the base types by a series of ap-plications of these rules, which include sequential or paral-lel invocation of web services, and iteration over contain-ers. Each step of a derivation (application of a rule) incursa cost according to a cost model, which generally favorsderivations with fewer rule applications (i.e., fewer struc-tural transformations and web service applications).Example: Consider the type books which is a collection ofbook elements. Each book element is identified either by itstitle or its ISBN number, along with the author names andthe publisher.

books : book[1,∞] :

⎡⎢⎢⎢⎣

union

[title : string orISBN : string

authors : author[1,∞] :

[fname : stringlname : string

publisher : string

Suppose that we have two web services described as fol-lows:

WS1 :

[title : stringpublisher : string →

[bestprice : numericbookstore : string

WS2 : ISBN : string→

[bestprice : numericbookstore : string

Suppose that one wishes to find the best prices of all thebooks in a collection (of type books). The target type is:

bestprices : bestprice[1,∞] :

[bestprice : numericbookstore : string

One can manually verify that the target type bestprices

can indeed be derived from the collection of base types T0 ={books} and web services S = {WS1, WS2}. The derivation is notunique, but a sensible one is the following.

Derivation Plan:1. Iterate through the collection books,2. if it is identified by its title, then3. pass the title and publisher information to WS1,4. else, if it is identified by its ISBN, then5. pass the ISBN number to WS2.6. Combine the outputs and form the collection bestprices.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 2: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

The rules proposed in this paper support this derivationas we provide rules for iteration (line 1), if-then-else on atype (line 2 and 4), type-construction and destruction (lines3, 5 and 6), and application of web services (lines 3 and 5).We use a cost model to eliminate non-sensible derivations(such as using the author’s last name as the publisher, andso on).

Naturally, the quality of the generated candidate compo-sitions depends on the choice of the derivation rules. The setof rules used in this paper, which are shown to be sound ex-perimentally, are based on previous work on schema match-ing, type derivation and tree edit distance, which are ar-eas related to this work as we explain in Section 2. How-ever, our algorithmic framework is generic and a differentset of rules can be specified according to the applicationneeds 1. Our case studies show that one can still find use-ful derivations using real-life WSDL files. In this paper, wepresent our theoretical and algorithmic results on the type-derivation problem.

2 Related Work

Tree edit distance works [16, 21] (see [3] for a sur-vey) define a minimal set of operations (typically add node,delete node, and relabel node) to tackle the problem oftransforming a labeled tree T1 to a labeled tree T2 apply-ing a minimum sequence of operations. Our problem ismore complex since the labels of an XML Schema tree carryadditional information (e.g., list, union, complex type, tagname, base-type, and so on) which makes plain relabelinginapplicable. For example, if two XML Schema trees differonly on the label of a single node, whose label is “union”for the first tree and “list” for the second, then they haveunit distance in tree edit distance, but they are incompatiblein our framework. Tree matching works [4] have the samelimitation.

Dong et al. [6] propose a method to discover similarWeb Services based on the textual descriptions (in WSDL)and their input/output parameters’ names. It does not dealwith composition however. Paolucci et al. [11] describe aframework to semantically match Web services describedusing DAML-S. Recent work has tackled the problem ofcomposing web services by exploiting the flow diagramsthat represent them. In particular, web services are viewedas state machines [20, 18, 1, 2].

Wombacher et al. [18] determines if a composition isvalid by the intersection of the corresponding state ma-chines. Furthermore, since the web services are modeled astransition structures, one can verify the web service compo-sition by simulation [9] or by model-checking methods [5].Such works, which are complementary to our approach, as-sume access to the behavioral characteristics of the servicewhich may or may not be available; we only assume ac-cess to the service type signature which is typically avail-able through WSDL (and XML Schema). In SWORD [12],

1For instance, if we want to create more conservative compositions, wecan remove the optional rule (described in Figure 3).

web services are treated as input-output functions, and a setof rules of composition is specified. In our work, we alsotreat web services as functions, but work with a more com-plex set of rules that are capable constructing more sophis-ticated composition plans.

Schema matching tackles the problem of finding corre-spondences between the elements of two given schemas.Much work has been conducted on matching relationalschemata [14], but recently there has also been work onmatching of semi-structured schemata [13, 17, 15, 7, 8].These algorithms match two schemata by matching theirtree structures, exploiting possible semantic constraints[7, 17, 15]. In [8], the management of these mappings isconsidered. In contrast, our work tackles the problem oftransforming a set of base types (schemata) to a target typeusing the available web services as transformation tools.The web services can be invoked in various ways duringthe transformation process according to a set of derivationrules.

3 The Optimal Type-Derivation Problem

In this section, we formally define the optimal type-derivation problem with respect to an arbitrary set of deriva-tion rules. Then, we introduce a set of XML-centric ruleswhich are used to capture the ways in which XML Schemascan be modified, combined, and web services invoked, inthe derivation process. These rules naturally correspond toprogramming constructs found in BPEL, such as sequentialand parallel invocations, iteration over lists, and branch-ing based on node-types. We also included limited struc-tural transformations, which can easily be implemented byXQuery, as part of the rule-set. Derivations using these rulescorrespond to compositions of web services coupled withpossible structural transformations.

We represent an XML Schema as a tree of labeled nodes.The fragment of XML Schema we consider contains com-plex types of sequence, choice, and elements can haveminOccurs and maxOccurs constraints. As a labeled tree,each node can be a tag name, a complex type construc-tor (sequence, union), or a primitive type such as int orstring. A union-node represents a choice-complex type,that is, the instance of which can only be one of the chil-dren types in accordance with the XML Schema specifica-tion. Each node may optionally have a multiplicity modifier[m,n] indicating that in the instance, its occurrence is be-tween m and n inclusively – this corresponds to the minOc-curs and maxOccurs constraints in XML Schema. We calla node optional if it is either a union-node, or a node withminOccurs = 0 constraint. Types can, thus, be representedby terms such as UNION(title(string), ISBN(string)). A web ser-vice is simply a multi-input-multi-output function of theform f : s1,s2, · · · ,sn → t1, · · ·tm where si and ti are com-plex types.

PROBLEM FORMULATION: Let Ai and B be sets of types,called environments, and si and t types. We write Ai � si tomean that the type si can be derived from the environment

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 3: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

Ai. Intuitively, an environment Ai is the collection of avail-able types, and Ai � si is to say that, by some means, we canderive si from Ai. A derivation-rule, or simply a rule, hasthe form

A1 � s1 · · · An � sn

B � tr.

It reads that if si can be derived from the environment Ai,then t can be derived from the environment B. The rule canbe written as r(B � t) to indicate that r derives t from theenvironment B. A rule is grounded if it requires no Ai � si,written /0

B�t . We further assume that each rule r carries a costgiven by a generic cost function cost(r(B � t)). Derivationrules can be composed in the natural way to form deriva-tions.

Formally, a derivation D(B � t) is defined as,

• any rule r(B � t) is a derivation, and

• if r(B � t) is a rule of the formA1 � s1 · · · An � sn

B � tr,

and Di(Ai � si) are derivations, then

D1(A1 � s1) · · · Dn(An � sn)

B � tr is a derivation.

• Nothing else is a derivation for B � t.

A derivation D(B � t) = {Di(Ai�ti)}iB�t r can be viewed as a tree

of rules, with the root being the rule Ai�tiB�t r, and sub-trees

Di(Ai � ti). It says that type t is derived from the environ-ment B by the derivation D. The derivation is groundedif its leaf-rules are all grounded. The cost for a derivationD(B � t), written ‖D(B � t)‖ is defined as,• if D(B � t) = r(B � t), then ‖D(B � t)‖= cost(r(B � t)),

• if D(� t) =D1(A1�s1) ··· Dn(An�sn)

B�t r, then

‖D(B� t)‖= cost(r(B� t))+g(‖D1(A1 � s1)‖, . . . ,‖Dn(An � sn)‖),

where g(· · ·) is either sum, or maximum 2 depending on therule r.

Definition 1. The optimal type derivation problem is: givena rule set R, a cost function cost, an environment A and atype t, find a grounded derivation D(A � t) with the minimalcost ‖D(A � t)‖.

The classical tree-edit distance problem [3] is a specialinstance of the optimal type derivation problem where theenvironment contains the source tree, and the derived typeis the target tree. There are three rules: insert-, delete-and replace, each of unit cost. Tree edit-scripts corre-spond to derivations where the cost is always additive. Thederivations, as defined, generalize tree-edit scripts in vari-ous ways. Derivations, in general, edit multiple trees, ortypes, in the environment to derive the new tree, and the

2In most cases, the cost of a derivation is the sum the costs of the sub-derivations and the applied rule. However, as defined in Figure 3, if theapplied rule is union-rule, then we take the maximum of the costs of thesub-derivations.

rules allowed are a richer set that includes the basic tree-edit operations, but also ways of invoking web services.

From this point, we refer to types in the environment asthe base-types, and the type to be derived as the target type.

3.1 Derivation rules for XML Schema

We use rules to express ways in which XML-documents(more accurately XML Schemas) can be transformed, com-bined and evaluated by web services. The set of rules in-clude basic structural transformations (similar to the clas-sical tree-edit operations), merging structures (substitutingparts of a schema document with parts from other schemadocuments), and finally applying web services (evaluatinga tuple of documents using available web services). In thisframework, the web service composition problem is for-mally posed as a type-derivation problem in which the base-types are the available inputs, and the target type is the de-sired output. A derivation of the target type is then a plan ofhow the available web services can be composed, togetherwith possibly some structural modifications to the interme-diate results, to produce the desired output.

/0A � t

MEM, if t ∈ A.A1 � t1 A2 � t2 · · ·Ai � tn

i Ai � c(t1, t2, . . . ,tn)CON

A � tA � t|p

DES, where p is not under an optional node.

A � tA � t|p̄

DELA � t A � s

A � t←p

sINS

A � t A � sA � t[s]p

REP

Figure 1. Structural transformations

BASIC STRUCTURAL TRANSFORMATION RULES: Thefirst set of rules deals with structural transformations. Werefer to this set of rules as the structural rules, shown inFigure 1.Membership: If type t is part of the environment, then it canbe derived from the environment by the membership-rule(MEM).Construct: One can derive new types from existingtypes by introducing new tags. Suppose one hasderived ISBN(string) and city(string), then cer-tainly, we should be able to construct a new elementX(ISBN(string), city(string)) by introducing thenew tag X. This is captured by the construction-rule (CON)shown in Figure 1.Destruct: Contrary to the construct-rule, suppose thatwe have the element sequence(title(string),ISBN(string)), then it is reasonable to deriveISBN(string) from it. Denote the sub-type (i.e., thesub-tree) of type t located from the root along a path p byt|p. The destruction-rule (DES) derives the sub-type t|p oftype t as long as t|p is not under some optional node. Thereason that we do not allow applications of the destruct-ruleat a sub-type under an optional node is simply that at theinstance level, that sub-type may not be present. These

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 4: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

/0A � address(· · ·)

MEM

A � postal(string)DES

/0A � person(age(int),name(string))

MEM

/0A � address(· · ·)

MEM

A � city(string)DES

A � person(age(int),city(string))REP

A � office(postal(string), person(age(int), city(string)))CON

Figure 2. A derivation plan.

sub-types are accessible by the complex rules presented inFigure 3.Delete: The delete-rule (DEL) removes a sub-tree from atree.Insert: Similarly, one can insert a sub-type under a type asdescribed by the insert-rule (INS). Denote the result of in-serting type s into type t under the node located at path pas t ←

ps. If one can derive t and s, then application of the

insert-rule derives t←p

s.

Replace: The replace-rule (REP) allows one to replace a sub-type at path p of a type t with another derived type s, writtent[s]p.

This set of rules is not minimal in the sense that, forinstance, the replace-rule is equivalent to composition ofinsert and delete-rules. However, depending on the costfunction, it may be cheaper to replace rather than deleteand then insert.

Example: Suppose the set of base-types is

A =

{address :

[city : stringpostal : string , person :

[age : integername : string

}

And the target type t is office :

⎡⎣ postal : string

person :

[age : integercity : string

One derivation D(A� t) is shown in Figure 2. In the plan,the target type is derived by replacing the sub-tree nameunder person with the sub-tree city which is derived bydestructing the type address. Together, with the sub-treepostal, one uses construction to derive the final type. �

Note that the rules do not pay attention to the sequence ofthe children nodes. We actually ignore the order among thechildren nodes in the type, even though in general, XMLSchema is order sensitive. The alternative is to introducean additional structural rule, permute, which re-orders thechildren nodes under a given parent node. In this paper, wedo not consider permutation, thus consider two nodes equalif their children are equal upto permutation.

DEALING WITH WEB-SERVICES AND XML-SCHEMA:Clearly, structural rules are not suitable to lead to automa-tion of web service compositions: they do not deal withimportant parts of XML Schema, in particular, union nodesin the type, and minOccurs and maxOccurs constraints ofelements. Furthermore, they do not capture web services.Thus, we extend the rule set (Figure 3) to include rulesdealing with union’s, web services and finally dealing with

A � union(t1, t2, · · ·) A∪{ti} � t for all tiA � t

UNI

A � s1 · · ·A � sm A∪{t1, t2 · · ·tn} � t

A � tAPP( f )

where there is a web service f : s1,s2, · · ·sm→ t1, · · · , tn.

A � s[m,n] A∪{s} � t

A � t[m′,n′]MAP where m′ ≤ m,n′ ≥ n.

/0A � s[0,n]

OPTN,

sA � s[m,n]

LIST,m≤ 1

A � c(t1, t2, · · · , tn) A � c(s1,s2, · · · ,sm)

A � c(u1, . . . ,um+n)CAT

where u1, . . . ,um+n is an ordering of t1, . . . ,tn,s1, . . .sm

Figure 3. More complex XML-centric rules

minOccurs, maxOccurs constraints.Union: The union-rule (UNI) reads as: if A can deriveunion(t1, t2, · · ·) and for all i ≤ n, ti together with A canderive t, then, A can derive t. Consider the following exam-ple: Let A = {c(union(a(b),b))}, and t = b. There is onlyone base-type. We cannot derive the target t from the base-type by structural-rules since destruction cannot be appliedbelow a union-node. However, using the union-rule, onecan derive t:

A � c(union(a(b),b)), A∪{a(b)} � b, A∪{b} � b

A � bUNI

Apply: The apply-rule (APP) allows one to use an availableweb service to transform the tuple of input schemas to theoutput schema. It reads, if f : s1,s2, · · · → t1, t2, . . . is anavailable web service, and from the environment A, si arederivable, and A plus {t j} can derive t, then we can derive tfrom A using the web service.

For example, a web servicegetPrice : book,ISBN[0,1] → price, where the numbers inbracket denote minOccurs and maxOccurs respectively,creates a rule

A � book A � ISBN[0,1] A∪{price} � tA � t

APP.

Map, Optional and List: The rules map (MAP), optional(OPTN) and list (LIST) deal with elements with minOccursand maxOccurs constraints. MAP allows one to map a listof elements of type s into a list of elements of type t giventhat t can be derived from s and the other base-types, andthat the multiplicity constraint on t is more relaxed than the

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 5: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

Let A = {books(book[1,∞])}.

/0A � books(book[1,∞])

MEM

A � book[1,∞]DES

/0A∪book � book

MEM/0

A∪book � ISBN[0,1]OPTN

/0A∪book∪price � price

MEM

A∪{book} � priceAPP

A � price[1,∞]MAP

Figure 4. A derivation using apply, optional and map-rules.

constraint for s. OPTN and LIST allow the creation of an ele-ment with multiplicity constraints from one without.

An example of using some of these rules alongwith the web service getPrice described above to de-rive the target type t = price[1,∞] from the base-type{books(book[1,∞])} is shown in Figure 4.Catenate: CAT is the catenate-rule which allows catenationof children of two nodes if they have the same label.

COSTS OF DERIVATIONS: The cost of each rule is givenby a generic cost function cost. The cost function also de-termines the overall cost of a derivation. Given a deriva-tion D = D1,D2,···Dn

A�t r, where D1, . . . ,Dn are also derivations,if the rule r is not the union-rule, then ‖D‖ = cost(r) +∑n

i=1 ‖Di‖. But in the case of the union-rule, we define‖D‖= cost(r)+max{‖Di‖ : 1≤ i≤ n}.

THE COMPLEXITY OF TYPE-DERIVATION: The com-plexity of the type-derivation problem depends on thederivation rules we consider. In general, the problem isintractable. The source of intractability lies on the XML-centric rules in Figure 3.

Theorem 1. If we consider only the union- and apply-rules,the decision problem of type-derivation is coNP-hard, andif we consider only the catenate-rule with constant cost, theoptimal type derivation problem is NP-hard.

In the following section, a dynamic programming so-lution is presented, and is shown to be optimal in somespecial cases of the type-derivation problem. The algo-rithm performs recursive back-tracking search from the tar-get type back to the base-types. It is shown that the back-tracking algorithm for solving the optimal type-derivationproblem with respect to the structural-rules runs in polyno-mial time. For the intractable cases, we bound the depth ofback-tracking along only the problematic rules.

4 Type-Derivation Algorithms

We present a dynamic programming algorithm, whichwe will refer to as the Back Tracking (BT)-algorithm, forsolving the type derivation problem. It starts with the tar-get type, and back-tracks by considering the possible rulesthat could have been used until a complete derivation planis found. Given the base-types A, and a target type t, weconstruct a set of derivation plans to be considered, denotedby C (A � t), and an optimal plan is simply given by

DoptC (A � t) = argmin

{‖D‖ : D ∈ C (A � t)

}(1)

where ‖D‖ is the cost with respect to the cost model used.The definition of C (A � t) will recursively make use ofDopt(A′ � t ′) of some environment A′ and type t ′. The prop-erties of the algorithm, such as its guaranteed termination,time complexity, and optimality, all depend on the choiceof C (A � t). Therefore, the computation for Dopt

C (A � t) iscompletely characterized by the definition of the consider-ation set C (A � t) and the choice of the cost model. Thestructure of the BT-algorithm is as follows.

opt plan(A:base-types,t:target type)| C = consider(A,t);| if(C = /0) return NON DERIVABLE;| else return minimal plan in C ;consider(A:base-types, t:target type)| construct C with recursive calls to opt plan();| return C ;

In the subsequent sections, we describe the details of proce-dure consider().

First we show how the case of structural-rules can besolved by this approach in polynomial time with a choiceof C (A � t) which will be described and analyzed in detail.Then it is extended to encompass the rest of the rules.

DEALING WITH STRUCTURE-TRANSFORMATION RULES:Let us restrict our attention only to the set of structural-transformation rules described in Figure 1. We firstconstruct a set of maximal common-prefix embeddingswhich are defined below. Each embedding corresponds toa derivation, and the derivations of the maximal common-prefix embeddings form the consideration set needed in theBT-algorithm.

Recall that we view types as trees. The tree domaindom(t) of a type t is simply all the paths of the nodes in t.Let s and t be two type trees. We define a partial common-prefix embedding between s and t to be a partial functionθ : dom(s)→ dom(t) such that it satisfies the following con-dition:• θ(0) = 0, i.e., it maps the root of s to the root of t.• ∀p ∈ dom(s), θ(p) is defined =⇒ label(p) = label(θ(p)),

i.e., θ maps only nodes of s to nodes of t with the same label.• ∀p ∈ dom(s), θ(p) is defined =⇒ θ(parent(p)) =

parent(θ(p)), i.e., θ preserves the parent-child relation.

Partial common-prefix embeddings are henceforth re-ferred to simply as embeddings. Computing embeddingsbetween s and t can be done recursively in a straightforwardway: start at the root nodes of s and t, stop if they differ inlabeling or multiplicity, otherwise, build all possible match-ings of the children nodes, and continue with the matchedpairs.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 6: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

000 001 002

0010 0011

010

0110 0111 0000

010 011

Figure 5. A partial common-prefix embedding froms (left) to t (right).

The embedding θ is similar to the edit-script mappingfound in tree-to-tree editing [3]. However, here, we do notallow relabeling tag names of XML elements in one oper-ation. In order to perform this, one must first derive thechildren of the element (e.g. destruction), followed by aconstruction using the new tag name.

The domain of θ is dom(θ) = {p ∈ dom(s) :θ(p) is defined}, and the range of θ is rng(θ) = {θ(p) : p ∈dom(θ)}. For the same types s, t, we say that one embed-ding θ1 is greater than another θ2 if dom(θ1) ⊃ dom(θ2).The non-embedded nodes of s are dom(s)− dom(θ), andsimilar for t are dom(t)− rng(θ). The frontier of θ in s,Fr(θ|s) is the most upper non-embedded nodes in s. Thefrontier Fr(θ|t) in t is similarly defined.

Example: Consider the two types s and t shown in Fig-ure 5. The embedding θ in Figure 5 is θ = {0 �→ 0, 00 �→01, 01 �→ 00, 011 �→ 000}. Its domain and range aredom(θ) = {0,00,01,011}, rng(θ) = {0,00,01,000}. Thefrontiers are Fr(θ|s) = {000,001,002,010,0110,0111} andFr(θ|t) = {0000,010,011}.

Next we show how to recursively construct a deriva-tion from an embedding. Assuming that we already havethe optimal derivations Dopt(A � s) and {Dopt(A � t|p) :p ∈ Fr(θ|t)} with some environment A, we form the planDs,θ(A � t) that simulates the procedure in Figure 6. Thederivation plan Ds,θ(A � t) works as follows. First it gen-erates s using Dopt(A � s) (line 1). Next, Ds,θ replaces topunwanted nodes in s by nodes in t that share the same par-ent (lines 5-6). Then it adds the rest of the top nodes int that are not already in s after the replacements (line 7).Finally, it removes all the unwanted nodes in s not alreadyreplaced (lines 11-13). It is easy to see that Ds,θ(A � t) onlyneeds replace, delete and insert-rules and the derivationsDopt(A � s), Dopt(A � t|p) for p ∈ Fr(θ|t).Example: For the embedding of Figure 5, the fron-tiers are Fr(θ|t) = {t.0000, t.010, t.011} and Fr(θ|s) ={s.000,s.001,s.002,s.010,s.0110,s.0111}.

For p = t.0000, parent(p) = t.000. So θ−1(parent(p)) =s.011, and childn(θ−1(parent(p))) ∩ Fr(θ|s) ={s.0110,s.0111} which is non-empty. Therefore, accordingto the procedure in Figure 6, Ds,θ(t) non-deterministicallypicks s.0110 to be replaced by t.0000. The sub-type s.0111is deleted since it is not replaced by anything.

1 derive s from A using Dopt(A � s)2 for each p ∈ Fr(θ|t),3 derive t|p from A using Dopt(A � tp)4 if childn(θ−1(parent(p)))∩Fr(θ|s) �= /0, then5 pick one from childn(θ−1(p))∩Fr(θ|s), say q,6 replace q with t|p in s to derive s[t|p]q.7 else8 add t|p under the node θ−1(parent(p)) in s.9 end if10 end for11 for each q ∈ Fr(θ|s) not deleted or replaced,12 delete the sub-type q from s.13 end for

Figure 6. The procedure equivalent to Ds,θ(A � t).

This is repeated for all the other nodes in Fr(θ|t), and tosummarize, we get a plan Ds,θ(A � t) as follows.1. Replace s.0110 with t.0000, s.000 with t.010, s.001 with t.011.2. Remove s.002, s.010 and s.0111.3. No inserts are needed in this case.

At last, we construct the consideration set. The plansDs,θ(t) provide the building blocks to the consideration setfor the structural-rule set. Given the base-types in the en-vironment A, and the target type t = c(t1, t2, · · · , tn), theconsider() procedure constructs the consideration set asfollows:

C STR(A � t) ={

Dopt(A�t1) ··· Dopt(A�tn)A�t CON

}∪{

Ds,θ(t) : s is a subtype in A not under optional-node and θ ∈ CP(s, t)}

where CP(s, t) is the set of all maximal partial common-prefix embeddings from s to t. In C STR(A � t), we considerthe case that the type t is derived by construction from itschildren ti (each of which is optimally derived by Dopt(A �ti)), and the cases in which t is derived from a subtype s in Aby structural modification, which are the plans {Ds,θ(A � t)}.

The derivation of s in Ds,θ(t) simply consists of ap-plications of the membership- and destruct-rules. Notethat the construction of Ds,θ(t), and thus the definition ofC STR(A � t) make use of Dopt(A � t). So computation forDopt(A � t) is recursive.

Proposition 1. If the children of every node in A have dis-tinct labels, then the recursive computation of Dopt(A � t)using C STR(A � t) terminates in polynomial time.

Proposition 2. The recursive computation for Dopt(A � t)based on C STR(A � t) terminates and is the optimal deriva-tion if only structural-transformation rules with constantcosts are allowed.

An immediate corollary is that the optimal derivationproblem with respect to the structural-rules with constantcosts that satisfies the distinct labels among siblings can besolved exactly in polynomial time.

DEALING WITH THE COMPLEX RULES: To incorporatethe complex-rules, we augment the consideration set C (A �

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 7: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

t) to incorporate the union-, apply-, optional- map- and thecatenate-rules.

C UNI(A � t) =

{DDES(A � union(ti)) Dopt(A∪{ti} � t)

A � tUNI

},

where union(t1, t2, · · ·) are the union sub-types found inA that are not under some other union. The derivationDDES(A � union(t1, t2, . . .)) is simply to obtain the sub-type union(t1, t2, . . .) from A using the membership and de-struct-rules.

C APP(A � t) =

{Dopt(A � si) Dopt(A∪{t j} � t)

A � tAPP

}

for each available web service f : s1,s2, · · · → t1, t2, . . .where t is one of the output types.

If the target type is a type with multiplicity constraints,i.e., of the form t[m,n], then consider also C MAP and C OPT:

C MAP(A � t[m,n]) =

{DDES(A � s[m′,n′]) Dopt(A∪{s} � t)

A � t[m,n]MAP

}

where s[m′,n′] are nodes with minOccurs and maxOccursconstraints in A that are not under an optional node and withm≤ m′ and n′ ≤ n.

We also consider all ways that t[m,n] can be created bythe optional- and list-rules.

C LIST(A � t[m,n]) =

⎧⎪⎪⎨⎪⎪⎩

{/0

t[0,n] OPTN,Dopt(A�t)

t[0,n] LIST}

for m = 0,{Dopt(A�t)

t[0,n] LIST}

for m = 1,

/0 else.

Thus, the augmented consideration set is:

C = C STR ∪C UNI∪C APP∪C MAP ∪C LIST.

A straightforward recursive computation of Dopt(A � t)does not terminate using the new consideration set since thesame rule can potentially be applied repeatedly. The solu-tion is to maintain a set of avoidance rule set X during theback-tracking recursion – these are the rules not to be ap-plied. So, during the back-tracking, we compute an optimalplan D(A � t|¬X ) that does not use any rules in X . Forinstance, for C APP the recursive definition for the consider-ation sets changes as follows. Let X ′f = X ∪{APP( f )} bethe new avoidance set that avoids applying the web servicef .

C APP(A � t|¬X ) =

{Dopt(A � s1|¬X ′f ) Dopt(A � s2|¬X ′f ) · · ·

A � tAPP

}

The avoidance set is augmented in a similar way whenwe back-track along the other rules. The avoidance setincreases strictly during recursion, so the evaluation ofDopt(A � t|¬ /0) is guaranteed to terminate, but in generalin exponential time, which is clearly not acceptable in prac-tice.

The exponential blow-up is due to the branching whenback-tracking along the union- and apply-rules. So, we sim-ply terminate the back-tracking if the number of avoidedweb services or unions reach their respective bounds. Thiscorresponds to finding a derivation plan that does not makeuse of consecutively composed web services of lengthgreater than the bound. The variation of the BT-algorithm

Nmaxws : the bound on service composition,

Nmaxuni : the bound on number of union-rules.

1 opt plan(A,t,X )2 | if num of APP in X > Nmax

ws or3 | num of UNI in X > Nmax

uni then4 | return NON-DERIVABLE;5 | end if6 | C = consider(A, t, X );7 | return minimal cost plan in C ;

8 consider(A, t, X )9 | C = C STR(A � t)∪C UNI(A � t|¬X )

∪ C APP(A � t|¬X );10 | if t = t ′[m,n] then11 | C = C ∪C MAP(A � t)∪C LIST(A � t)12 | end if13 | return C

Figure 7. The BT-bnd-algorithm

with bounded back-tracking will be referred to as the BT-bnd-algorithm. The overall BT-bnd-algorithm is shown inFigure 7. Note, if we remove lines 2–5 in Figure 7, we getback the BT-algorithm.

It is possible to extend the BT-algorithm to deal withmore rules in case additional derivations rules become avail-able. For instance, we have not included the catenate-ruleas part of the search, but it can be easily added by, yetagain, augmenting the consideration set C (A � t). Givent = c(t1, t2, · · · , tn), one can consider all plans of the form

Dopt(A � c(ti1 , ti2 , · · ·)) Dopt(A � c(t j1 , t j2 , · · ·))

A � c(t1, t2, · · · , tn)CAT,

where {i1, i2, · · ·}∪{ j1, j2, · · ·} = (1..n). Of course, thereare exponentially many such plans to consider, so a similarbounded search heuristics is needed. We expect that this canbe done for a great deal of other potentially useful rules.

AN APPLICATION OF THE ALGORITHM: We presentsample results from the application of our techniqueson real services. We have applied the BT-algorithmto a group of real internet web services described inws.strikeiron.com/USDAData?WSDL. This group of webservices provide access to a database of nutrient content ofvarious food items. There are five services available offer-ing searching and querying access to the database. Due tospace limitation, only the relevant types and operations (ser-vices) in the WSDL file are shown in condensed form inFigure 8.

A useful service is to retrieve nutrient information bykeyword search. This service is not immediately of-fered. However, the service SearchFood provides key-word search service but returns NDBNumbers as partof its output type SearchFoodByDescriptionResult,and service CalcNutrients accepts NDBNumber aspart of its input type and returns the nutrient con-tent in its output type. We let the base-types beA = {FoodKeywords(str),License(str)}, and the targett = MyNutrient[0,inf](NutrientValues[0,inf]). EachMyNutrient element consists of a list of NutrientValues

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 8: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

SearchFoodByDescription

FoodKeywords [0:1] FoodGroupCode [0:1]

String String

SearchFoodByDescriptionResult

ArrayOfSearchByKeywordsOutput

SearchByKeywordOutput [0:Inf]

NDBNumber LongDescription FoodGroupCode

CalculateNutrientValues

LicenseKey NDBNumber [0:1] AmountInGrams [0:1] AmountInOunces [0:1]

CalculateNutrientResult

NDBNumber LongDescription NutrientValues [0:Inf]

NutrientNumber Description Unit GramValue

(a)

(b)

(c)

(d)

CalculateNutrientValuesResponse

SearchFoodByDescription SearchFoodByDescriptionResultSearchFood :

CalculateNutrientValues CalculateNutrientValuesResponseCalcNutrients :

(e)

Figure 8. The types and services in the WSDL file

corresponding to the nutrients in some given food item. Aderivation D(A � t) is a composition that allows one to ob-tain a collection of MyNutrient elements from a keywordrepresented by FoodKeywords.

The cost function cost(r(A � t)) that is used assigns acost to each rule r and is insensitive to the environmentA and the target type t. We assign a heavy cost to penal-ize the construct-rule and the optional-rule, a small cost tofavor web services’ apply-rule, and equal cost to all otherrules. This cost-model reflects the preference of derivationsfor using composition of web services to produce the targettype rather than using the structural-rules. One can exploremore complicated cost-functions depending on the applica-tion, but in our experiments, this rather simple choice ofthe cost function allows the BT-algorithm to successfullyreport correct derivations; in this case D(A � t), makes useof the two web services and some structural constructionand destruction. Similar results were obtained for other webservice scenarios as well; we omit them, as well as detailedtraces of the execution of the algorithm, due to space con-straints.

5 Experimental evaluation

This section describes in detail the set of experimentsperformed to evaluate the performance of the algorithms.We have implemented the BT- and BT-bnd-algorithmbased on the recursive definition of Dopt(A � t|¬X ), andC (A � t|¬X ). Some optimizations in the search are pos-sible. For instance, when backtracking along the union-rulein C UNI, we test to see if Dopt(A∪ {ti} → t) ever explic-itly uses ti in deriving t. If not, then one concludes that the

environment A is sufficient for deriving t optimally, thus,the algorithm can safely avoid considering deriving t fromunion(ti) using the union-rule.

In order to improve efficiency, we amortize the recursioncalls by maintaining an index of intermediate environmentsA, types t and the avoidance set X and the correspondingderivation Dopt(A � t|¬X ).

The parameters that are varied are the size of the base-types and the target type, the number of web-services, thenumber of union-nodes found in the base-type, and finallythe number of minOccurs, maxOccurs constraints in thebase-types. We examine the scalability of the algorithm interms of the number of derivation plans considered duringthe search, the total run-time, and the memory consumption.The base-types are generated based on the schemas speci-fied in the benchmark XBench [19]. However, in order toexplore the scalability of the algorithm, for larger datasetsand have ample flexibility in varying parameters of inter-est, we introduce additional nodes to these schemas. Theadditional tag names are randomly sampled from a fixed al-phabet set, and the number of children of each node is ran-domly determined to be between 3-10. These experimentsare carried out on a SunFire V440 server running Solaris 8with SPECInt2k rated at 703 and SPECFP2k 1054.SIZE OF INDIVIDUAL BASE-TYPES: We use 5 base-typesbased on the XBench schemas. The target type is a treeconstructed based on the base-types by combining randomsub-types from the base-types. The size of the target typeis fixed at 50 nodes. There are no union-nodes, and thereare 2 single-input web-services. The parameter we vary isthe size of the individual base-types – from 10 nodes to 500nodes. We also vary the size of the web-services’ input-typeand output-type from 5 to 50 nodes.

Figure 9(a) shows that the number of plans generatedduring the search has a polynomial growth with respect tothe size of the base-types. Note that when the size of theindividual base-types exceeds 200 nodes, the rate of growthchanges from polynomial of higher power to nearly linear.The explanation is that since we have a fixed-size alphabet,repeated sub-types are found more frequently in the base-types, thus, more of the plans in the consideration set arefound in the index, and do not need to be generated fromscratch. Increasing the input/output type size for the webservices uniformly increases the number of plans consid-ered.

Figure 9(b) shows the run-time of the search with respectto the individual base-type size. The run-time increasespolynomially with respect to the individual base-type size.Note that it does not display the near-linear behavior forlarger base-types as was the case for the number of plansconsidered. This is due to the fact that the overall run-timeincludes time for both building the common-prefix embed-dings and build the sub-plans. While the index helps ingenerating the sub-plans, building the common-prefix em-bedding still requires traversing through the base-types, andremains dominant.

Figure 9(c) shows the total memory used during the

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 9: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

(a) Number of plans generated (b) Run-time (c) MemoryFigure 9. Performance with increasing base-type size

Figure 10. The performance with increasing number of base-types

Figure 11. Performance with increasing target size

search. Note that it is proportional to the number of plansgenerated. When the web-services have larger input-outputtypes, the BT-algorithm needs to consider more plans, thustaking time and memory. However, the trend remains con-sistent.NUMBER OF BASE-TYPES: We also measure the perfor-mance with respect to the number of base-types. The num-ber of base-types is increased from 1 to 200 while the size isfixed at 20 nodes. The other parameters remain unchanged.The number of plans generated, run-time and memory us-age are shown in Figure 10. We see again the effect of thechanging rate of growth for the number of plans generated.This is due to exactly the same reason as in Figure 9(a).

From Figure 9 and Figure 10, we verify that the per-formance is polynomial with respect to the size of the setof base-types, and is quite capable of handling large base-types, as well as a large number of them.SIZE OF TARGET TYPE: For the next set of measure-ments, we fix the size of base-types to 20 nodes, and utilize10 base-types for derivation. Again, the number of web-services is fixed at 2. We vary the target type size from 10to 1000 to explore the performance when deriving a very

large target type. The number of plans generated and therun-time are shown in Figure 11. Observe, that unlike thecase of increasing base-types, the run-time becomes nearlylinear when the target type size exceeds the threshold of200. Since the target is so much larger than the base-typesand the outputs of the web-services, construction time forthe common-prefix tree is determined by the base-type sizeand no longer by the target type size. In conclusion, the BT-algorithm scales well with respect to the size of the target-type.NUMBER OF WEB-SERVICES: We study the performancecharacteristics in terms of larger number of web-services.The base-types are fixed at 50 nodes, 5 base-types, and thetarget type is fixed at 50 nodes. The number of web-servicesis varied from 1 to 10. Each web-service has 2 input types,each with 5 nodes. We compare the performance of the BT-algorithm and BT-bnd-algorithm with different bounds onthe composition length.

Figure 12(a) shows the number of plans considered bythe BT-algorithm. As was shown in Section 4, an exponen-tial number of plans are generated. By bounding the lengthof the composition to 3, only a polynomial number of plansare generated as it is shown in Figure 12(b) (1). Also in Fig-ure 12(b) (2), we show the effect of increasing the length ofthe bound from 3 web-services to 4; it is evident that theperformance implication is marginal and the overall trend islow polynomial. We also change the number of input typesto 4, and bound the composition length to 3 as shown in thesame figure (3), again observing a smooth performance im-plication. The absolute run-time for the experiments in thisfigure are shown in seconds in Table 1.NUMBER OF union-NODES: Another source of exponen-tial explosion for the search space is the number of union-nodes in the base-type. The next set of results show theperformance with increasing number of union-nodes for

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

Page 10: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Syntactic

(a) The BT-algorithm w. increasing number ofWS

(b) The BT-bnd-algorithm w. increasing num-ber of WS

(c) The BT- and BT-bnd-algorithm w. in-creasing unions

Figure 12. The performance of BT- and BT-bndalgorithms

Num of W.S.: 2 5 8 10(2) Bnd-WS = 3, Num of Inp = 2 32 51 69 82(3) Bnd-WS = 4, Num of Inp = 2 32 57 90 108(4) Bnd-WS = 3, Num of Inp = 4 38 66 94 113

Table 1. Vary number of web services

Num of UNION-nodes 2 5 8 10Bnd-UNI = 2, Num Children = 3 2 15 44 74Bnd-UNI = 3, Num Children = 3 2 51 172 263Bnd-UNI = 2, Num Children = 5 3 6 53 89

Table 2. Vary number of union nodes

the BTand BT-bnd-algorithms. We also vary the num-ber of children of the union-nodes. The number of plansgrows exponentially with respect to the number of unionsin the unbounded case, as expected, and polynomially inthe bounded case as shown in Figure 12(c). Correspondingabsolute run time in seconds is shown in Table 2.

The experiments demonstrate that the algorithms arescalable with respect to the size of base-types and the tar-get type, and sensitive to rules such as apply and union thatrequire multiple backtracking matching our analytical ex-pectation. These rules create an exponential explosion tothe amount of search required. Placing bounds on the com-position length and the number of union nodes to realisticvalues, make the problem tractable as demonstrated by ourexperiments.

6 Conclusions

We have presented a rule-based framework for web ser-vice composition. The problem of web service composi-tion is seen as a optimal type derivation problem which isshown to be intractable in general. We have characterizeda useful tractable class, and proposed suitable heuristics forthe general case. The experimental results demonstrate thefeasibility of our approach.

As part of on-going research, we are interested in ex-perimenting with diverse cost models and rule sets, as wellas incorporating semantics into our framework in order toimprove the quality of the composition.

References

[1] B. Benatallah, Q.Z. Sheng, A.H.H. Ngu, and M. Dumas. Declarative composi-tion and peer-to-peer provisioning of dynamic web services. In ICDE, 2002.

[2] Daniela Berardi, Diego Calvanese, Giuseppe De Giacomo, Massimo Mecella,and Rick Hull. Automatic composition of transition-based semantic web ser-vices with messaging. In VLDB, 2005.

[3] Philip Bille. Tree Edit Distance, Alignment Distance and Inclusion. Technicalreport TR-2003-23 in IT University of Copenhagen, 2003.

[4] Sudarshan S. Chawathe and Hector Garcia-Molina. Meaningful change detec-tion in structured data. In SIGMOD, 1997.

[5] A. Deutsch, L. Sui, and V. Vianu. Specification and Verification of Data DrivenWeb Services. In PODS, 2004.

[6] X. Dong, A. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity Searchfor Web Services. In VLDB, 2004.

[7] J. Madhavan, P. Bernstein, and E. Rahm. Generic Schema Matching with Cu-pid. In VLDB, 2001.

[8] S. Melnik, P. A. Bernstein, A. Halevy, and E. Rahm. Supporting executablemappings in model management. In SIGMOD, 2005.

[9] S. Narayanan and S. McIIraith. Simulation, Verification and Automated Com-position of Web Services. In WWW, 2002.

[10] N. Onose and J. Simeon. XQuery at Your Web Service. In WWW, 2004.

[11] Massimo Paolucci, Takahiro Kawamura, Terry R. Payne, and Katia P. Sycara.Semantic matching of web services capabilities. In ISWC, 2002.

[12] Shankar R. Ponnekati and Armando Fox. SWORD: A Developer Toolkit forWeb Service Composition. In WWW, 2002.

[13] Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernandez, andRonald Fagin. Translating web data. In VLDB, 2002.

[14] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automaticschema matching. In VLDB Journal, volume 10, pages 334–350, 2001.

[15] Erhard Rahm, Hong-Hai Do, and Sabine Mabmann. Matching large XMLschemas. In SIGMOD Record, volume 33, pages 26–31, 2004.

[16] Dennis Shasha and Kaizhong Zhang. Fast algorithms for the unit cost editingdistance between trees. In J. Algorithms, volume 11, pages 581–621. AcademicPress, Inc., 1990.

[17] Hong Su, Harumi Kuno, and Elke A. Rundensteiner. Automating the transfor-mation of XML documents. In WIDM, 2001.

[18] A. Wombacher, Peter Fankhauser, B. Mahleko, and E. Neuhold. Matchmakingfor business processes based on choreographies. In IEEE International Confer-ence on e-Technology, e-Commerce and e-Service (EEE), 2004.

[19] B.B. Yao, M.T. Ozsu, and N. Khandelwal. XBench benchmark and perfor-mance testing of xml dbmss. In ICDE, pages 621–632, 2004.

[20] L. Zeng, B. Benatallah, M. Dumas, J. Kalagnanam, and Q. Z. Sheng. Qualitydriven web services composition. In WWW Conference, 2003.

[21] Kaizhong Zhang, Jason T. L. Wang, and Dennis Shasha. On the Editing Dis-tance between Undirected Acyclic Graphs and Related Problems. In Interna-tional Journal of Foundations of Computer Science, 1995.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 8-7695-2570-9/06 $20.00 © 2006 IEEE


Top Related