

Using SVM and Concept Analysis to support Web Service Classification and Annotation

Marcello Bruno, Gerardo Canfora
RCOST - Research Centre on Software Technology
University of Sannio, Department of Engineering
Palazzo ex Poste, Via Traiano
82100 Benevento, Italy
[email protected], [email protected]

Massimiliano Di Penta, Rita Scognamiglio
RCOST - Research Centre on Software Technology
University of Sannio, Department of Engineering
Palazzo ex Poste, Via Traiano
82100 Benevento, Italy
[email protected], [email protected]

ABSTRACT

The need for supporting the classification and semantic annotation of services constitutes an important challenge for service–centric software engineering. Late–binding and, in general, service matching approaches require services to be semantically annotated. Such a semantic annotation may, in turn, need to be made in agreement with a specific ontology. Also, a service description needs to properly relate to other similar services.

This paper proposes an approach to i) automatically classify services into specific domains and ii) identify key concepts inside service textual documentation, and build a lattice of relationships between service annotations. Support Vector Machines and Formal Concept Analysis have been used to perform the two tasks. Results obtained by classifying a set of web services show that the approach can provide useful insights in both the service publication and service retrieval phases.

1. INTRODUCTION

One of the most relevant advantages of service–centric software engineering is the possibility a developer has to build his/her own system as a composition of one or more abstract services, i.e., semantic descriptions that can be matched at run–time with the description of one or more concrete services. The subsumption relationship between an abstract service and the concrete services is completed by means of matching algorithms integrated in the service broker [21]. The choice of the actual concrete service to bind to an abstract service can also consider the concrete services' Quality of Service (QoS) attributes [33].

The above described scenario requires that each service have a semantic description, according to a specific ontology¹. Service semantic annotation is, however, a difficult task that, given the actual state–of–the–art, is often too expensive to be done in practice. Also, the building and maintenance of ontologies requires expertise and budgets that are not always available. Unfortunately, very often the only source of information available is a pure–textual description of the service, sometimes extracted from source code comments.

During service publication, it would therefore be useful to exploit this form of textual information to:

• permit an automatic classification of services to be published according to the broker's service ontological classification. Even when this activity can easily be performed manually by the service publisher, the automatic classification can provide feedback indicating whether or not the service textual description is meaningful with respect to the class the service belongs to;

• support the building and maintenance of domain–specific ontologies. When a set of new services is going to be published, the related domain–specific ontology needs to be built, if it does not yet exist. When such an ontology is already available, the publication of a new service could add new concepts, and therefore trigger the need for updating the ontology;

• aid the semantic annotation of a service with respect to the ontology. By detecting concepts inside the service textual documentation, it would be possible to see how the service concepts can be identified in the ontology, and how the service can be cataloged with respect to other existing services. For example, it should be possible to see if, according to its textual description, a service appears more specific, more generic, or perhaps alternative to existing ones. Again, if the service publisher realizes that the extracted concepts, or the classification of the service with respect to others, are meaningless, ambiguous or inconsistent, then the service description needs to be corrected in some way.

¹ There is work investigating the possibility of matching between services described with different ontologies. This aspect, however, is out of scope for this paper and will not be further considered.


The usefulness of a semi–automatic support for service classification and annotation is not limited to the service publication phase. In fact, it can also be used during service retrieval. Let us suppose that a service integrator is querying the broker (sending a free–text query) to search for a service performing a particular task. Such an automatic classification mechanism can be applied to free–text queries to:

• identify the category (or the scored list of categories) in whichany service matching the query can be found;

• ease the browsing among the available services, once the service integrator chooses a category. As will become clearer later, the relationships between different services belonging to a specific domain can be represented, with some simplifications, using a concept lattice. Thus, it would be useful to develop a mechanism able to identify the lattice region in which the service the integrator is searching for could be found.

This paper proposes an approach that, starting from service textual descriptions, performs an automatic classification (to catalog services across specific domains, such as telecommunications, finance, etc.), and then identifies service key concepts and their relationships as a concept lattice. The approach relies on Support Vector Machines (SVM) [13] and Information Retrieval (IR) vector spaces for service classification, and uses Formal Concept Analysis (FCA) [30] to build concept lattices from service descriptions.

To perform a preliminary assessment of the proposed approach, a set of 205 services, manually classified into 11 categories, was downloaded from the net [4] and from some Universal Description, Discovery and Integration (UDDI) registries. Then, the classification approach was applied to these services. The results showed that the approach was able to correctly classify, at the first attempt, 63% of the services. Considering a scored list of the first three categories to which a service may belong, we found that 83% of the services were present in the scored list. After the classification, we restricted our attention to a particular category of services, trying to build a concept lattice from service descriptions. The results showed that, even if a totally automatic construction of the lattice is not feasible, FCA still provides aids and useful insights to help the publisher annotate the service and, when necessary, maintain the ontology.

The remainder of the paper is organized as follows. First, Section 2 provides an overview of the related literature and available tools, while Section 3 gives, for completeness' sake, some basic notions on SVM, IR vector spaces and FCA. Then, Section 4 describes the proposed approach and its application scenarios. The first results obtained are presented and discussed in Section 5. Finally, Section 6 concludes.

2. RELATED WORK

The work presented in this paper deals with different specific problems: automatically classifying web services on the basis of their textual documentation using the SVM method, supporting the building or maintenance of domain–specific ontologies, and detecting their concepts inside the service textual documentation. The main purpose is to annotate a service with respect to an already existing ontology, completing the ontology with new concepts extracted from the service, as well as cataloging the service with respect to other existing services.

Text classification has seen a great deal of success with the application of machine learning, addressed by several studies [8, 13, 17, 23, 31, 32]. Among the many learning algorithms, SVM [28] appears to be the most promising. The first application of SVM to text classification was presented by Joachims [13]. The results were also confirmed by several other studies [13, 14, 32]. Joachims et al. [15] developed a theoretical learning model of text classification for SVMs, which provides some explanation of SVM performance in text classification.

The manual construction and maintenance of domain–specific ontologies is an expensive and complex task, requiring a significant amount of effort and time, as well as a detailed knowledge of the domain to be modeled. Fridman Noy et al. [19] describe the knowledge model of Protege 2000 [3], an ontology–editing and knowledge–acquisition environment. Tao [26] developed a Protege 2000 FCA–based plug–in for the building and maintenance of ontologies.

Concept lattices are particularly useful to discover the "hidden semantics" behind a set of data. In fact, a lattice makes the conceptual structure visible and accessible, determining the keywords which relate different data items. Such relationships may, in turn, exhibit patterns, regularities, exceptions, etc. Overall, the lattice can help in ontology processing tasks such as building, structuring, refining, merging, mapping, etc.

Up to now, some work has been done in the field of automatic support for ontology building. An example of using FCA in ontology merging has been proposed by Maedche et al. [18]. However, few papers have investigated the possibility of using FCA in ontology building and structuring. Cimiano et al. [20] discuss how FCA can be used to support ontology engineering and how ontologies can be exploited in FCA applications. They present the "FCA-Merge" method for merging ontologies following a bottom–up approach. Hele-Mai Haav [11] presented an approach, based on Natural Language Processing (NLP), for the automatic or semi-automatic discovery of domain-specific ontologies from free text.

Kim and Compton [16] propose an ontology browsing mechanism relying on FCA and incremental knowledge acquisition mechanisms. The JBraindead Information Retrieval System [10] combines a free–text search engine with FCA to organize the results of a query. This work showed that concept lattices can be very useful to group relevant information in a free–text search task.

3. BACKGROUND

This section reports some background notions on the main mathematical tools used in our work, namely Vector Spaces, SVM, and FCA.

3.1 Vector Spaces

Vector Space IR models map each incoming document and each query onto a vector [12]. In our case, each element of the vector corresponds to a word (or term) in a vocabulary extracted from the service textual documentation. If |V| is the size of the vocabulary, then the vector [d_i,1, d_i,2, ..., d_i,|V|] represents the document D_i. The j-th element d_i,j is a measure of the weight of the j-th term of the vocabulary in the document D_i. Different measures have been proposed for this weight. We use a well known IR metric called tf-idf [22]. According to this metric, the j-th element d_i,j is derived from the term frequency tf_i,j of the j-th term in the document D_i and the inverse document frequency idf_j of the term over the entire set of documents.


The vector element d_i,j is:

d_i,j = tf_i,j * log(idf_j)

Compared to simple word frequency, the tf-idf metric permits filtering out both low frequency words (not relevant) and words appearing in most of the documents (non–discriminant words).
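As a concrete illustration, the weighting above can be sketched in a few lines of Python. The toy document set is made up, and treating the inverse document frequency as N/df_j inside the logarithm is an assumption of this sketch (the paper does not spell out the exact idf form):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of terms) onto a tf-idf vector,
    using d_ij = tf_ij * log(N / df_j); the N/df_j form of the
    inverse document frequency is an assumption of this sketch."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # document frequency of each term over the whole set
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

# made-up service descriptions after preprocessing
docs = [["book", "ticket", "music"],
        ["book", "ticket", "sport"],
        ["weather", "forecast"]]
vocab, vecs = tfidf_vectors(docs)
```

Note that a term appearing in every document gets weight log(1) = 0, which is exactly the filtering of non–discriminant words described above.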

3.2 Support Vector Machines

SVM is a learning algorithm based on the idea of structural risk minimization (SRM) [27] from computational learning theory. Given a labeled set of M training samples (x_i, y_i), where x_i ∈ ℜ^N is the i-th input vector and y_i is the associated label (y_i ∈ {-1, 1}), an SVM classifier finds the optimal hyperplane that correctly separates (classifies) the largest fraction of data points while maximizing the distance of either class from the hyperplane (the margin).

Computing the best hyperplane is posed as a constrained optimization problem and solved using quadratic programming techniques. The discriminating hyperplane is defined by the level set of

f(x) = Σ_{i=1}^{M} y_i α_i k(x, x_i) + b    (1)

where k(·, ·) is a kernel function and the sign of f(x) determines the membership of x. Constructing an optimal hyperplane is equivalent to finding all the nonzero α_i. Any vector x_i that corresponds to a nonzero α_i is a support vector (SV) of the optimal hyperplane. A desirable feature of SVM is that the number of training points retained as support vectors is usually quite small, thus providing a compact classifier.

For a linear SVM, the kernel function is just a simple dot product in the input space, while the kernel function in a nonlinear SVM effectively projects the samples to a feature space of higher (possibly infinite) dimension via a nonlinear mapping function φ from the input space ℜ^n to z = φ(x) in a feature space F, and then constructs a hyperplane in F. The motivation behind this mapping is that it is more likely to find a linear hyperplane in the high dimensional feature space.

SVM is quite simple to use, since it requires few parameters to be tuned. Last but not least, with respect to other supervised approaches, such as Artificial Neural Networks (ANN), SVM requires a small training set. This is relevant when, as for the work presented in this paper, the available training set is not very large.
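Equation (1) can be evaluated directly once the support vectors, their labels, and the multipliers α_i are known. The following sketch does so for hand-picked toy values; the support vectors and α's are hypothetical, chosen only to illustrate the sign-based decision, and the sigmoid kernel form tanh(γ·⟨x,z⟩ + c) is the standard one, not a detail taken from the paper:

```python
import math

def linear_kernel(x, z):
    # dot product: the kernel of a linear SVM
    return sum(a * b for a, b in zip(x, z))

def sigmoid_kernel(x, z, gamma=0.5, c=0.0):
    # tanh(gamma * <x, z> + c): the usual form of the sigmoid kernel
    return math.tanh(gamma * linear_kernel(x, z) + c)

def f(x, svs, labels, alphas, b, kernel=linear_kernel):
    """Eq. (1): f(x) = sum_i y_i * alpha_i * k(x, x_i) + b.
    The sign of f(x) gives the predicted class of x."""
    return sum(y * a * kernel(x, xi)
               for y, a, xi in zip(labels, alphas, svs)) + b

# two hypothetical support vectors, one per class
svs = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]

score = f([2.0, 2.0], svs, labels, alphas, b=0.0)
```

A point on the positive side of the hyperplane yields a positive score, a point on the negative side a negative one.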

3.3 Formal Concept Analysis

FCA is a mathematical tool that allows one to identify groups of objects having common attributes. The first FCA study dates back to 1940, when G. Birkhoff proved that a lattice can be constructed starting from a binary relationship between objects and attributes [7]. FCA can be thought of as the process of searching for "rectangles" in a boolean table representing a relation between objects and attributes. Thus, a concept is a maximal rectangle in a table where column and row permutations are allowed. More precisely [29],

FCA starts with a context, a triple C = (O, A, P), where O is a finite set of objects, A is a finite set of attributes, and P ⊆ O × A is a relation between O and A. If the pair (o, a) ∈ P, it is said that object o has attribute a. Given a set of objects X ⊆ O,

σ(X) := {a ∈ A | ∀o ∈ X : (o, a) ∈ P}

is the set of common attributes while, given Y ⊆ A,

τ(Y) := {o ∈ O | ∀a ∈ Y : (o, a) ∈ P}

is the set of common objects.

A concept is a pair of sets (X, Y) where X ⊆ O is called the extent, Y ⊆ A is called the intent, and Y = σ(X), X = τ(Y). That is, a concept is a maximal collection of objects sharing common attributes. The set of all concepts is denoted by B(O, A, P). Furthermore, a concept (X1, Y1) is a subconcept of another concept (X2, Y2) if X1 ⊆ X2. This imposes a partial order relation on B(O, A, P), written (X1, Y1) ≤ (X2, Y2). The partial order ≤ can be used to build a lattice called the concept lattice, where each node represents a concept. The concept lattice introduces a hierarchical clustering of objects and attributes, where upper concepts factor out common attributes, while lower concepts factor out common objects. More details can be found in [24, 29].
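For small contexts, the closure operators σ and τ lend themselves to a brute-force enumeration of all concepts: close every subset of objects and keep the distinct (extent, intent) pairs. The sketch below is pure Python and exponential in |O|, so it is for illustration only; the two-service context and its attribute sets are assumptions in the spirit of the booking example used later in the paper:

```python
from itertools import combinations

def sigma(X, A, P):
    # common attributes of the objects in X
    return frozenset(a for a in A if all((o, a) in P for o in X))

def tau(Y, O, P):
    # common objects of the attributes in Y
    return frozenset(o for o in O if all((o, a) in P for a in Y))

def concepts(O, A, P):
    """All pairs (extent, intent) with intent = sigma(extent) and
    extent = tau(intent), found by closing every subset of objects."""
    result = set()
    for r in range(len(O) + 1):
        for X in combinations(sorted(O), r):
            Y = sigma(X, A, P)
            result.add((tau(Y, O, P), Y))
    return result

# toy context (attribute sets are assumptions for the example)
O = {"Sport", "FootballSport"}
A = {"book", "ticket", "sport", "football"}
P = {("Sport", a) for a in ("book", "ticket", "sport")} | \
    {("FootballSport", a) for a in ("book", "ticket", "sport", "football")}

lattice_nodes = concepts(O, A, P)
```

The two resulting nodes are ordered by extent inclusion, reflecting the subconcept relation defined above.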

4. APPROACH DESCRIPTION

As stated in the introduction, the proposed free–text service classification approach aims at accomplishing a two-fold task:

1. perform the automatic classification of a service description, i.e., determine to which category/domain a service belongs; and

2. locate a service description in a concept lattice.

The remainder of this section explains in detail the three steps of the classification approach, depicted in Figure 1, namely text preprocessing, service classification, and the construction of the concept lattice for a specific domain.

4.1 Text Preprocessing

The first step aims to preprocess service textual descriptions. Textual descriptions of web services might be in the form of Web Service Description Language (WSDL) documents, coming from UDDI registries, as well as any other textual document provided as documentation support for the service itself. Words are extracted from the documents. For WSDLs, the content of the documentation tag can be used, as well as the message tags. In particular, the latter can be useful assuming that message names and parameters are significant and reflect the service functionalities. Similar assumptions have been made in approaches aiming to recover traceability links between high–level documentation and source code [6].

The extracted words are then preprocessed. Composite words are split; this may be the case for message names (e.g., "BookTicket") where each part of a composite word starts with a capitalized letter. Successively, words are filtered by means of a stop–list, and


Figure 1: The service classification approach

normalized. The stop–list contains articles, prepositions, and in general words that are frequent in each query, and therefore not discriminant ("web", "service" or "SOAP"). During the stemming phase, verbs are brought back to the infinitive, plurals to singulars, etc., using the Wordnet dictionary [5] and its Java API. The Wordnet dictionary is also complemented with a thesaurus incrementally built during service publication.
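The splitting and filtering steps can be sketched as follows. The stop-list is a toy one seeded with the examples given above, and plain lowercasing stands in for the WordNet-based normalization the paper actually uses:

```python
import re

# toy stop-list, seeded with the examples given in Section 4.1
STOP = {"web", "service", "soap"}

def split_composite(word):
    # "BookTicket" -> ["Book", "Ticket"]; all-caps runs like "SOAP" stay whole
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", word)

def preprocess(text):
    """Extract words, split composite names, drop stop-list entries,
    and lowercase. (The paper normalizes further with WordNet and a
    thesaurus; plain lowercasing stands in for that step here.)"""
    words = []
    for token in re.findall(r"[A-Za-z]+", text):
        for part in split_composite(token):
            w = part.lower()
            if w not in STOP:
                words.append(w)
    return words

tokens = preprocess("BookTicket web service")
```

The capitalized-letter rule matches the "BookTicket" example: the message name is split into the two index terms "book" and "ticket".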

4.2 Service Classification

The classification of services into domain–specific classes is performed using the SVM method. In our implementation, the freely available LIBSVM tool [1] was used.

As stated in the introduction, automatic service classification serves both during service publication (to classify the new service) and during service retrieval (to identify the class(es) to which to restrict the focus of the query). In this section's context, both web service documentation and user queries are considered as textual descriptions to be classified (represented as grey arrows in Figure 1).

Prior to applying SVM, the sequences of words obtained in the previous preprocessing phase must be mapped onto vectors. In our approach, the mapping is achieved using IR techniques. Each element of the vector corresponds to a word (or term) in a vocabulary extracted from the documents themselves. All words are weighted with the tf-idf metric. In this way, each document is mapped onto a vector using an injective function. The whole document set is encoded in a matrix, where rows represent documents (vectors) and columns are the weighted words. No information about the position or the meaning of the words is used, i.e., no semantics is captured by this matrix. An alternative approach, used by Di Lucca et al. [9], encodes words as numbers using a dictionary, and then represents documents as vectors of codes. Though simpler and faster, in our case this significantly lowered performance, thus tf-idf was preferred.

A classification task using a supervised algorithm such as SVM or ANN requires a training set. In other words, our SVM needs to be trained with a pre–classified set of documents. This produces a "model matrix" that will be used to "predict" to which class the document to be classified belongs.
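As a practical detail, LIBSVM reads its training data in a sparse text format, one sample per line: the class label followed by ascending, 1-based index:value pairs. A tf-idf matrix row can be converted with a small helper like the one below (the vector values are made up for the example):

```python
def to_libsvm_line(label, vector):
    """Encode one document vector as a LIBSVM training line:
    '<label> <index>:<value> ...', 1-based indices, zeros omitted."""
    pairs = " ".join(f"{j + 1}:{v:g}" for j, v in enumerate(vector) if v != 0)
    return f"{label} {pairs}".rstrip()

# hypothetical tf-idf row for a document belonging to class 3
line = to_libsvm_line(3, [0.0, 1.5, 0.0, 0.25])
```

Omitting the zero entries keeps the encoded matrix compact, which matters when the vocabulary (and hence the vector length) is large.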

4.3 Building the Concept Lattice

Once the service has been classified, or the query redirected to a specific domain, key concepts need to be extracted from the service/query²

and their lattice needs to be built. Clearly, such a lattice only represents a simplification of a domain ontology. FCA's advantage comes from the way it shows how the presence or absence of attributes distinguishes objects, i.e., by means of super–concept/sub–concept relationships. A concept lattice can well represent service names and keywords belonging to a specific domain, highlighting "isA" relationships between concepts and attributes. On its own, an ontology can also contain other complex semantic relationships specifying how one concept is related to another. Nevertheless, we believe that a lattice representation of service concepts can still be useful, as a support for ontology building and for service semantic annotation, as well as to better identify which services can fulfill a query, which ones are more generic, which ones more specific, etc.

Without loss of generality, let us suppose we want to build a concept lattice from a set of service descriptions. First and foremost, we need to identify discriminant words, useful for the lattice. To this aim, we use the idf metric to eliminate words that do not appear in at least two documents (or more, depending on the number of documents/services belonging to that domain/class). These words, although useful (according to the tf-idf definition) for classification, do

² From this point on, we will refer to both simply as "service".


                  book ticket theater music rock opera sport football stadium movie
MusicTicket        X     X             X
RockMusicTicket    X     X             X     X                            X
OperaMusicTicket   X     X      X      X           X
Sport              X     X                               X
FootballSport      X     X                               X      X         X
Cinema             X     X                                                        X

Table 1: Formal context of the booking services domain

not contribute to identifying useful concepts. Once words have been filtered (and thus keywords identified), the context can be identified as the inclusion relationship of keywords in documents. More formally:

A service context is a triple C = (S, K, I) where S is a set of service names (the objects), K is the set of service description keywords (the attributes), and I is the binary relationship which indicates the presence or absence of words in documents.

The obtained lattice may be used to identify concepts for a specific domain, as well as the relationships between services belonging to a class. Such a lattice aids a service publisher when providing service semantic annotation, in that it tries, starting from just a textual description, to discover the service's "hidden semantics".

The following simple example shows how FCA can be used to build such a concept lattice and how such a lattice can aid publisher and integrator activities. The example deals with a set of ticket booking services for different kinds of events, from sport events to movies and theater. Table 1 shows the example's context: S is composed of service names (MusicTicket, RockMusicTicket, OperaMusicTicket, etc.), K is composed of service description keywords (book, ticket, theater, etc.), and the I relationship is represented by an "X".

Figure 2 depicts the lattice obtained from the context of Table 1. It is interesting to point out the hierarchical relationships between different services. For example, Sport actually appears as a parent of FootballSport, while both derive from the concept sport. By adding a new service and refining the existing ones, the structure of the lattice can be reformulated incrementally and automatically, without changing the formal concepts.
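The parent/child relationships read off Figure 2 can be checked mechanically from the context: a service lies below another in the lattice exactly when its keyword set includes the other's (smaller extent, larger intent). A minimal sketch, with attribute sets assumed from the row descriptions of Table 1:

```python
# assumed attribute sets for three of the Table 1 services
context = {
    "Sport": {"book", "ticket", "sport"},
    "FootballSport": {"book", "ticket", "sport", "football", "stadium"},
    "Cinema": {"book", "ticket", "movie"},
}

def is_descendant(child, parent):
    """True when 'child' lies at or below 'parent' in the concept
    lattice, i.e. the child's keywords include all of the parent's."""
    return context[parent] <= context[child]

football_under_sport = is_descendant("FootballSport", "Sport")
```

This is the same subconcept test used in Section 3.3, expressed on intents rather than extents.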

5. EMPIRICAL STUDY

To validate and gain insights into the usefulness of the proposed approach, we performed an experiment aiming to classify a set of web service documentations, and to build lattices for services belonging to some particular classes/domains. Results are presented and discussed in this section.

5.1 Case Study Description

Getting a suitable and extensive case study for experiments dealing with web services is still a challenge. Although, at the time of writing, several UDDI registries exist and are available for querying, too often the set of services obtained is almost useless. Either the services are trivial, or their documentation is dummy. In our experience, many of the available service documentations contain sentences such as "this is a test" or "Use '0000' as password". Another problem was that the number of words in a service description was almost always below 20, a large percentage of which were not relevant for classification.

Figure 2: Ticket domain lattice

We used, as a case study, a set of pre–classified services available on the net [4] and downloaded from some UDDI registries. Such a set was composed of 205 services, classified in 11 classes, representing domains such as news, weather, credit card, etc. The service distribution across classes is shown in Figure 3. As said, each service is provided with a short description, extracted from the WSDL <documentation> tag. For example, a Credit Card Web Service presents a description as follows:

"Will accept and validate a Credit Card Number. Returns True for a Valid Number and Returns False for a Invalid Number"

5.2 Service Classification Results

SVM service classification performance was measured using leave–one–out validation [25]: each document (vector) in the set (matrix) was classified using an SVM model built using the remaining ones, and the percentage of correct classifications was measured. We found that different performances can be achieved by properly calibrating the SVM parameters, namely the kernel function and the gamma parameter. In our case study, the highest precision has been achieved using the sigmoid kernel, which outperformed the linear kernel.

Figure 3: Histogram of service categories

Beyond the first classification obtained for each service, we let the SVM find alternative classifications, for which we also measured the correct classification ratio, incrementally with respect to the first classification ratio. This permits obtaining, for each service, an ordered list of the classes to which the service has the highest likelihood of belonging, according to our classification algorithm. In other words, the algorithm ensures that the service belongs, with a given likelihood, to the first class, and with a higher likelihood to one of the first two classes, etc.

Figure 4 reports, for our case study, the leave–one–out validation performance for the first three classes in the score. As shown, for the first score position, the precision is about 63%. This is not that high (up to 84% was obtained for software maintenance ticket classification by Di Lucca et al. in [9]), but it is reasonable, considering the quality and quantity of the training set, and that the approach had to classify across 11 classes. When looking further, at the best–two and best–three class scores, we found that, clearly, performance increases, to 73% and 83% respectively. In conclusion, although the approach could not always suggest the correct class, it is at least able to indicate a limited group of classes to which the service could belong.
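The leave–one–out procedure itself is classifier-agnostic: hold out one sample, train on the rest, and count the correct predictions. The sketch below uses a trivial 1-nearest-neighbour model standing in for the SVM, and made-up one-dimensional points, purely to show the validation loop:

```python
def leave_one_out(samples, labels, train_and_predict):
    """Classify each sample with a model built from all the others,
    and return the fraction of correct classifications."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def one_nn(train_x, train_y, x):
    # 1-nearest-neighbour stand-in for the SVM model of the paper
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
    return train_y[dists.index(min(dists))]

accuracy = leave_one_out([[0.0], [0.1], [1.0], [1.1]], [0, 0, 1, 1], one_nn)
```

Leave-one-out makes the most of a small pre-classified set, which matters here given the limited training data discussed above.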

5.3 Building Service Concept Lattices

After classification was performed, and thus the services belonging to each category identified, we built a concept lattice for each category using FCA. As described in Section 4, words having a high idf were pruned, in that they are considered not relevant for building the concept lattice.

Figure 4: Precision of SVM service classification for the best one, two and three classes

Figure 5 shows a lattice obtained by applying FCA to the documents belonging to the "Credit Card" class. Each node in the lattice can show both concepts and objects. Concepts appear in the upper part of the node, objects in the lower part. In our case, concepts represent sets of keywords, while objects represent services. Generic concepts (referred to as top concepts) are placed in the upper part of the lattice (card, credit, Visa, Mastercard, etc.). The lattice makes it easy to find, for example, services that support both Visa and Mastercard (service11 and service6). service6 is considered to be more specific than service11 because, according to its description, it can validate credit card numbers, while service11 does not advertise such a feature. Actually, the descriptions of our two services appear as follows:

• service11: "Generic web service for VISA, MasterCard"

• service6: "Validate MasterCard, and VISA credit card numbers"

Going further in our lattice analysis, it can be noted that service16 can be used to validate credit cards and seems to be more specific than service1. This reflects what is specified in their descriptions:

• service1: “credit card (whole text is: Offering Loans And Credit Cards to Consumers)”

• service16: “accept card credit validate valid (whole text is “Will accept and validate a Credit Card Number. Returns True for a Valid Number and Returns False for a Invalid Number”)”

Note that word relevance depends on the term document frequency. If a word appears in only one document, it is pruned (“loans”, “consumer”, “invalid”); other words have been removed as stop words. According to the proposed approach, new concepts are added to the lattice when terms appear in more than one document. Therefore, if a new service description containing the word consumer is used to expand the lattice, a new concept will be added and the lattice structure will change.
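The pruning rule above can be sketched as follows, using the standard idf formula log(N/df) over a toy corpus of stopped and stemmed descriptions (documents and terms are invented for illustration): a term occurring in a single document reaches the maximum idf and is dropped.

```python
import math

# Hypothetical toy corpus: each set is one service description
# after stopping and stemming.
docs = [
    {"credit", "card", "visa"},
    {"credit", "card", "validate"},
    {"loans", "consumer", "credit"},
]

def idf(term):
    """Standard inverse document frequency: log(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# A term occurring in exactly one document has df == 1, hence the
# maximum idf log(N); such terms are pruned, mirroring the rule above.
max_idf = math.log(len(docs))
kept = {t for d in docs for t in d if idf(t) < max_idf}
print(sorted(kept))  # only terms occurring in more than one document survive
```

Here “visa”, “validate”, “loans”, and “consumer” are pruned, while “credit” and “card” survive and can become concepts in the lattice.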

Figure 5: Concept lattice of services: the credit card example

Figure 6 shows a more complex example of a service lattice, for the mail domain. In this lattice, the words mail, email, send, address, server, and even hotmail are classified as top concepts. This example also shows how it is possible to identify which services are more specific than others. For example, both service14 and service15 can verify whether a hotmail or a yahoo user exists; however, service15 is free.

5.4 Using the concept lattice for querying

As discussed in the introduction, lattices can also be used to identify which services can match a query, and which ones are more specific or more generic than the query. Let us suppose, for example, that a user is searching for a web service to verify the correctness of email addresses. After stopping and stemming, a possible query may contain the words “service verify email address”. It is possible to build a software system that, integrating the proposed approach, shows the portion of a service lattice or tree containing the desired services (see Figure 7-a). If the user wishes to refine the query, looking for a free service, the system will further restrict the portion of the lattice to be shown (Figure 7-b).
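The query-and-refine step can be sketched as restricting the context to services whose keyword sets cover the query terms; adding a term to the query narrows the matching extent. The service names and keyword sets below are invented for illustration, loosely echoing the mail-domain example:

```python
# Hypothetical mail-domain context: service names -> keyword sets
# (after stopping and stemming), invented for illustration.
services = {
    "service14": {"verify", "email", "address", "hotmail", "yahoo", "user"},
    "service15": {"verify", "email", "address", "hotmail", "yahoo", "user", "free"},
    "service5":  {"send", "email", "message"},
}

def matching(query):
    """Services whose keyword set contains every query term,
    i.e. the extent of the query seen as a set of attributes."""
    return {name for name, kw in services.items() if query <= kw}

query = {"verify", "email", "address"}
print(sorted(matching(query)))              # both verification services match
print(sorted(matching(query | {"free"})))   # refining with "free" narrows the match
```

This is the lattice-navigation behaviour of Figure 7: the initial query selects a sub-lattice, and each added term moves the focus down toward more specific concepts.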

5.5 Discussion

Results obtained for service classification showed that the approach can be useful: automatic classification identifies, with a likelihood of 83%, an ordered list of 3 out of 11 classes among which the service may be classified.

The service publisher can exploit this result from several points of view. First and foremost, he/she can accept one of the classifications proposed by the automatic tool, possibly refining the choice manually. In this case, the tool helps by reducing the publisher's classification degrees of freedom to a limited number of service classes/domains.

It may happen that the proposed classes are completely different from the publisher's expectations. If a “weather” service is classified in the “finance” or “mail” domain, the service description may be ambiguous. In this case the tool draws the publisher's attention to the problem, highlighting the need to correct the deployed service documentation. This reduces the risk that, during service search, the service is never found by queries related to its own features and is instead found by queries related to other kinds of features.

Regarding concept lattice building, it appears immediately clear that a completely automatic construction of an ontology or semantic annotation is unfeasible. This, however, was not our purpose. Instead, we found that, while human supervision and intervention cannot be avoided, useful insights can be obtained from service lattices. By highlighting relationships between services, a lattice can help to build and refine the service semantic annotation. Looking at the lattice, the publisher may find that some keywords simply make the service annotation heavier, or even misleading, and decide to remove or replace them. When service developers publish services, they are aware of the genericity/specificity of those services. If this is not reflected in the lattice, it means that the service descriptions are misleading or incomplete. More generally, if the web service textual description is incomplete, too generic, or contains phrases not properly related to the service features, it may be hard to classify the service automatically and to find its correct position in the concept lattice.

Figure 6: Concept lattice of services: the mail domain

Figure 7: Using concept lattice when querying

Much in the same way, let us suppose that some domain-specific services have already been published and semantically annotated according to a specific ontology, and that we want to publish a new service. Using proper tools, such as the FCA plug-in for Protege-2000 [2], it is possible to extract a context from the ontology. If we add a row representing our service keywords to such a context, and then build a lattice using FCA, we can immediately highlight how our service can be annotated with respect to the ontology.

The second consideration we can make about the usefulness of these concept lattices is related to ontology building. As stated in the introduction, service annotations coherent with ontologies may be necessary to allow automatic service matching for late–binding. The concept identification performed by FCA, as well as the lattice structure of these concepts, although giving a limited view of an ontology, can indeed be useful for its building, completion, or maintenance. In fact, when publishing a new service, new concepts may need to be added, especially if the ontology is not yet complete.

Conversely, when a user performs a query to retrieve a service, the following scenario can happen. First and foremost, the user is guided by the SVM classifier to focus on some particular domains. Within these domains, the portion of the lattice of interest is highlighted, significantly easing the service search. Finally, our studies suggested that lattices appear to be useful when focusing on well-restricted domains; wide domains and upper ontologies would, in fact, generate unmanageable and difficult-to-understand lattices.

6. CONCLUSIONS

This paper presented an approach, based on machine–learning techniques, to support service classification and annotation. Starting from free–text service documentation, services are automatically classified into classes/domains using Support Vector Machines. Subsequently, Formal Concept Analysis is used to build a service concept lattice for each specific domain.

The results of a classification experiment on a set of 205 services downloaded from the web showed the feasibility of the approach. Although it needs user guidance, automatic classification, by proposing the nearest three classes out of 11 with a likelihood of 83%, can ease and support service publication and annotation. Much in the same way, the obtained concept lattices highlighted relationships existing between services and aided the identification of domain key concepts. Finally, we showed with some examples how the same approach can also be integrated into the service retrieval mechanism.

Work in progress is devoted to further improving the proposed technique, confirming the obtained results with other case studies, and integrating the approach into a service broker we are developing in a project in cooperation with an Italian software company.

7. REFERENCES

[1] LIBSVM tool. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[2] FCA plug-in for Protege-2000. http://www.ntu.edu.sg/.

[3] Protege-2000. http://protege.stanford.edu/.

[4] Textual documentation of web services and classified services. http://moguntia.ucd.ie/repository/ws2003/.

[5] WordNet dictionary. http://www.cogsci.princeton.edu/~wn/.

[6] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, October 2002.

[7] G. Birkhoff. Lattice Theory. American Mathematical Society, Providence, R.I., 1940.

[8] K. M. Chai, H. T. Ng, and H. L. Chieu. Bayesian online classifiers for text classification and filtering. In M. Beaulieu, R. Baeza-Yates, S. H. Myaeng, and K. Jarvelin, editors, Proceedings of SIGIR-02, 25th ACM International Conference on Research and Development in Information Retrieval, pages 97–104, Tampere, FI, 2002. ACM Press, New York, US.

[9] G. Di Lucca, M. Di Penta, and S. Gradara. An approach to classify software maintenance requests. In Proceedings of the IEEE International Conference on Software Maintenance, pages 93–102, Montreal, QC, Canada, Oct 2002.

[10] P. W. Eklund, editor. Browsing Search Results via Formal Concept Analysis: Automatic Selection of Attributes, volume 2961/2004 of Lecture Notes in Computer Science. Springer, Feb 2004.

[11] H.-M. Haav. An application of inductive concept analysis to construction of domain-specific ontologies. In B. Thalheim and G. Fiedler, editors, Emerging Database Research in East Europe, Proceedings of the Pre-Conference Workshop of VLDB 2003, volume 14/03 of Computer Science Reports, pages 63–67. Brandenburg University of Technology at Cottbus, Nov 2003.

[12] D. Harman. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, pages 363–392. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[13] T. Joachims. Text categorization with support vector machines: learning with many relevant features. European Conference on Machine Learning (ECML 98), Apr. 1998.

[14] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200–209. Morgan Kaufmann, San Francisco, CA, 1999.

[15] T. Joachims. A statistical learning model of text classification for support vector machines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 128–136, 2001.

[16] M. Kim and P. Compton. Formal concept analysis for domain-specific document retrieval systems. Lecture Notes in Computer Science, 2256, 2001.

[17] D. D. Lewis. Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, US, 1992.

[18] E. Maedche and G. Stumme. FCA-MERGE: Bottom-up merging of ontologies. Jan. 06 2001.

[19] N. F. Noy, R. W. Fergerson, and M. A. Musen. The knowledge model of Protege-2000: Combining interoperability and flexibility. Lecture Notes in Computer Science, 1937, 2000.

[20] P. Cimiano, S. Staab, and J. Tane. Deriving concept hierarchies from text by smooth formal concept analysis. In Proceedings of the GI Workshop Lehren Lernen - Wissen - Adaptivitat (LLWA), 2003.

[21] M. Paolucci, T. Kawamura, T. R. Payne, and K. Sycara. Semantic matching of web services capabilities. In First International Semantic Web Conference (ISWC 2002), volume 2348, pages 333–347. Springer-Verlag, June 2002.

[22] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[23] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.

[24] M. Siff and T. Reps. Identifying modules via concept analysis. IEEE Transactions on Software Engineering, 25:749–768, Nov-Dec 1999.

[25] M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society B, 36:111–147, 1974.

[26] G. Tao. Using formal concept analysis for ontology structuring and building. PhD thesis, 1992.

[27] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[28] V. N. Vapnik. Statistical Learning Theory. John Wiley, Sept. 1998.

[29] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, 1999.

[30] R. Wille. Formal concept analysis. Electronic Notes in Discrete Mathematics, 2, 1999. Abstract of a tutorial given at OSDA98, Amherst, MA, September 1998.

[31] Y. Yang. A study on thresholding strategies for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, pages 137–145, New Orleans, US, 2001. ACM Press, New York, US.

[32] Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42–49, Berkeley, US, 1999. ACM Press, New York, US.

[33] L. Zeng, B. Benatallah, A. H. H. Ngu, M. Dumas, J. Kalagnanam, and H. Chang. QoS-aware middleware for web services composition. IEEE Transactions on Software Engineering, 30(5), May 2004.