MELISA. An ontology-based agent for information retrieval in medicine.

Jose María Abasolo, Mario Gómez

Institut d’Investigació en Intel·ligència Artificial (IIIA)
{abasolo, mario}@iiia.csic.es

Abstract

This paper describes MELISA (MEdical LIterature Search Agent), a prototype of an ontology-based information retrieval agent. We have designed a modular system that can be easily adapted to other medical literature sources or other professional domains. The major issues are the design of an architecture with three levels of abstraction, the use of separate ontologies and query models, and the definition of aggregation operators to combine results from different queries.

Keywords: Knowledge-based mediation architectures

Introduction and motivation

On the Internet there are many general-purpose search engines whose goal is to retrieve web pages matching some criteria. There are also some professional engines whose results are literature references; MedLine is a good example. It is a large database of biomedical bibliographic references, so we consider it a good starting point for developing an information retrieval agent for a professional purpose: looking for literature references. To achieve this goal, we have carried out a knowledge analysis process with a medical professional, and we have developed a prototype based on the resulting medical ontology.

Without detailed knowledge of the collection make-up or of the retrieval environment, most users find it difficult to formulate well-designed queries for retrieval purposes. In fact, as observed with Web search engines, users might need to spend large amounts of time reformulating their queries to accomplish effective retrieval. The user usually makes a first query, checks whether information is retrieved and examines how useful it is for his need. Most of the time he gets a list of documents too large to handle; other times he has restricted the query so much that the result is insufficient. At this point he has to reformulate his query.

The purpose of this project is to solve this typical problem within a professional domain, in our case medical literature retrieval. To do this, we implement different modules to generate queries, evaluate the results, reformulate the queries if necessary, and show the results to the user.

In this first approach we search the MedLine database, but we keep a structure that will allow us to work, in the future, with different databases or other search engines. Furthermore, to obtain a more practical realisation, we have adopted a well-known medical paradigm: “Evidence Based Medicine”.

Section 1 gives an overview of the system. Section 2 describes MedLine itself and the type of information retrieval needed to work with it. Section 3 describes the construction of a medical ontology. Section 4 addresses the design of the query models. Sections 5 and 6 describe the generation and evaluation of the queries. Section 7 shows an example. Finally, we present some conclusions and future work.

1 Overview

Here we present a general scheme of the prototype that we have designed to accomplish our objectives. We call it MELISA (MEdical LIterature Search Agent).

Figure 1 General structure of MELISA

1. Input interface: Allows the user to specify the main topics of the search, some typical medical categories and other search modifiers such as date of publication or evidence quality. All this data constitutes a Consultation. We can consider a consultation as a very abstract (independent of the database), conceptual, high-level query.

2. Query Generation & Reformulation: A consultation becomes the input for the Query Generation module. This module is the core of the entire system. It takes a Consultation, the Medical Ontology and the Query Models, and transforms the Consultation into a collection of low-level, database-dependent queries. We call them Specific Queries. As we will see later, there is another information level between Consultation and Specific Query; it is based on a domain knowledge model that we call Conceptual Query, as it acts as a link between the Consultation and the Specific Query levels. In addition, this module can reformulate the Specific Queries when the results are insufficient.

3. Query Evaluation: The results of the Specific Queries, basically a collection of literature references, are then passed to the Query Evaluation module. The function of this module is to assign a score to these references according to some criteria, such as the degree to which they fulfil the user specifications or the evidence quality of the studies referred to by the literature. Furthermore, the results of the Specific Queries are joined in a separate group for each Conceptual Query.

4. Filter & Combination: The literature references must be filtered and combined to reach a definitive score. This process is necessary because the nature of the queries makes repeated references possible. First, we need to eliminate multiple occurrences of the same reference; second, it is advisable to delete references that do not fulfil a minimum constraint-satisfaction criterion.

5. Output Interface: Last, the results are interactively recombined to be shown to the user in an appropriate manner.

6. Query Model: It refers to information schemes that represent queries at various abstraction levels. We have already mentioned two of these levels: Consultation, the highest, and Specific Query, the lowest. It allows us to make the agent as independent as possible from the context, the context being a concrete search engine like PubMed.

7. Medical Ontology: It contains the medical knowledge used to generate the queries. It was designed as a hierarchical tree, with a frame-based representation approach. This ontology must be to some degree context-free, but it has to point to elements of the search engines used by the Query Generation module.

8. PubMed & MedLine: At this moment, we are working with only one database, MedLine, and its associated web-based search engine, PubMed. But it is important to remark that our goal is to develop a general framework that allows us to reuse modules of this agent with other medical databases and also in other professional domains.

9. MeSH Browser: It is the gateway to the MeSH Ontology; it allows the user to test the terms used as keywords and gives the user some additional information about them.

2 Medline

In this section we explain the structure of MedLine’s documents and how PubMed works. We need this knowledge because the entire project has been developed around the idea of working with metadata collections, which are more useful for professional purposes.

Metadata collections are sets of documents with additional information about each document, called metadata. Metadata is information on the organisation of the data, the various data domains, and the relations between them. In short, metadata is ‘data about the data’. Common forms of metadata associated with text include the author, the date of publication, the source of publication, the length, and the document genre. This kind of metadata is usually called descriptive metadata. Another type of metadata characterises the subject matter that can be found within the document’s contents. We refer to this as semantic metadata.

The MedLine database is a metadata collection of biomedical articles. The articles stored in MedLine have both descriptive and semantic metadata. We will see later the structure of MedLine’s documents.

To standardise semantic terms, many areas use specific ontologies: collections of concepts, terms and relations between them, used to describe the knowledge domain. In MedLine, this ontology is called MeSH (Medical Subject Headings).

2.1 MedLine’s documents

As we have said, MedLine’s documents carry more information than the simple article reference. We work with these fields to make a query. The most important distinguishing field in MedLine is MH, which gives us the meaning (semantic data) of an article.

PMID: PubMed Identifier
UID: Unique Identifier
TI: The article’s title
AU: The article’s authors
LA: Language of publication
MH: Related MeSH term
PT: Publication type
DA: Date of acceptance
DP: Date of publication
AB: Abstract
SO: Source of publication

2.2 MeSH Ontology

MeSH (Medical Subject Headings) is a medical ontology made up of 18000 categories. The structure of MeSH is a polytree, a hierarchical structure where a term can appear in different branches. MeSH objects can be described with these properties:

Name: Name of the term.
Definition: Medical definition.
Related Terms: Other terms related to this term.
Subheadings: Allowed subheadings that modify and complement the meaning of this term.
Position on the polytree: Fathers and sons at all the locations of the term.

MedLine has a set of subheadings, but each term has only a subset of allowed subheadings.

The MH field of a document contains a MeSH term and optionally some of the allowed subheadings. This field represents one of the topics of the medical reference. In addition, one MeSH term can be flagged as a Major Topic (the most representative one).

2.3 Search Modifiers

PubMed allows different types of searches for a keyword through constraints that we call search modifiers. Different search modifiers impose different constraints on the search, restricting the set of data fields used to carry out the search.

MAJR, MH:NOEXP and MH can be applied to MeSH terms. TI and TW can be applied to any word or expression. PT can only be applied to a set of allowed values.

As we will see later, our system uses these modifiers to perform multiple queries for a term, looking at the medical ontology to determine which types of modifiers are allowed. The evaluation procedure assigns scores to articles according to the search modifier used in the associated query.
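As a rough illustration of how the modifiers can drive multiple queries for one term, the following Python sketch lists the modifiers applicable to a term and builds its tagged variants. The `term[MODIFIER]` syntax and the quoting of multi-word terms follow the query strings shown in the example of Section 7; the function names are hypothetical and not part of MELISA, and PT is left out because it applies only to a fixed set of allowed values.

```python
from typing import List, Optional

# Modifiers from Section 2.3, grouped by the kind of term they apply to.
MESH_ONLY_MODIFIERS = ["MAJR", "MH:NOEXP", "MH"]
ANY_TERM_MODIFIERS = ["TI", "TW"]


def allowed_modifiers(is_mesh_term: bool) -> List[Optional[str]]:
    """Return the modifiers usable with a term; None means 'no modifier'."""
    modifiers: List[Optional[str]] = []
    if is_mesh_term:
        modifiers.extend(MESH_ONLY_MODIFIERS)
    modifiers.extend(ANY_TERM_MODIFIERS)
    modifiers.append(None)
    return modifiers


def tag(term: str, modifier: Optional[str]) -> str:
    """Format a term with an optional modifier tag, quoting multi-word terms."""
    text = f'"{term}"' if " " in term else term
    return text if modifier is None else f"{text}[{modifier}]"


def variants(term: str, is_mesh_term: bool) -> List[str]:
    """All tagged variants of one term (hypothetical helper)."""
    return [tag(term, m) for m in allowed_modifiers(is_mesh_term)]


if __name__ == "__main__":
    for variant in variants("Meta-Analysis", is_mesh_term=True):
        print(variant)
```

For a valid MeSH term this yields six variants (five modifiers plus the untagged term), which matches the six specific queries per MeSH term in the example of Section 7; a non-MeSH keyword yields only three.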

3 Medical Ontology

Our goal is to develop a system able to work with diverse information sources. We think that ontologies are the best approach for information retrieval tasks in such heterogeneous environments, supporting both structured and unstructured data.

We have carried out a knowledge analysis process working with an expert. The result of this process is a collection of medical concepts that the expert considers relevant in medical practice, and some search terms for these concepts. The system then translates them into a conceptual, category-based structure. The categories were iteratively refined and the search terms systematically tested to obtain good results.

There are two main desirable properties to keep in mind when designing the medical ontology: it should be an appropriate knowledge representation of the world, and it must point to a good set of terms in the search environment. Another aspect of interest is the role the ontology can play as a user guide for specifying useful queries.

We use a frame-based approach. From this point of view, we represent the knowledge as a hierarchical tree, with classes, subclasses and instances of these classes. Each object has some slots or properties, each of which we can define as a data type. The descendants of a class inherit its slots and values.

We distinguish three abstraction levels: groups of medical categories, medical categories, and MeSH terms.

The first level refers to the main areas of medical issues. These groups serve as an organisation scheme, and they also allow us to apply different treatments to different category groups. For example, we can use different weights to score articles, according to the relative importance of the categories they belong to.

The medium level consists of a collection of medical categories. They are probably the most important concepts, as they have been chosen to capture issues of interest to medical professionals when looking for literature references in general, and from a practical EBM approach in particular. The data structures at this level have to point to elements of the third level.

MeSH terms are our particular representation of terms existing in the MeSH Ontology (the ontology used by MedLine and its associated search engine, PubMed). We have developed a module to capture and represent these terms in our ontology.

Next, we present the first two levels of the ontology and a complete description of one medical category and its related MeSH terms, in order to illustrate this knowledge representation.

[Figure 2 diagram: MEDICAL CLASS at the root, with subclasses EVIDENCE QUALITY, CLINICAL CATEGORIES, ANALYSIS and EVIDENCE INTEGRATION. Instances shown: GOOD EVIDENCE QUALITY, MEDIUM EVIDENCE QUALITY, POOR EVIDENCE QUALITY, DIAGNOSIS, THERAPY, PROGNOSIS, ADVERSE EFFECTS, RISK FACTORS, DECISION TREES, POLICY MAKING, COST ANALYSIS, GUIDELINES, NURSING, EVIDENCE BASED MEDICINE, REVIEW.]

We have defined a class at the top of the tree, called Medical Class, and four subclasses to represent groups of categories: Evidence quality, Clinical categories, Analysis and Evidence integration, each one with a small number of instances, one for each medical category.

Some slots are single data types, but others are references to other objects of the ontology.

Search modifiers (Section 2.3):

MAJR: Searches documents having this MeSH term as a Major Topic.
MH:NOEXP: Searches documents with this MeSH term, without expanding the search through the descendants of that MeSH term in the polytree.
MH: Like the previous modifier, but expanding the search through its descendants.
TI: Searches for documents having this term in the title.
TW: The same, plus looking for the term in the abstract.
PT: Searches documents with this term as Publication Type.

Figure 2. Classes and instances of the medical ontology

MEDICAL_CLASS is-a-class
    Name: String
    ClassName: String
    Description: String
    MeSH_Terms: List-of MESH_TERM
    Subheadings: List-of SUBHEADING
    Related subheadings: List-of SUBHEADING
    Related_MeSH_Terms: List-of MESH_TERM
    Alternative_Terms: List-of String
    Publication Types: List-of PUBLICATION_TYPE

Every subclass of the Medical Class specifies a class name; the rest of the slots are inherited without value, so we do not represent these slots.

Each Medical Category is represented as an instance of one of the four groups of categories; for example, we define a category for medical guidelines, belonging to the group called Evidence Integration.

At the lowest abstraction level, we define a structure called MESH_TERM to represent the MeSH terms, but only those terms used by our particular medical ontology. Note that this structure is very similar to the one described in the MeSH ontology.

We define two objects as single data types (strings) to represent subheadings and publication types. It is important to remember that these strings can only be allowed PubMed publication types and subheadings.

As other search engines may need different data, we can adapt this ontology by modifying existing concepts or adding new ones.
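The frame-based representation can be approximated with Python dataclasses, as in the minimal sketch below. The slot names mirror the MEDICAL_CLASS structure and the values come from the paper's GUIDELINES instance; representing MeSH terms, subheadings and publication types as plain strings, and the helper `search_terms`, are simplifying assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MedicalClass:
    """Top of the tree; every slot is inherited by subclasses and instances."""
    name: str
    class_name: str = ""
    description: str = ""
    mesh_terms: List[str] = field(default_factory=list)
    subheadings: List[str] = field(default_factory=list)
    related_subheadings: List[str] = field(default_factory=list)
    related_mesh_terms: List[str] = field(default_factory=list)
    alternative_terms: List[str] = field(default_factory=list)
    publication_types: List[str] = field(default_factory=list)


@dataclass
class EvidenceIntegration(MedicalClass):
    """A group of categories: only the class name receives a value."""
    class_name: str = "Integration of the evidence"


# One medical category, modelled as an instance of its group.
guidelines = EvidenceIntegration(
    name="Guidelines",
    mesh_terms=["Guidelines", "Practice Guidelines", "Clinical Protocol"],
    publication_types=["guideline", "practice guideline"],
    related_mesh_terms=["Guideline Adherence"],
)


def search_terms(category: MedicalClass) -> List[str]:
    """Collect the term-like slots a query generator could iterate over."""
    return (category.mesh_terms
            + category.related_mesh_terms
            + category.alternative_terms)
```

Adapting the ontology to another search engine would then amount to adding or changing slots on `MedicalClass`, leaving the instances' structure untouched.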

4 Query model

To understand the following sections, we now give some definitions of data structures that we call query models, because they represent information about queries. First of all, we consider three levels, as in the medical ontology.

The first level, Consultation, groups all the information needed to perform a search.

The conceptual queries are directly associated with the medical categories in the ontology; one conceptual query is needed for each medical category included in a search.

Last, specific queries define and capture the results of the real, physical queries. Each one is the result of combining some keywords with a search term given by the ontology, plus a search modifier.

These models drive the construction of queries, as well as their evaluation. For this reason, each level contains a collection of structures of the level below. This allows the query generation module to go down the levels when generating the queries, and back up when evaluating and combining their results.

The idea of decomposing the query model into three abstraction levels is to facilitate reusability: working with other search engines and other domains. Only the lowest level depends on the specific search engine. The highest level is completely independent, and the medium level is dependent only because of its association with the medical categories defined in the ontology.
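The containment between the three levels can be sketched as nested Python dataclasses. This is an illustrative reduction, not the authors' data model: only a few fields of each structure are kept, and the two traversal helpers are hypothetical names chosen to show the "down" (generation) and "up" (evaluation) directions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpecificQuery:
    """Lowest level: the only one tied to a concrete search engine."""
    query_string: str
    search_modifier: Optional[str] = None
    retrieved_uids: List[str] = field(default_factory=list)


@dataclass
class ConceptualQuery:
    """Medium level: tied to one medical category (None = keywords only)."""
    medical_category: Optional[str]
    specific_queries: List[SpecificQuery] = field(default_factory=list)


@dataclass
class Consultation:
    """Highest level: independent of any search engine."""
    keywords: List[str]
    medical_categories: List[str]
    conceptual_queries: List[ConceptualQuery] = field(default_factory=list)


def query_strings(consultation: Consultation) -> List[str]:
    """Walk down the levels, as the generation module does."""
    return [sq.query_string
            for cq in consultation.conceptual_queries
            for sq in cq.specific_queries]


def uids_by_category(consultation: Consultation) -> Dict[Optional[str], List[str]]:
    """Walk back up: merge UIDs per conceptual query, dropping repeats."""
    merged: Dict[Optional[str], List[str]] = {}
    for cq in consultation.conceptual_queries:
        seen: List[str] = []
        for sq in cq.specific_queries:
            for uid in sq.retrieved_uids:
                if uid not in seen:
                    seen.append(uid)
        merged[cq.medical_category] = seen
    return merged
```

Swapping the search engine would only require changing how `SpecificQuery.query_string` is built; the two upper structures stay as they are.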

4.1 Consultation

A consultation is the representation of the user’s need. The user gives these elements in the consultation window. Let us see what these elements are and what they mean.

The medical categories are selected from the categories defined in the medical ontology. The list of conceptual queries is built by the agent during the query generation procedure. The other fields are given by the user.

4.1.1 Keywords

The keywords are the core of the search process. They are words or expressions representing the main topics that drive the search. Any string is allowed as a keyword, but it is highly advisable to use valid MeSH terms as keywords, so that the search may obtain better results; the user should also apply some subheadings to restrict the search. These keywords are represented in a similar manner to the MeSH terms, plus some additional information:

CONSULTATION
    Keywords: List-of String
    Medical_Categories: List-of MEDICAL_CATEGORY
    From_Year: Integer
    To_Year: Integer
    Abstract: Boolean
    Conceptual Queries: List-of CONCEPTUAL_QUERY

MESH_TERM is-a-class
    Name: String
    MeSH_Description: String
    Fathers: List-of MESH_TERM
    Sons: List-of MESH_TERM
    Subheadings: List-of SUBHEADING


GUIDELINES is-an-instance-of EVIDENCE_INTEGRATION
    Name: Guidelines
    Description:
    MeSH_Terms: Guidelines, “Practice Guidelines”, “Clinical Protocol”
    Publication_Type: guideline, “practice guideline”
    Related_MeSH_Terms: “Guideline Adherence”

CLINICAL_CATEGORY is-a-subclass-of MEDICAL_CLASS
    ClassName: “Clinical categories”

EVIDENCE_INTEGRATION is-a-subclass-of MEDICAL_CLASS
    ClassName: “Integration of the evidence”

EVIDENCE_QUALITY is-a-subclass-of MEDICAL_CLASS
    ClassName: “Evidence quality”

ANALYSIS is-a-subclass-of MEDICAL_CLASS
    ClassName: Analysis

[KEYWORD structure: field listing illegible in the source]

Note that Description, Fathers, Sons, Subh, Subhsel and RelKW are only available if the keyword is a valid MeSH term. All this information is used later to generate and reformulate the queries.

4.1.2 Medical Categories

The user can select some categories from the Medical Categories described in the previous section, which allows the user to focus the search on their main areas of interest. All these categories may be selected independently of the others, except the categories about the quality of the evidence. In that case, the user should only choose the minimum quality degree of the evidence.

4.1.3 Special Filters

These carry descriptive rather than medical information. There are three filters:
- From_Year: excludes from the search documents published before this year.
- To_Year: excludes documents published after this year.
- Abstract: when true, the search process retrieves only documents with an abstract.

4.2 Conceptual Queries

A conceptual query is a structure between the consultation and the specific queries. It does not depend on the search engine. A conceptual query has these elements:

4.3 Specific Queries

SPECIFIC_QUERY
    Search_Term: ‘One of the terms described in the Medical Category for the Conceptual Query in the upper level’
    Search_Concept: belongs-to {MeSH_Term, Subheading, Related_MeSH_Term, Related_subheading, Alternative_Term, Publication Type} ‘Refers to the type of term in the Medical Ontology’
    Search_Modifier: belongs-to {MAJR, MH:NOEXP, MH, TI, TW}
    Query_String: String ‘The string that will be sent to the search engine. It results from combining the search term with the keywords and the filters specified in the consultation’
    Retrieved_Documents: List-of DOCUMENT ‘List of documents retrieved by this query. We store only the UID because it uses less memory and is easier to work with. The rest of the information of an article is fetched when the document is displayed’

This is the low-level structure. It is the one that actually works with a search engine, so it depends on the search engine.

5 Generation and reformulation of queries

Here we describe the process of transforming a consultation into a collection of specific queries, and how these queries are sent to the search engine. Before explaining this process in detail, let us discuss different approaches to the query generation task.

As we have said before, a consultation implies some conceptual queries, one for each selected medical category. Furthermore, we know that a medical category links to a collection of MeSH and non-MeSH terms, plus other concepts like publication types. In addition, each term can be combined with some search modifiers. Thus, our system has to handle a large amount of information, and we must design a method that minimises time cost and maximises information quality.

We have considered three aspects in designing the query generation strategy:

- Send all queries in parallel, so we avoid waiting for one query to finish before sending another.

- Use a short retrieval format: during the generation and evaluation procedures we retrieve only the document references (UIDs) to economise time and space.

- Perform short queries, with few search terms, rather than long queries. This allows working only with UIDs, as the evaluation procedure can score documents according to the queries they appear in.
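The three aspects above can be sketched in Python with a thread pool: all queries are submitted together, each returns only UIDs, and the evaluation input is simply the list of queries each UID appeared in. The `search` callable stands in for a real search engine client, and `fake_search` with its UIDs is invented purely so the sketch runs without network access; none of these names come from MELISA.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_in_parallel(queries: List[str],
                    search: Callable[[str], List[str]],
                    max_workers: int = 8) -> Dict[str, List[str]]:
    """Send every query at once and keep only the UIDs each one returns."""
    if not queries:
        return {}
    workers = min(max_workers, len(queries))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(search, queries))  # map preserves input order
    return dict(zip(queries, results))


def appearances(results: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each UID, list the queries it appeared in (input to the scoring)."""
    seen: Dict[str, List[str]] = {}
    for query, uids in results.items():
        for uid in uids:
            seen.setdefault(uid, []).append(query)
    return seen


def fake_search(query: str) -> List[str]:
    """Stand-in for a search engine call; queries and UIDs are made up."""
    canned = {
        "Ofloxacin AND Pneumonia AND Meta-Analysis[MAJR]": ["101"],
        "Ofloxacin AND Pneumonia AND Meta-Analysis[MH]": ["101", "202"],
    }
    return canned.get(query, [])
```

Because each query is short and tagged with a single modifier, knowing which queries returned a UID is enough to score the document later; no article content has to be fetched at this stage.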

5.1 Decomposition process

We can see the process as an iterative decomposition, from the most abstract level (consultation) to the lowest level (specific queries). As we will see later, the evaluation and combination process can be considered the inverse, as the system proceeds by integrating results from different queries into more general objects.

The first step consists of the generation of the conceptual queries, one for every medical category

[Figure 3 diagram: DECOMPOSITION LEVEL 1 splits the CONSULTATION into CONCEPTUAL QUERY 1, CONCEPTUAL QUERY 2, ..., CONCEPTUAL QUERY n; DECOMPOSITION LEVEL 2 splits each conceptual query into SPECIFIC QUERIES.]

CONCEPTUAL_QUERY
    Medical_Category: MEDICAL_CATEGORY ‘A conceptual query is associated with only one Medical Category’
    Specific_queries: List-of SPECIFIC_QUERY ‘Stores all the specific queries generated for this conceptual query, resulting from combining the keywords and filters with all the terms of the medical category associated with this Conceptual Query. This list is built during the query generation procedure (decomposition level 2)’
    Scored_documents: List-of SCORED_DOCUMENT ‘Documents resulting from evaluating and combining the documents retrieved in the specific queries. This list is created during the query evaluation procedure’

Figure 3 Query generation process

selected by the user, plus an extra conceptual query to perform searches with only the keywords and no medical category. Each conceptual query takes three parameters as inputs: the list of keywords, a medical category, and the special filters.

In the second step, the query generator decomposes each conceptual query into a set of specific queries. Each specific query is constructed by combining the keywords and special filters of one conceptual query with the items that represent the medical category for that conceptual query and their allowed search modifiers. Conceptual queries can result in very different numbers and types of specific queries, according to the structure of the medical category associated with each one.

Keywords, special filters and terms related to the medical category are combined with the AND operator, the most restrictive one. After the evaluation process, if the results are not sufficiently good, the generation process produces a new collection of specific queries, now using the OR operator. At the moment we use only a quantitative criterion: reformulate if the number of retrieved documents is less than a fixed threshold.
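The second decomposition step and the AND-to-OR reformulation can be sketched as follows. This is a simplified reading under stated assumptions: special filters are omitted, every element of a query is joined with the same operator, and how MELISA actually places the OR operator in the reformulated queries is not specified in the text. All function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple


def build_query(keywords: Sequence[str], term: str,
                modifier: Optional[str], operator: str) -> str:
    """Join the keywords and one tagged ontology term with a single operator."""
    tagged = term if modifier is None else f"{term}[{modifier}]"
    return f" {operator} ".join(list(keywords) + [tagged])


def generate(keywords: Sequence[str], terms: Sequence[str],
             modifiers: Sequence[Optional[str]], operator: str) -> List[str]:
    """One specific query per (term, modifier) pair of a conceptual query."""
    return [build_query(keywords, t, m, operator)
            for t in terms for m in modifiers]


def search_with_reformulation(keywords: Sequence[str], terms: Sequence[str],
                              modifiers: Sequence[Optional[str]],
                              search: Callable[[str], List[str]],
                              threshold: int) -> Tuple[str, List[str], Set[str]]:
    """Try AND first; fall back to OR if fewer than `threshold` UIDs come back."""
    operator, queries, uids = "AND", [], set()
    for operator in ("AND", "OR"):
        queries = generate(keywords, terms, modifiers, operator)
        uids = set()
        for query in queries:
            uids.update(search(query))
        if len(uids) >= threshold:
            break
    return operator, queries, uids
```

With six modifier variants per term, a category with four terms produces 6 * 4 = 24 specific queries per pass, so a reformulation doubles the number of requests; this is one reason to keep the threshold criterion cheap to evaluate.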

6 Combination and evaluation of queries’ results

As we have seen in the previous section, we generate a collection of specific queries for each conceptual query. Each specific query interacts with the search engine (or engines) and retrieves a list of documents. In this section we explain how to join all the documents corresponding to a conceptual query, and the functions used to score documents.

Scoring documents inside a conceptual query can be seen as assigning each document a membership value with respect to the medical category associated with that conceptual query. So, the key is to understand the meaning of the search concepts defined in the ontology, as well as the sense of the search modifiers that can be applied to these concepts. Thus, we distinguish some order relations that can be used with a membership meaning.

a) Search concepts (specific query’s Type):
   Mesh_Terms > Related_Mesh_Terms > NoMesh_Terms
   Subheading > Subheading2
   Publication Type
b) Search modifiers (specific query’s Subtype):
   MAJR > MH:NOEXP > MH > TI > TW > no-modifier

To score documents, we define a data structure with a numeric field for each search concept (Mesh_Score, Mesh2_Score, NoMesh_Score, Subh_Score, Subh2_Score, PubType_Score), and a field to store the overall score (Global_Score) of a document in the conceptual query.

This scoring process has two steps. First, MELISA has to score the documents inside a conceptual query. Second, MELISA has to combine the results of different conceptual queries, according to the user’s selection.

6.1 Scoring documents inside a Conceptual Query

After retrieval, MELISA has the following information: lists of articles (UIDs only) retrieved by the different specific queries. Now the agent needs to join all these lists into one ordered list. To achieve this goal the system uses an aggregation function that joins the information of all the specific queries inside a conceptual query; the resulting scores are used to sort the list.

6.1.1 Aggregation function

At this point we need to calculate the importance of a document a inside the conceptual query j. This importance depends on the occurrences of the document in the results of the different specific queries. The aggregation function is defined as follows for a conceptual query j:

$$\Theta_j(a) = \sum_{\forall i} c_i \, \theta_{ij}(a) \qquad [1]$$

where a is the retrieved document and $c_i$ is the weight coefficient for search modifier i. These coefficients satisfy the following restrictions:

$$\sum_{\forall i} c_i = 1; \qquad \forall i: c_i \geq 0$$

and $\theta_{ij}$ is a membership function of the documents obtained with search modifier i in conceptual query j.

6.1.2 Membership function

$$\theta_{ij}(a) = \frac{\sum_{\forall k} b_k \, n_{ijk}(a)}{N_{ij}} \qquad [2]$$

where $n_{ijk}(a)$ is the number of occurrences of document a in specific queries with search modifier i and search concept k, in conceptual query j. $N_{ij}$ is a normalisation coefficient defined as follows:

$$N_{ij} = \sum_{\forall k} b_k \, N_{ijk} \qquad [3]$$

$N_{ijk}$ is the number of specific queries with search modifier i and search concept k generated for conceptual query j. It represents the theoretical maximum of $n_{ijk}$.

The weight coefficients $b_k$ for a search concept k must satisfy the following constraint:

$$\forall k: 0 \leq b_k \leq 1$$
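To make equations [1] to [3] concrete, here is a small Python sketch of the two functions. The dictionary-based interface and the function names are assumptions made for illustration; the weights used in the usage note are invented round numbers, not the paper's coefficients.

```python
from typing import Dict


def membership(n: Dict[str, int], N: Dict[str, int], b: Dict[str, float]) -> float:
    """theta_ij(a) for one search modifier i in one conceptual query j.

    n[k]: occurrences of document a in queries with search concept k.
    N[k]: number of such queries generated (the maximum n[k] can reach).
    b[k]: weight of search concept k, between 0 and 1.
    """
    norm = sum(b[k] * N[k] for k in N)                 # N_ij, equation [3]
    if norm == 0:
        return 0.0                                     # no queries for this modifier
    return sum(b[k] * n.get(k, 0) for k in N) / norm   # equation [2]


def aggregate(c: Dict[str, float], theta: Dict[str, float]) -> float:
    """Theta_j(a), equation [1]: weighted sum over the search modifiers."""
    return sum(c[i] * theta.get(i, 0.0) for i in c)
```

For example, with concept weights `b = {"mesh": 1.0, "related": 0.5}`, two queries generated per concept, and a document found by both MeSH queries and one related-term query, the membership is (2 + 0.5) / (2 + 1) = 5/6. Since each n[k] is at most N[k] and the modifier weights sum to one, both the membership and the aggregated score stay between 0 and 1.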

6.1.3 Weight coefficients

We have two sets of weight coefficients for the search modifiers, depending on whether or not the conceptual query has a publication type (PT).

c_i                       Conceptual query without PT   Conceptual query with PT
MAJR                      0.3                           0.15
MH:NOEXP                  0.25                          0.125
MH                        0.2                           0.1
TI                        0.13                          0.065
TW                        0.08                          0.04
PT                        -                             0.5
Without Search Modifier   0.04                          0.02

Furthermore, we have another set of weight coefficients for the search concepts:

Search concept       b_k
Mesh terms           1
Related Mesh terms   0.7
Non-Mesh terms       0.5
Subheading           1
Related Subheading   0.7
Publication type     1

These values have been estimated empirically. In the current implementation of MELISA the weight coefficients are static values, but in the future we plan for the agent to learn these coefficients by interacting with the user.
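The two tables can be encoded directly and checked against the stated constraints. In the Python sketch below the numbers are the paper's; the dictionary layout, the `"NONE"` key for queries without a search modifier, and the helper names are illustrative assumptions.

```python
from typing import Dict

# Search-modifier weights c_i, without and with a publication type (PT).
C_WITHOUT_PT: Dict[str, float] = {
    "MAJR": 0.3, "MH:NOEXP": 0.25, "MH": 0.2,
    "TI": 0.13, "TW": 0.08, "NONE": 0.04,
}
C_WITH_PT: Dict[str, float] = {
    "MAJR": 0.15, "MH:NOEXP": 0.125, "MH": 0.1,
    "TI": 0.065, "TW": 0.04, "PT": 0.5, "NONE": 0.02,
}

# Search-concept weights b_k.
B: Dict[str, float] = {
    "Mesh terms": 1.0, "Related Mesh terms": 0.7, "Non-Mesh terms": 0.5,
    "Subheading": 1.0, "Related Subheading": 0.7, "Publication type": 1.0,
}


def modifier_weights(has_publication_type: bool) -> Dict[str, float]:
    """Pick the c_i set that applies to a conceptual query."""
    return C_WITH_PT if has_publication_type else C_WITHOUT_PT


def is_valid_modifier_set(c: Dict[str, float]) -> bool:
    """Constraints on c_i: non-negative and summing to one."""
    return all(v >= 0 for v in c.values()) and abs(sum(c.values()) - 1.0) < 1e-9


def is_valid_concept_set(b: Dict[str, float]) -> bool:
    """Constraint on b_k: each weight lies between 0 and 1."""
    return all(0.0 <= v <= 1.0 for v in b.values())
```

Both columns satisfy the sum-to-one constraint. The "with PT" column is the "without PT" column halved, with the remaining 0.5 assigned to the publication type, so a matching publication type alone contributes as much as all other modifiers together.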

6.2 Combination of documents from different conceptual queries

At this point, the system has a list of scored documents for each conceptual query. MELISA allows the user to combine the results of different conceptual queries. For this purpose, we need to compare and combine the scores of articles in different conceptual queries.

This is not an easy task, because each conceptual query is associated with a medical category, and these medical categories do not all have the same structure. The structure depends on the number of terms used to define a medical category, the specificity of these terms and their type. A consequence is that documents cannot get high scores equally easily; it depends on the conceptual query in which they have been retrieved. The best document of a conceptual query where the top score is 0.8 is not directly comparable with the best document of one where the top score is 0.2.

We want a function that applies different corrections to the document scores according to the empirical maximum reached. For categories reaching a maximum score of one, no modification is needed; for categories with a lower maximum, we apply a greater correction to the score. For example, let us suppose we have a document with the following scores.

             Score   Maximum   Desired increment
Category 1   0.3     0.5       Medium
Category 2   0.1     0.1       Large
Category 3   0.8     1         Small

To achieve our goal, we define the following function:

$$\Theta(a) = \frac{\sum_{\forall j} \Theta'_j(a)}{N}, \qquad \Theta'_j(a) = \left(\Theta_j(a)\right)^{K_j} \qquad [4]$$

where N is the number of conceptual queries we want to combine and $K_j$ is the maximum of $\Theta_j(a)$.

With these new scores, MELISA sorts the resulting list of documents that will be shown to the user.
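The correction can be tried on the three-category example. The Python sketch below assumes that the corrected score is the original score raised to the power K_j (the category's empirical maximum) and that the combined score is the mean of the corrected scores over the N conceptual queries; this reading of equation [4] is an interpretation, and the function names are hypothetical.

```python
from typing import List, Tuple


def corrected(score: float, k_max: float) -> float:
    """Raise a score in [0, 1] to the category maximum; lower maxima lift it more."""
    return score ** k_max


def combine(pairs: List[Tuple[float, float]]) -> float:
    """Mean of the corrected scores over the conceptual queries being combined."""
    if not pairs:
        return 0.0
    return sum(corrected(score, k_max) for score, k_max in pairs) / len(pairs)


# (score, maximum) for Category 1, 2 and 3 of the example table.
example = [(0.3, 0.5), (0.1, 0.1), (0.8, 1.0)]

if __name__ == "__main__":
    for score, k_max in example:
        print(score, "->", round(corrected(score, k_max), 3))
    print("combined:", round(combine(example), 3))
```

Under this reading, the score 0.3 rises to about 0.548 (a medium increment), 0.1 rises to about 0.794 (a large one), and 0.8 is unchanged because its category already reaches a maximum of one, which agrees with the desired increments in the table.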

7 Example

Here we present an example taken from real medical practice.

Let us suppose that the user looks for information on current Levofloxacin treatments for pneumonia. In addition, he wants to know if there is some evidence concerning this (EBM), so he creates the following consultation:

Keywords:
    ‘Levofloxacin’
    ‘Pneumonia’

Medical_categories:
    ‘Good evidence quality’
    ‘Therapy’
    ‘Recommendations based on the evidence’
    ‘Guidelines’
    ‘Cost Analysis’

Special filters:
    - Documents accepted from 1960 to 2000
    - Only retrieve articles with abstract

Figure 4 shows the Consultation window for this example. In this window the user can add or delete keywords, select medical categories and set special filters.

Figure 4 Consultation window

Each new keyword is automatically checked against the MeSH ontology, and the list of allowed subheadings is shown if it is a valid MeSH term. Information about keywords is shown in the MeSH window. Figure 5 shows the MeSH window that corresponds to the term ‘Levofloxacin’. It is not an exact MeSH term, but it is associated with the MeSH term ‘Ofloxacin’. At this point the user has two options: to accept the MeSH term given by the MeSH Browser, or to refuse this term and try another one.

In this window, the user can also specify subheadings for the keyword.

If the user enters a term that is not a valid MeSH term, the system warns him and shows a list of similar valid MeSH terms. The user can then choose one of the suggested terms or keep his own non-MeSH term.

Figure 6 shows an example of this case. We have simulated a human mistake: the user writes 'Levoflaxin' instead of 'Levofloxacin'. The system shows a list of terms, starting with 'Levofloxacin', so the user can choose the correct one.

Figure 6 MeSH window 2

After the first decomposition task, we obtain 6 conceptual queries: one for each selected medical category, and an additional conceptual query with only the keywords.
1. Ofloxacin + Pneumonia + GOOD_EVIDENCE_QUALITY
2. Ofloxacin + Pneumonia + THERAPY
3. Ofloxacin + Pneumonia + EBM
4. Ofloxacin + Pneumonia + GUIDELINES
5. Ofloxacin + Pneumonia + COST_ANALYSIS
6. Ofloxacin + Pneumonia
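The first decomposition step can be sketched in a few lines of Python. This is only an illustration of the rule stated above (one conceptual query per selected category, plus one with the keywords alone); the function name and the tuple representation are assumptions, not MELISA's internal data structures.

```python
def conceptual_queries(keywords, categories):
    """Build one conceptual query per medical category, plus a keyword-only query.

    Each conceptual query is represented as (keywords, category); the
    keyword-only query uses None in place of a category.
    """
    queries = [(tuple(keywords), category) for category in categories]
    queries.append((tuple(keywords), None))
    return queries


keywords = ["Ofloxacin", "Pneumonia"]
categories = [
    "GOOD_EVIDENCE_QUALITY",
    "THERAPY",
    "EBM",
    "GUIDELINES",
    "COST_ANALYSIS",
]

queries = conceptual_queries(keywords, categories)
for number, (terms, category) in enumerate(queries, start=1):
    parts = list(terms) + ([category] if category else [])
    print(f"{number}. " + " + ".join(parts))
```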

We now look in detail at the conceptual query generated for the medical category 'Good Evidence Quality'. The structure of this instance is given in the GOOD_EVIDENCE_QUALITY listing further below.

The second decomposition procedure generates 6 specific queries for every MeSH term in the conceptual query associated with 'Good Evidence Quality'. For example, for the term 'Meta-Analysis':
1. Ofloxacin * Pneumonia AND Meta-Analysis[MAJR]
2. Ofloxacin * Pneumonia AND Meta-Analysis[MH:NOEXP]
3. Ofloxacin * Pneumonia AND Meta-Analysis[MH]
4. Ofloxacin * Pneumonia AND Meta-Analysis[TI]
5. Ofloxacin * Pneumonia AND Meta-Analysis[TW]
6. Ofloxacin * Pneumonia AND Meta-Analysis

As this medical category has four MeSH terms, it produces 6 * 4 = 24 queries.

One specific query is generated for every publication type:
1. Ofloxacin * Pneumonia AND Meta-Analysis[PT]
2. Ofloxacin * Pneumonia AND "Randomized Controlled Trial"[PT]
3. Ofloxacin * Pneumonia AND "Clinical Trial, Phase III"[PT]
4. Ofloxacin * Pneumonia AND "Clinical Trial, Phase IV"[PT]

In total, 28 specific queries are generated for the first conceptual query. Note that the results of the specific queries must later be combined into a single collection of documents, evaluated and sorted by means of an aggregation function.
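The count of 28 can be checked with a short Python sketch of the second decomposition. The modifier list and the term lists are taken from the example; the function name, the uniform quoting of terms and the choice of AND as the keyword operator are assumptions made for illustration.

```python
# Field modifiers applied to each MeSH term; the empty string is the unqualified search.
MODIFIERS = ["[MAJR]", "[MH:NOEXP]", "[MH]", "[TI]", "[TW]", ""]


def specific_queries(keywords, mesh_terms, publication_types, op="AND"):
    """Expand one conceptual query into its specific query strings.

    Every MeSH term is combined with every modifier, and every publication
    type contributes one additional query tagged [PT].
    """
    base = f" {op} ".join(keywords)
    queries = [
        f'{base} AND "{term}"{modifier}'
        for term in mesh_terms
        for modifier in MODIFIERS
    ]
    queries += [f'{base} AND "{ptype}"[PT]' for ptype in publication_types]
    return queries


mesh_terms = [
    "Meta-Analysis",
    "Randomized Controlled Trials",
    "Clinical Trials, Phase III",
    "Clinical Trials, Phase IV",
]
publication_types = [
    "Meta-Analysis",
    "Randomized Controlled Trial",
    "Clinical Trial, Phase III",
    "Clinical Trial, Phase IV",
]

queries = specific_queries(["Ofloxacin", "Pneumonia"], mesh_terms, publication_types)
print(len(queries))
```

The 6 * 4 queries on MeSH terms plus the 4 publication-type queries give the 28 specific queries mentioned above.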

The process is repeated for the remaining conceptual queries. There is a special case: the conceptual query with no medical category. In this case, the system has to combine only the keywords with their allowed modifiers, without any search term from the medical categories.
1. Ofloxacin[MAJR] * Pneumonia[MAJR]
2. Ofloxacin[MH:NOEXP] * Pneumonia[MH:NOEXP]
3. Ofloxacin[MH] * Pneumonia[MH]
4. Ofloxacin[TI] * Pneumonia[TI]
5. Ofloxacin[TW] * Pneumonia[TW]
6. Ofloxacin * Pneumonia

Here, * represents an operator for combining keywords: AND or OR.

After scoring all the documents in all the queries, the system shows the documents that best fulfil the user requirements. The Results window (Figure 7) consists basically of a list of the retrieved documents, ordered by combined score. The user can select some of the categories used in the search in order to combine all their documents in a single list. Note that only basic information about each document is shown in this window: authors, title, publication data and source of publication. We use a hypertext format to make it easy to focus on any document. The authors' information is a link to the extended information about an article, shown in the Document viewer (Figure 8). Furthermore, we insert two special links: one to append a document to a list of documents the user wants to order, and another to show related articles suggested by MedLine. Figure 7 shows the documents corresponding to the default setup; these are the best documents resulting from combining all the categories selected in the Consultation window.

Figure 5 MeSH window

GOOD_EVIDENCE_QUALITY is-an-instance-of EVIDENCE_QUALITY
  Name: Good evidence quality
  MeSH_Terms: "Meta-Analysis", "Randomized Controlled Trials", "Clinical Trials, Phase III", "Clinical Trials, Phase IV"
  Publication_Type: "Meta-Analysis", "Randomized Controlled Trial", "Clinical Trial, Phase III", "Clinical Trial, Phase IV"

Remember that MELISA only works with the documents' UIDs (unique identifiers), so the system has to retrieve the complete information about the articles before displaying them. Since this is a very expensive process, the system shows only twenty documents per page.
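The paging strategy can be sketched as follows in Python. The idea shown is only that full records are fetched for the twenty UIDs of the requested page and for no others; `fake_fetch` is a hypothetical stand-in for the expensive remote lookup, and the UIDs are made up.

```python
PAGE_SIZE = 20  # documents shown per page, as stated in the text


def page_of_uids(ranked_uids, page):
    """Return the slice of the ranked UID list that belongs to a zero-based page."""
    start = page * PAGE_SIZE
    return ranked_uids[start:start + PAGE_SIZE]


def render_page(ranked_uids, page, fetch_record):
    """Fetch full records only for the UIDs on the requested page."""
    return [fetch_record(uid) for uid in page_of_uids(ranked_uids, page)]


# Hypothetical stand-in for the expensive remote lookup; it records each call.
fetched = []


def fake_fetch(uid):
    fetched.append(uid)
    return {"uid": uid, "title": f"Article {uid}"}


ranked = list(range(1000, 1045))  # 45 made-up UIDs, already sorted by score
first_page = render_page(ranked, 0, fake_fetch)
```

After rendering the first page, only twenty lookups have been made, although forty-five documents were ranked.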

Figure 7 Result window

The Document viewer extends the information offered in the Results window, including the abstract, the MeSH terms and the major topics of an article, plus the scores for all the categories and the combined score. Figure 8 shows the Document viewer when the first document in the Results window is selected.

8 Conclusions and future work

Before addressing performance evaluation and future work, let us discuss the reasons for adopting this approach to the evaluation procedure instead of those typically used in IR.

Typical IR functions compare documents to queries based on term frequencies. They are best suited to unstructured data, but can also be used with structured data, as in MedLine. The problem is that the system needs to retrieve, store and analyse a large amount of information, which may be impractical or too resource-consuming. We present a method that avoids this problem: instead of retrieving each document once with a complete description in order to score it, our system generates a large number of different queries for each concept, retrieving only the identification numbers of the documents in the database. Our scoring procedure is based on the number and characteristics of the queries in which documents appear. We think this approach takes advantage of the properties of the Internet, since the information retrieval task can readily be executed in parallel.
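A minimal Python sketch of the parallel execution mentioned above, using a thread pool from the standard library. The search function is a hypothetical stub over made-up UIDs, and the plain appearance count is only a placeholder for illustration; it is not MELISA's scoring function, which also takes the characteristics of each query into account.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_queries(queries, search, max_workers=8):
    """Run every query concurrently and return {query: set of UIDs}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search, queries))
    return {query: set(uids) for query, uids in zip(queries, results)}


def appearance_counts(results):
    """Count in how many queries each UID appears (placeholder for a real score)."""
    counts = Counter()
    for uids in results.values():
        counts.update(uids)
    return counts


# Hypothetical stub: maps a query string to the UIDs it would return.
FAKE_INDEX = {"q1": [101, 102], "q2": [102, 103], "q3": [102]}


def fake_search(query):
    return FAKE_INDEX[query]


results = run_queries(list(FAKE_INDEX), fake_search)
counts = appearance_counts(results)
```

Because each query returns only identifiers, the per-query payload stays small, which is what makes issuing many queries at once affordable.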

Currently we are evaluating the system's performance by comparing the results offered by our system against the results obtained by humans using PubMed, the MedLine web-based search engine. An expert has proposed a collection of real medical cases, which we have translated into MELISA consultations. We have tested two different evaluation functions:

1. Multi-valued logic using t-norm operators¹
2. Aggregation operators

An expert has evaluated the results offered by our system by classifying the documents retrieved by MELISA. He can assign a document to one of these three categories:

I. The document does not satisfy the user requirements

II. The document is related to the requirements but does not satisfy them sufficiently

III. The document satisfies the user requirements sufficiently

We compare the frequencies of each group to determine the relative performance of the evaluation functions. We will present a systematic analysis of the results in future papers; nevertheless, we present some conclusions about the comparison of both methods.

The approach based on multi-valued logic seems to be better for combining documents between conceptual queries. Its problem is the loss of information when calculating scores within a conceptual query.
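The loss of information can be seen in a small Python sketch of the multi-valued logic, with OR taken as the maximum within a conceptual query and AND as the minimum between conceptual queries. The function names and all scores are made up for illustration.

```python
def score_within_query(specific_scores):
    """OR over the specific queries of one conceptual query: the maximum."""
    return max(specific_scores, default=0.0)


def score_between_queries(conceptual_scores):
    """AND over the conceptual queries being combined: the minimum."""
    return min(conceptual_scores)


# Made-up scores: document A matches one specific query, document B matches three.
doc_a = [0.6]
doc_b = [0.6, 0.5, 0.4]

# The maximum ignores how many specific queries a document matched.
same_within = score_within_query(doc_a) == score_within_query(doc_b)

# Combining three conceptual queries: the weakest category decides the result.
overall = score_between_queries([score_within_query(doc_b), 0.3, 0.9])
```

Both documents receive the same score within the conceptual query, even though document B appears in three specific queries and document A in only one.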

The approach finally adopted, aggregation operators, is very accurate when evaluating documents within a conceptual query. Its weak point lies in combining documents from different categories, because every category has a different statistical distribution of scores. The evaluation procedure confirms our analysis: we detect a loss of precision as the number of categories combined to obtain the overall result increases.

¹ A continuous multi-valued logic based on the minimum t-norm and the maximum t-conorm. The OR operator, defined as the maximum, is used to combine scored documents within a conceptual query, and the AND operator, defined as the minimum, to combine documents between different conceptual queries.

Figure 8 Document viewer

MELISA is able to process, score and combine a large amount of medical literature in an acceptable way, sparing the user tedious and imprecise work. However, we plan further work to improve and extend the capabilities of our information agent.

First, we will develop user profiles, so that the system can adapt to different users. Second, we will add capabilities that allow the system to work with other medical databases available on-line. Third, we should add capabilities to handle different evaluation functions, such as fuzzy measures and WOWA operators. Fourth, we must study more complex criteria to determine when to reformulate the specific queries. Fifth, we will compare the different evaluation functions and combinations of these functions. As we can distinguish evaluation within a category from evaluation between categories, we can apply a different method to each of these two functions. Finally, we think it would be very interesting to incorporate methods to learn the weight coefficients and the user profile.

Acknowledgements

The authors would like to thank the Spanish Scientific Research Council for its support. This work has been developed under the SMASH project (TIC96-1038-C04-01) and the IBROW project (IST-1999-190005).

Special thanks to Enric Plaza for his suggestions and Albert Verdaguer for his assistance in the analysis of the medical domain.
