


Information Processing and Management 45 (2009) 356–367


A relevance model for a data warehouse contextualized with documents

Juan Manuel Pérez *, Rafael Berlanga, María José Aramburu
Universitat Jaume I, Campus de Riu Sec, E-12071 Castelló de la Plana, Spain

Article info

Article history:
Received 24 July 2008
Received in revised form 17 October 2008
Accepted 9 November 2008
Available online 9 January 2009

Keywords:
Relevance-based language model
Data warehouse
Text-rich document collection

0306-4573/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2008.11.001

* Corresponding author. Tel.: +34 964 728368; fax: +34 964 728435.
E-mail addresses: [email protected] (J.M. Pérez), [email protected] (R. Berlanga), [email protected] (M.J. Aramburu).

Abstract

This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, which is a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

For decades, the information retrieval (IR) area has provided users with methods and tools for searching interesting pieces of text among huge document collections. However, until very recently these techniques have been implemented apart from databases due to the very different nature of the objects they manage: whereas data is well-structured with well-defined semantics, texts are unstructured and require approximate query processing (Baeza-Yates & Ribeiro-Neto, 1999).

Nowadays, corporate information systems need to include internal and external text-based sources (e.g., web documents) in the information processes defined within the organization. For example, decision support systems would greatly benefit from text-rich sources (e.g., financial news and market research reports), as they can help analysts to understand the historical trends recorded in corporate data warehouses. Opinion forums and blogs are also valuable text sources that can be of great interest for enhancing decision-making processes. Unfortunately, few works in the literature are concerned with a true integration of data and document retrieval techniques.

Recent proposals in the field of IR include language modeling (Ponte & Croft, 1998) and relevance modeling (Lavrenko & Croft, 2001). Language modeling represents each document as a language model; documents are then ranked according to the probability of emitting the query keywords from the corresponding language model. Relevance modeling estimates the joint probability of the query keywords and the document words over the set of documents deemed relevant for that query. In this paper, we apply the language modeling and relevance modeling approaches to develop a new model that estimates the relevance of the facts stored in a data warehouse with respect to an IR query. These facts are well-structured data tuples, whose meaning is described by a set of documents retrieved with the same IR query from a separate text repository.

The proposed relevance model is the core of the contextualized warehouse described in Pérez, Berlanga, Aramburu, and Pedersen (2008). However, the topic of Pérez et al. (2008) was the multidimensional model of the contextualized warehouse, rather than the relevance model. In the current paper, we describe the relevance model in detail, and we compare it with the



relevance-based language model techniques that support it. The paper provides a series of experiments over a well-known IR collection in order to demonstrate that the ranking of facts provided by the model is good enough for helping analysts in their tasks. This evaluation is completely new and has not been previously published anywhere. The review of the language modeling and relevance modeling approaches included in this paper is also an original contribution.

The rest of the paper is organized as follows: Section 2 overviews the contextualized warehouse. Section 3 reviews the language modeling and the relevance modeling IR approaches. Section 4 presents the contextualized warehouse relevance model and Section 5 evaluates it. Finally, Section 6 discusses some conclusions and future lines of work.

2. The contextualized warehouse

A contextualized warehouse is a new kind of decision support system that allows users to obtain strategic information by combining sources of structured data and documents. Fig. 1 shows the architecture of the contextualized warehouse presented in Pérez et al. (2008). Its three main components are a corporate data warehouse, a document warehouse and the fact extractor module. Next, we briefly describe these components:

(a) The corporate data warehouse integrates data from the organization's structured data sources (e.g., its different department databases). The integrated data is organized into multidimensional data structures, denoted OLAP (On-Line Analytical Processing) cubes (Codd, 1993). In these cubes, the data is divided into facts, the central entities/events for the desired analysis, e.g., the sales, and hierarchical dimensions, which characterize the facts, e.g., the products sold and the grouping of products into categories. Typically, the facts have associated numerical measures (e.g., profit), and analysis operations aggregate the fact measure values up to a certain level of detail, e.g., total profit by product category and month (Pedersen & Jensen, 2005). The facts can be conceptually modeled as data tuples whose elements depict dimension and measure values. For instance, the fact f = (Product.ProductID = fo1, Customer.Country = Japan, Time.Month = 1998/10, SUM(Profit) = 300,000$) could represent the total profit for the sales of the product fo1, made to Japanese customers during October 1998: 300,000$. The OLAP cubes can be stored by following either a so-called ROLAP and/or a so-called MOLAP approach. ROLAP stands for Relational OLAP, since the data is stored in relational tables. In order to map the multidimensional data cubes into tables, different logical schemas have been proposed. The star and the snowflake schemas are the most commonly used. The star schema consists of a fact table plus one dimension table for each dimension. Each tuple in the fact table has a foreign key column to each of the dimension tables, and some numeric columns that represent the measures. The snowflake schema extends the star schema by normalizing and explicitly representing the dimension hierarchies. In the Multidimensional OLAP (MOLAP) alternative, special data structures (e.g., multidimensional arrays) are used for the storage instead.
The construction of a corporate warehouse for structured data has been broadly discussed in classical references like Inmon (2005).
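As an illustration of the star schema described above, the following sketch stores the example sales fact in a fact table with one foreign key per dimension table. It uses Python's built-in sqlite3; all table and column names are our own illustration, not from the paper.

```python
import sqlite3

# A minimal star schema for the sales example: one fact table (sales)
# with a foreign key to each dimension table and a numeric measure.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_id TEXT PRIMARY KEY, category TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE time     (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE sales (                    -- fact table
    product_id  TEXT    REFERENCES product(product_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    time_id     INTEGER REFERENCES time(time_id),
    profit      REAL                    -- numerical measure
);
""")
conn.execute("INSERT INTO product VALUES ('fo1', 'Food')")
conn.execute("INSERT INTO customer VALUES (1, 'Japan')")
conn.execute("INSERT INTO time VALUES (1, '1998/10')")
conn.execute("INSERT INTO sales VALUES ('fo1', 1, 1, 300000.0)")

# A simple roll-up: aggregate the profit measure by country and month.
row = conn.execute("""
    SELECT c.country, t.month, SUM(s.profit)
    FROM sales s JOIN customer c USING (customer_id)
                 JOIN time t USING (time_id)
    GROUP BY c.country, t.month
""").fetchone()
print(row)  # ('Japan', '1998/10', 300000.0)
```

A snowflake variant would further normalize the dimension tables (e.g., a separate category table referenced by product).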

(b) The document warehouse stores the unstructured data coming from internal and external sources. These documents describe the context of the corporate facts. They provide users with additional information related to the facts, which is very useful for understanding the results of the analysis operations. For instance, a petrol crisis reported in an economy article may explain a sales drop.

(c) The objective of the fact extractor module is to relate the facts of the corporate warehouse with the documents that describe their contexts. This module first identifies dimension values in the metadata and textual contents of the documents, and then links each document with those facts that are characterized by the same dimension values. In Pérez (2007) we showed that the information extraction techniques proposed in Danger, Berlanga, and Ruiz-Shulcloper

Fig. 1. Contextualized warehouse architecture.


Table 1
Example R-cube.

Fact Id   Dimensions                         Measures      R-cube special dimensions
          Product Id   Country   Month      Profit        R      Ctxt
f1        fo1          Cuba      1998/03    4,300,000$    0.05   d_23^0.005, d_47^0.005
f2        fo2          Japan     1998/02    3,200,000$    0.1    d_50^0.02
f3        fo2          Korea     1998/05    900,000$      0.2    d_84^0.04
f4        fo1          Japan     1998/10    300,000$      0.4    d_23^0.041, d_7^0.08
f5        fo2          Korea     1998/11    400,000$      0.25   d_7^0.08, d_69^0.01


(2004), Llidó, Berlanga, and Aramburu (2001) can be used for identifying the dimension values in the documents' contents. The fact extractor module would link the example fact f to those documents of the warehouse that report events located in Japan during October 1998.

In a contextualized warehouse, the user specifies an analysis context by supplying a sequence of keywords (i.e., an IR query Q like "petrol crisis"). The analysis is performed on a new type of OLAP cube, called R-cube, which is materialized by retrieving the documents and facts relevant for the selected context. Table 1 shows an example R-cube. Each row represents a fact fi, and each column a dimension. In this case, the facts represent the total profit of the sales per product, country and month. R-cubes have two special dimensions: the relevance (R) and the context (Ctxt) dimensions. In the relevance dimension, each fact has a numerical value representing its relevance with respect to the specified context (e.g., how important the fact is for a "petrol crisis"); hence the name R-cube (Relevance cube). The context dimension links each fact to the set of documents that describe its context. In the R-cube, each d_j^r denotes a document whose relevance with respect to the analysis context is r.

The most relevant facts of our example R-cube are f4 and f5, which involve the sales made to Japanese and Korean customers during the months of October and November 1998. By studying the documents associated with these facts, e.g., the most relevant one, d7, we may find a report on a petrol crisis that affected Japan and Korea during the second half of 1998. Probably, this report could explain why the sales represented by f4 and f5 experienced the sharpest drop.

The formal definition of the R-cube's multidimensional data model and algebra was given in Pérez et al. (2008). A prototype of a contextualized warehouse was presented in Pérez, Berlanga, Aramburu, and Pedersen (2007). This paper presents the IR model of the contextualized warehouse. Given a context of analysis (i.e., an IR query), we first retrieve the documents of the warehouse by following a language modeling approach. Then, we rely on relevance modeling to rank the facts described in the retrieved documents. Language modeling and relevance modeling establish a formal foundation based on probability theory, which is also well-suited for studying the influence of the R-cube algebra operations on the relevance values of the facts (Pérez et al., 2008).

3. Language models and relevance-based language models

The work on language modeling estimates a language model m_j for each document d_j. A language model is a stochastic process which generates documents by emitting words randomly. The documents d_j are then ranked according to the probability P(Q|m_j) of emitting the query keywords Q from the respective language model m_j (Ponte & Croft, 1998).

The calculation of the probability P(Q|m_j) differs from model to model. In Song and Croft (1999) the query Q is represented as a sequence of independent keywords q_i, Q = q_1, q_2, ..., q_n (let q_i ∈ Q mean that the keyword q_i appears in the sequence Q), and the probability P(Q|m_j) is computed by

P(Q \mid m_j) = \prod_{q_i \in Q} P(q_i \mid m_j)    (1)

Song and Croft (1999) propose to approximate the probability P(q_i|m_j) of emitting the keyword q_i from m_j by smoothing the relative frequency of the query keyword in the document d_j. Their approach avoids probabilities equal to zero in P(Q|m_j) when a document does not contain all the query keywords. They make the assumption that finding a keyword in a document might be at least as probable as observing it in the entire collection of documents, and estimate this probability as follows:

P(q_i \mid m_j) = (1 - \lambda)\,\frac{freq(q_i, d_j)}{|d_j|_w} + \lambda\,\frac{cwf_{q_i}}{coll\_size_w}    (2)

In formula (2), freq(q_i, d_j) is the frequency of the keyword q_i in the document d_j. The term |d_j|_w denotes the total number of words in the document, cwf_{q_i} is the number of times that the query keyword q_i occurs in all the documents of the collection, and coll_size_w is the total number of words in the collection. The λ factor is the smoothing parameter, and its value is determined empirically, λ ∈ [0, 1].
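Formulas (1) and (2) can be sketched as follows; this is a toy implementation under our own naming (tokenized documents as lists of words, lam for the smoothing parameter λ), not the paper's code.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(Q|m_j): formula (1), with each P(q_i|m_j) smoothed as in formula (2).
    query and doc are lists of tokens; collection is a list of such docs."""
    doc_tf = Counter(doc)
    coll_tf = Counter(w for d in collection for w in d)
    coll_size = sum(coll_tf.values())
    p = 1.0
    for q in query:
        p_doc = doc_tf[q] / len(doc)           # freq(q_i, d_j) / |d_j|_w
        p_coll = coll_tf[q] / coll_size        # cwf_{q_i} / coll_size_w
        p *= (1 - lam) * p_doc + lam * p_coll  # formula (2), multiplied per (1)
    return p

# Toy collection: the first document contains both keywords, so it scores
# higher, yet the second still gets a nonzero score thanks to smoothing.
docs = [["petrol", "crisis", "japan"], ["stock", "market", "news"]]
scores = [query_likelihood(["petrol", "crisis"], d, docs) for d in docs]
```

Note how smoothing keeps the second document's score above zero even though it contains none of the keywords, which is exactly what formula (2) is designed to achieve.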


The retrieval model of the contextualized warehouse proposed in this paper also models the queries as sequences of keywords and follows a similar approach to compute the relevance of the documents.

Many well-known IR techniques, such as relevance feedback, have a very intuitive interpretation in the classical probabilistic models (Robertson, 1997). These techniques require modifying a sample set of relevant documents according to the user's relevance judgments. However, they are difficult to integrate into the language modeling framework, where there is no such notion of a set of relevant documents.

The work on relevance modeling returns to the probabilistic models' view of the document ranking problem, i.e., the estimation of the probability P(w_i|R) of finding the word w_i in an ideal set of relevant documents R. The purpose of relevance modeling is to identify those words that indicate relevance and thus will be effective when comprising a query. These works make the assumption that, in the absence of training data, but given a query Q = q_1, q_2, ..., q_n, the probability P(w_i|R) can be approximated by the probability P(w_i|q_1, q_2, ..., q_n) of the co-occurrence of the sequence of query keywords Q and the word w_i (Lavrenko & Croft, 2001), that is

P(w_i \mid R) \approx P(w_i \mid Q) = \frac{P(w_i, Q)}{P(Q)}    (3)

Let M = {m_j} be the finite universe of language models m_j that (notionally) generated the documents of the collection. If we assume independence between the word w_i and the query keywords Q,¹ the joint probability P(w_i, Q) can then be computed by the total probability of emitting the word and the query keywords from each language model in M:

P(w_i, Q) = \sum_{m_j \in M} P(m_j)\, P(w_i \mid m_j)\, P(Q \mid m_j)    (4)

Formula (4) can be interpreted as follows: P(m_j) is the probability of selecting a language model m_j from the set M, P(w_i|m_j) is the probability of emitting the word w_i from the language model m_j, and P(Q|m_j) is the probability of emitting the query keywords Q from the same language model. As in the language modeling approach (Song & Croft, 1999), the probability P(w_i|m_j) can be estimated by the smoothed relative frequency of the word in the document; see formula (2).

By applying the Bayes’ conditional probability theorem, the probability PðQ jmjÞ can be computed by

P(Q \mid m_j) = \frac{P(m_j \mid Q)\, P(Q)}{P(m_j)}    (5)

Replacing P(Q|m_j) by the previous expression in formula (4), we obtain:

P(w_i, Q) = \sum_{m_j \in M} P(w_i \mid m_j)\, P(m_j \mid Q)\, P(Q)    (6)

Finally, by including formula (6) in the expression (3), the approximation of the probability P(w_i|R) results in

P(w_i \mid R) \approx \sum_{m_j \in M} P(w_i \mid m_j)\, P(m_j \mid Q)    (7)

In order to implement the relevance models in an IR system, the set M is restricted to contain only the language models of the k top-ranked documents retrieved by the query Q. The system performs the following two steps (Lavrenko et al., 2002):

1. Retrieve from the document collection the documents that contain all or most of the query keywords and rank the documents according to the probability P(m_j|Q) that they are relevant to the query. As formula (5) shows, this is equivalent to ranking the documents by the probability P(Q|m_j), since the probabilities P(m_j) and P(Q) are constant across queries. The language modeling formula proposed in Song and Croft (1999) can be used for this purpose; see formula (1). Let R_Q be the set composed of the language models associated with the top r ranked documents. R_Q stands for documents Relevant to the Query.

2. Approximate the probability P(w_i|R) of finding a word w_i in the ideal set of relevant documents R by the probability P(w_i|R_Q) of emitting it from the set of relevant document language models R_Q:

P(w_i \mid R) \approx P(w_i \mid R_Q) \approx \sum_{m_j \in R_Q} P(w_i \mid m_j)\, P(m_j \mid Q)    (8)
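The two steps above can be sketched as follows. This is a toy implementation under our own naming; as a simplification we take P(m_j|Q) proportional to P(Q|m_j), normalizing the weights over R_Q.

```python
from collections import Counter

def p_word(w, doc, coll_tf, coll_size, lam=0.5):
    # Smoothed P(w|m_j), as in the language-modeling estimate of formula (2).
    return (1 - lam) * Counter(doc)[w] / len(doc) + lam * coll_tf[w] / coll_size

def relevance_model(query, collection, k=2, lam=0.5):
    """Step 1: rank documents by P(Q|m_j) and keep the top k as R_Q.
    Step 2: approximate P(w|R_Q) by formula (8)."""
    coll_tf = Counter(w for d in collection for w in d)
    coll_size = sum(coll_tf.values())

    def p_query(doc):  # formula (1): product of smoothed keyword probabilities
        p = 1.0
        for q in query:
            p *= p_word(q, doc, coll_tf, coll_size, lam)
        return p

    top = sorted(collection, key=p_query, reverse=True)[:k]   # build R_Q
    weights = [p_query(d) for d in top]
    norm = sum(weights)
    vocab = {w for d in top for w in d}
    # Formula (8): mix the document models, weighted by (normalized) P(Q|m_j).
    return {w: sum((wt / norm) * p_word(w, d, coll_tf, coll_size, lam)
                   for wt, d in zip(weights, top))
            for w in vocab}

docs = [["petrol", "crisis", "japan"],
        ["petrol", "prices", "korea"],
        ["stock", "market", "news"]]
model = relevance_model(["petrol"], docs, k=2)
# Words co-occurring with "petrol" in the top documents get nonzero mass.
```

The resulting distribution assigns probability mass to words like "crisis" and "japan" that never appear in the query, which is the point of the relevance model.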

The main contribution of relevance modeling is the probabilistic approach discussed above to estimate P(w_i|R) using the query alone, which had been done in a heuristic fashion in previous works (Robertson & Jones, 1976). This approximation to P(w_i|R) can later be used for applying the probability ranking principle. For instance, the authors of Lavrenko and Croft (2001) represent the documents as a sequence of independent words (let w_i ∈ d_j be each one of these words) and propose to rank the documents by

¹ Notice that it is not a realistic assumption, since correlation between words always exists in texts. However, as in many other retrieval models, we need to assume independence in order to compute the joint probability.


\frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \approx \prod_{w_i \in d_j} \frac{P(w_i \mid R)}{P(w_i \mid \bar{R})}    (9)

where \bar{R} denotes the set of non-relevant documents.

The models of relevance have been shown to outperform baseline language modeling and tf/idf IR systems in TREC ad-hoc retrieval and TDT topic tracking tasks (Lavrenko & Croft, 2001). Moreover, relevance modeling provides a theoretically well-founded framework where it is possible not only to calculate the probability of finding a word in the set of documents relevant to an IR query, but also to estimate the probability of observing any arbitrary type of object described in this set of relevant documents. For example, in Lavrenko, Feng, and Manmatha (2003) relevance models are applied in image retrieval tasks to compute the joint probability of sampling a set of image features and a set of image annotation words.

The notion of the set R_Q of documents relevant to an IR query can be used for representing the context of analysis in a contextualized warehouse. The relevance model presented in this paper adapts this idea to estimate the probability of observing a corporate fact described in the set of documents relevant to the context of analysis.

4. The facts relevance model

In this section, we propose a relevance model to calculate the relevance of a fact with respect to a selected context (i.e., to an IR query). Intuitively, a fact will be relevant for the selected context if the fact is found in a document which is also relevant for this context. We will consider that a fact is important in a document if its dimension values are mentioned frequently in the document's textual contents.

We assume that each document d_j describes a set of facts {f_i}, and that the document and its fact set were generated by a model m_j that emits words, with a probability P(w_i|m_j), and facts, with a probability P(f_i|m_j).

Definition 1. Let D_1, D_2, ..., D_n be the dimensions defined in the corporate warehouse OLAP cubes. A fact f_i consists of an n-tuple of dimension values (v_1, v_2, ..., v_n), where v_k ∈ D_k, meaning that each v_k is a value of the dimension D_k. By v_k ∈ f_i we will mean that v_k is a dimension value of the fact f_i.

The tuple (fo1, Japan, 1998/10) ∈ Products × Customers × Time represents the fact f4 of the cube characterized by the dimensions Products, Customers and Time, shown in Fig. 1.

Notice that at this point we are only concerned with the occurrence of dimension values in the documents, independently of the hierarchy level to which they belong. Then, in the relevance model, we simply consider a dimension as the flat set that includes all the members of the dimension hierarchy levels, as specified in the corporate warehouse schema. For example, we do not make explicit the mapping of customers into cities, or states into countries, in the Customers dimension; we just represent the dimension Customers by the set that comprises all the values of its hierarchy levels (e.g., Customers = Customer ∪ City ∪ State ∪ Country).

Definition 2. Let Q = q_1, q_2, ..., q_n be an IR query, consisting of a sequence of keywords q_i, and let R_Q be the set of models that generated the documents relevant to this query. We compute the relevance of a fact f_i to the query Q by the probability P(f_i|R_Q) of emitting this fact from the set R_Q of models relevant to the query, as follows:

P(f_i \mid R_Q) = \frac{\sum_{m_j \in R_Q} P(f_i \mid m_j)\, P(Q \mid m_j)}{\sum_{m_j \in R_Q} P(Q \mid m_j)}    (10)

That is, we estimate the relevance of a fact by calculating the probability of observing it in the set of documents relevant to the query. In formula (10), P(Q|m_j) is the probability of emitting the query keywords from the model m_j. This probability is computed by the language modeling formula (1).

Definition 3. P(f_i|m_j) is the probability of emitting the fact f_i from the model m_j, which is estimated as follows:

P(f_i \mid m_j) = \frac{\sum_{v_k \in f_i} freq(v_k, d_j)}{|d_j|_v}    (11)

where freq(v_k, d_j) is the number of times that the dimension value v_k is mentioned in d_j, and |d_j|_v is the total number of dimension values found in the document d_j.
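Formulas (10) and (11) can be sketched as follows; the data structures, example documents and likelihood values are hypothetical, not taken from the paper's fact base.

```python
from collections import Counter

def p_fact(fact, doc_dim_values):
    """Formula (11): sum of freq(v_k, d_j) over the fact's dimension
    values, divided by |d_j|_v, the number of dimension values in d_j."""
    tf = Counter(doc_dim_values)
    return sum(tf[v] for v in fact) / len(doc_dim_values)

def fact_relevance(fact, rq):
    """Formula (10). rq lists, for each relevant document, the dimension
    values found in its text together with its query likelihood P(Q|m_j)."""
    num = sum(p_fact(fact, dims) * pq for dims, pq in rq)
    den = sum(pq for _, pq in rq)
    return num / den

# Hypothetical relevant documents: the dimension values spotted in each
# document's text, paired with its query likelihood.
rq = [(["Japan", "1998/10", "Japan"], 0.04),
      (["Korea", "1998/11"], 0.01)]
f4 = ("fo1", "Japan", "1998/10")
print(round(fact_relevance(f4, rq), 3))  # 0.8
```

The fact scores highly because all of its mentioned dimension values occur in the most query-likely document; a fact with no dimension values in any relevant document would score zero.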

The approach discussed above to compute the probability P(f_i|R_Q) is based on the relevance modeling techniques presented in Section 3. However, we have adapted these techniques to estimate the probability of facts instead of document words. Next, we point out the major similarities and differences between the two approaches.

The probability P(Q|m_j) can be expressed in terms of the probability P(m_j|Q) by applying the conditional probability formula (5). By including the expression (5) in the formula (10), we have that

P(f_i \mid R_Q) = \frac{\sum_{m_j \in R_Q} P(f_i \mid m_j)\, \frac{P(m_j \mid Q)\, P(Q)}{P(m_j)}}{\sum_{m_j \in R_Q} P(Q \mid m_j)}    (12)

In formula (12), P(Q) is the joint probability of emitting the query keywords from the set of models R_Q, and P(m_j) denotes the probability of selecting a model from this set. In order to estimate the probability P(Q), we compute the total probability


Table 2
Topic number, title and expected top-ranked industry of the TREC topics selected for the experiment.

Topic #   Title                                                            Industry
109       Find Innovative Companies                                        Software & Computer Services
112       Funding Biotechnology                                            Biotechnology
124       Alternatives to Traditional Cancer Therapies                     Health Care Equipment & Services
133       Hubble Space Telescope                                           Aerospace & Defense
135       Possible Contributions of Gene Mapping to Medicine               Biotechnology
137       Expansion in the US Theme Park Industry                          Media
143       Why Protect US Farmers?                                          Food Producers
152       Accusations of Cheating by Contractors on US Defense Projects    Aerospace & Defense
154       Oil Spills                                                       Oil & Gas Producers
162       Automobile Recalls                                               Automobiles & Parts
165       Tobacco Company Advertising and the Young                        Tobacco
173       Smoking Bans                                                     Tobacco
179       U. S. Restaurants in Foreign Lands                               Restaurants & Bars
183       Asbestos Related Lawsuits                                        Construction & Materials
187       Signs of the Demise of Independent Publishing                    Media
198       Gene Therapy and Its Benefits to Humankind                       Biotechnology


Fig. 2. Average precision versus recall obtained for the selected TREC topics.


of emitting the query keywords from each model in R_Q; see formula (13). Notice that the assumption that we make here is equivalent to the one made by the relevance modeling works in formula (4) to calculate the joint probability P(w_i, Q).

P(Q) = \sum_{m_j \in R_Q} P(Q \mid m_j)\, P(m_j)    (13)

By considering that the probability P(m_j) is constant, and replacing the probability P(Q) by the previous expression, we have that formula (12) is equivalent to

P(f_i \mid R_Q) = \sum_{m_j \in R_Q} P(f_i \mid m_j)\, P(m_j \mid Q)    (14)

Notice the similarity between formula (14) and the relevance modeling formula (8) used for computing the probability P(w_i|R_Q). The difference is that, whereas the ordinary relevance modeling proposals approximated the probability P(w_i|R) by the probability of observing the word w_i once the query keywords Q have been observed in the documents, i.e., P(w_i|R) ≈ P(w_i|Q), we approximate the probability P(f_i|R) by the probability of finding the fact f_i when the query keywords Q have been previously found in the documents, that is, P(f_i|R) ≈ P(f_i|Q).


Table 3
Average precision versus recall values obtained for the selected TREC topics.

Recall level   Average precision
0.0            0.8403
0.1            0.6671
0.2            0.5690
0.3            0.5283
0.4            0.4472
0.5            0.4167
0.6            0.3728
0.7            0.3525
0.8            0.2556
0.9            0.1697
1.0            0.0057

Average        0.4205

Table 4
R-Precision obtained for each TREC topic.

Topic #   R-Precision
109       0.2727
112       0.2500
124       0.5000
133       0.4762
135       0.7500
137       0.7083
143       0.4615
152       0.1852
154       0.5294
162       0.3333
165       0.3500
173       0.5526
179       0.1250
183       0.6842
187       0.3718
198       0.7419

Average   0.4558


Fig. 3. R-Precision histogram for the selected TREC topics.



Fig. 4. Average F-measure for the selected TREC topics with different sizes of the result set.


5. Experiments and results

This section evaluates the proposed relevance model with the Wall Street Journal (WSJ) TREC test collection (Harman, 1995) and a fact database constructed from the metadata available in the documents. In our experiments, we took a set of example information requests (called topics in TREC), determined the expected most relevant fact in the result for each topic, and analyzed the quality of the ranking of the facts provided by our model.

It is important to emphasize that the objective here is not to evaluate document retrieval performance. The formulas used in our approach for estimating the relevance of a document and building the set R_Q of relevant models are based on those of language modeling, which have already been shown to obtain good performance results (Song & Croft, 1999). The final objective of our experiments is to evaluate the proposed fact relevance ranking approach.

Next, we introduce the document collection, the fact database and the topics selected for the experiments. Afterwards, we show how we built the IR queries for the topics and tuned the set R_Q. Finally, we study the results obtained when ranking the facts with the relevance model.

5.1. Document collection, fact base and topics

In our experiments we considered the 1990-WSJ subcollection from TREC disk 2, a total of 21,705 news articles published during 1990. The news articles of the WSJ subcollection contain metadata. These metadata comprise, among other information, the date of publication of the article and the list of companies reported by the news article. By combining the date of publication and the company list of each article, we built a (Date, Company) fact database. For each fact, we also kept the news articles where the corresponding (Date, Company) pair was found. Thus, our experiments involved two dimensions: the Date and the Companies dimensions. In the Companies dimension, the companies described by the WSJ articles are organized into Industries, which are in turn classified into Sectors. The correspondence between companies, industries and sectors is based on the Yahoo Finance2 companies classification.
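The construction of the (Date, Company) fact base can be sketched as follows; the article records below are invented for illustration, whereas the real collection draws these fields from the WSJ article metadata.

```python
from collections import defaultdict

# Each article carries a publication date and a company list in its
# metadata; every (date, company) pair becomes a fact, and we keep the
# articles in which the pair was found. Sample records are hypothetical.
articles = [
    {"id": "WSJ900101-001", "date": "1990-01-01", "companies": ["IBM", "Apple"]},
    {"id": "WSJ900102-007", "date": "1990-01-02", "companies": ["IBM"]},
]

fact_base = defaultdict(list)           # (date, company) -> article ids
for art in articles:
    for company in art["companies"]:
        fact_base[(art["date"], company)].append(art["id"])

print(fact_base[("1990-01-01", "IBM")])  # ['WSJ900101-001']
```

Keeping the supporting article ids per fact is what later allows the relevance model to score each fact against the documents retrieved for a given context.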

We selected 16 topics from the TREC-2 and TREC-3 conferences by choosing the topics that have at least 20 documents in the provided solution set of documents relevant to the topic. We imposed this restriction to ensure that the set of relevant documents was large enough to find several samples of the dimension values relevant to the query. Furthermore, we examined the textual description of each selected topic in order to determine the industry most likely related to the theme of the topic, that is, the industry of the companies that are expected to be found in the top-ranked facts for each topic.

² http://finance.yahoo.com.

Page 9: A relevance model for a data warehouse contextualized with documents

Table 5. Top-ranked industries for the TREC topics 109–152.

Industry                             Relevance

Topic 109, expected industry = Software & computer services
  Software & computer services       0.6772
  Technology hardware & equipment    0.2510
  Fixed line telecommunications      0.0297
  Chemicals                          0.0211

Topic 112, expected industry = Biotechnology
  Biotechnology                      0.7565
  Pharmaceuticals                    0.0981
  Aerospace & defense                0.0426

Topic 124, expected industry = Health care equipment & services
  Biotechnology                      0.6496
  Health care equipment & services   0.1778
  Pharmaceuticals                    0.1439
  Food & drug retailers              0.0249
  Technology hardware & equipment    0.0038

Topic 133, expected industry = Aerospace & defense
  Aerospace & defense                0.9793
  General retailers                  0.0207

Topic 135, expected industry = Biotechnology
  Biotechnology                      0.8870
  Pharmaceuticals                    0.0460
  Chemicals                          0.0385
  Health care equipment & services   0.0213

Topic 137, expected industry = Media
  Media                              0.6262
  Industrial metals                  0.3019
  Food producers                     0.0234
  General retailers                  0.0151

Topic 143, expected industry = Food producers
  Food producers                     0.9999
  Chemicals                          3.2e-5

Topic 152, expected industry = Aerospace & defense
  Aerospace & defense                0.9881
  Technology hardware & equipment    0.0083
  Electronic & electrical equipment  0.0016

364 J.M. Pérez et al. / Information Processing and Management 45 (2009) 356–367

Table 2 shows the topic number, title and expected top-ranked industry of the TREC topic set considered in our experiments. For example, as this table shows, the expected most relevant industry for TREC topic number 198, entitled "Gene Therapy and Its Benefits to Humankind", is Biotechnology.

5.2. Building the set RQ

In order to estimate the relevance of the facts accurately, we need an acceptable description of each selected topic in the corresponding set of relevant models RQ. Next, we show how we constructed and tuned the context of analysis (i.e., the set RQ) for the test topics.

For each topic, we specified a short IR query (fewer than four keywords), and then we retrieved the set of documents relevant to this query, as discussed in Section 3. The smoothing parameter λ of formula (2) determines the features of the top-ranked documents, mainly their length (Losada & Azzopardi, 2008). Larger documents are usually ranked in the first positions as λ decreases. In our case, larger documents are more likely to describe more dimension values than shorter ones, and therefore they can contribute better to contextualizing facts. Additionally, it is well known that short queries require less smoothing than longer ones. For these reasons, we set the smoothing parameter λ to 0.1 in our experiments. Nevertheless, a deeper study of the influence of the smoothing method on the results must be carried out in the future.
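The exact form of formula (2) appears earlier in the paper; as an illustrative reconstruction only, a standard Jelinek-Mercer smoothed query-likelihood estimate of the family analyzed by Losada and Azzopardi (2008) can be sketched as follows, with toy documents rather than the paper's collection:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.1):
    """Log query likelihood with Jelinek-Mercer smoothing:
    P(w|d) = (1 - lam) * P_ml(w|d) + lam * P(w|collection).
    A small lam (here 0.1) smooths little, which suits short queries.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for w in query_terms:
        p_doc = tf[w] / len(doc_terms) if doc_terms else 0.0
        p_coll = collection_tf.get(w, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

# Rank two toy documents for the query "oil spill":
docs = {"d1": ["oil", "spill", "ocean"], "d2": ["stock", "market", "rally"]}
collection = [w for terms in docs.values() for w in terms]
collection_tf = Counter(collection)
scores = {d: query_likelihood(["oil", "spill"], t, collection_tf, len(collection))
          for d, t in docs.items()}
```

With a small λ, the document actually containing the query terms dominates the ranking, since little probability mass is borrowed from the collection model.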

The query keywords were interactively selected to reach an acceptable precision versus recall figure (Baeza-Yates & Ribeiro-Neto, 1999). Typically, an "acceptable" retrieval performance is considered to be achieved when the precision is over 40% at low recall values, e.g., 20%; greater than 30% for a recall of 50%; and no lower than 10% for high recall percentages like 80%. See, for example, the evaluations of Harman (1995), Lavrenko and Croft (2001) and Song and Croft (1999). Fig. 2 illustrates the average precision values obtained at the 11 standard recall levels for the selected topics. The percentages are over the acceptable margins quoted above. Table 3 details these precision values.
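The 11-point figures reported in Fig. 2 and Table 3 follow the standard interpolated-precision procedure; a compact version, shown here only to make the measure explicit (the ranking below is a toy example, not the paper's data), could look like this:

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.

    At each level, the interpolated precision is the maximum precision
    observed at any recall point greater than or equal to that level.
    """
    rel = set(relevant)
    hits = 0
    observed = []  # (recall, precision) after each relevant document retrieved
    for rank, doc in enumerate(ranked, start=1):
        if doc in rel:
            hits += 1
            observed.append((hits / len(rel), hits / rank))
    return [max((p for r, p in observed if r >= level / 10), default=0.0)
            for level in range(11)]

# Toy ranking where documents d1 and d3 are the relevant ones:
points = eleven_point_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```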

The R-Precision is a useful parameter for measuring the quality of the result set for each individual topic, when the ideal set R of documents judged to be relevant is known (Baeza-Yates & Ribeiro-Neto, 1999). Given |R|, the number of documents


Table 6. Top-ranked industries for the TREC topics 154–198.

Industry                                   Relevance

Topic 154, expected industry = Oil & gas producers
  Oil & gas producers                      0.6348
  Oil equipment, services & distribution   0.3192
  Industrial transportation                0.0460

Topic 162, expected industry = Automobiles & parts
  Aerospace & defense                      0.5426
  Automobiles & parts                      0.4562
  Oil & gas producers                      0.0008
  Chemicals                                0.0002

Topic 165, expected industry = Tobacco
  Tobacco                                  0.6356
  Media                                    0.1473
  Airlines                                 0.1263
  Industrial transportation                0.0877
  Aerospace & defense                      0.0029

Topic 173, expected industry = Tobacco
  Airlines                                 0.4585
  Tobacco                                  0.3525
  Media                                    0.0701
  Industrial transportation                0.0431

Topic 179, expected industry = Restaurants & bars
  Restaurants & bars                       0.8930
  Beverages                                0.0476
  Travel & leisure                         0.0229

Topic 183, expected industry = Construction & materials
  Construction & materials                 0.7459
  Media                                    0.0591
  Chemicals                                0.0579

Topic 187, expected industry = Media
  Media                                    0.9375
  Technology hardware & equipment          0.0283
  General retailers                        0.0225

Topic 198, expected industry = Biotechnology
  Biotechnology                            0.8858
  Chemicals                                0.0718
  Pharmaceuticals                          0.0249


in the ideal set R, it calculates the precision for the |R| top-ranked documents in the result set. Table 4 shows the R-Precision values obtained for each topic, as well as the resulting average R-Precision. Fig. 3 depicts the corresponding R-Precision histogram.
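As a concrete illustration of the measure (with toy data, not the paper's relevance judgments), R-Precision can be computed as:

```python
def r_precision(ranked, relevant):
    """Precision over the |R| top-ranked documents, where R is the ideal
    set of documents judged relevant for the topic."""
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

# With |R| = 3 relevant documents, precision is taken over the top 3:
value = r_precision(["a", "b", "c", "d"], {"a", "c", "d"})  # hits: a and c
```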

As stated in Section 3, the set RQ comprises the models associated with the k top-ranked documents of the query. We now turn our attention to tuning the size of the RQ sets. Here, our purpose is to determine the number k of top-ranked documents to be considered in RQ that maximizes the retrieval performance. In this case, we use a different performance measure, called the F-measure (Baeza-Yates & Ribeiro-Neto, 1999), which calculates the harmonic mean of precision and recall. Maximizing the F-measure means finding the best possible combination of precision and recall. We computed the average F-measure for the selected TREC topics with different sizes of the result set. As Fig. 4 shows, the maximum value is 0.4534, reached when the result set contains the 36 top-ranked documents.
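The tuning procedure amounts to computing the F-measure at each candidate cut-off k and keeping the maximizing one. The Python sketch below illustrates the idea on invented data; in the paper's experiment the F-measure is averaged over all 16 topics before choosing k = 36.

```python
def f_measure(k, ranked, relevant):
    """Harmonic mean of precision and recall over the k top-ranked documents."""
    hits = len(set(ranked[:k]) & set(relevant))
    if hits == 0:
        return 0.0
    precision, recall = hits / k, hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def best_cutoff(ranked, relevant, max_k):
    """Return the (k, F) pair maximizing the F-measure for sizes 1..max_k."""
    return max(((k, f_measure(k, ranked, relevant)) for k in range(1, max_k + 1)),
               key=lambda pair: pair[1])

# Toy ranking: the best cut-off is the one covering all three relevant documents.
k_best, f_best = best_cutoff(["d1", "d2", "d3", "d4", "d5"], {"d1", "d2", "d4"}, 5)
```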

5.3. Evaluation of results

Finally, in this section we evaluate the fact relevance ranking results obtained with our model. For each topic, we considered the facts described by the 36 top-ranked documents in the corresponding set RQ. We grouped the facts by industry, and calculated their relevance to the IR query following the approach discussed in Section 4. Tables 5 and 6 show the industries, along with their relevance, at the top of the ranking for each topic.
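A simplified sketch of this aggregation step (not the paper's exact estimator, which Section 4 defines) is to weight each top-ranked document by a normalized relevance weight and sum the weights over the industries of the companies it mentions:

```python
from collections import defaultdict

def rank_industries(top_docs, doc_weights, doc_industries):
    """Rank industries by summing normalized document weights.

    `doc_weights` are assumed to be positive relevance weights (e.g.,
    exponentiated and normalized query-likelihood scores); a document
    mentioning several industries splits its weight among them.
    """
    total = sum(doc_weights[d] for d in top_docs)
    relevance = defaultdict(float)
    for d in top_docs:
        share = doc_weights[d] / total
        for industry in doc_industries[d]:
            relevance[industry] += share / len(doc_industries[d])
    return sorted(relevance.items(), key=lambda item: item[1], reverse=True)

# Toy context: two documents, the first more relevant to the query.
ranking = rank_industries(
    ["d1", "d2"],
    {"d1": 0.6, "d2": 0.4},
    {"d1": ["Biotechnology"], "d2": ["Biotechnology", "Chemicals"]},
)
```

The resulting scores form a probability-like distribution over industries, which is the shape of the figures reported in Tables 5 and 6.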

We can conclude that the results demonstrate the effectiveness of the approach. For all the topics, even for those where the R-Precision was low (see, for example, topics 152 and 179), the expected industry is found at the first (81% of the topics) or the second (19%) position of the ranking.

Furthermore, the relevance value assigned to the facts clearly differentiates the industries that are directly related to the topic of analysis from those that are not so relevant. In almost all cases, the relevance value decreases by approximately one order of magnitude. For example, in topic number 154, the first (Oil & Gas Producers) and second (Oil Equipment, Services & Distribution) ranked industries are clearly related to the theme of the topic ("Oil spills"). The relevance values assigned to these industries (0.6348 and 0.3192, respectively) are significantly greater than the relevance value of the next industry in the ranking (Industrial Transportation, 0.0460).

We also find an explanation for some of the topics where the ranking was not completely accurate. The top-ranked industry for topic number 173 is Airlines, whereas the expected industry, Tobacco, is found at the second position of the ranking. The reason is that a number of the documents judged to be relevant for this topic report smoking bans on flights. The industry at the top of the ranking for topic 137 is Media, since many media companies also own theme parks (e.g., Time Warner/Warner Bros. Entertainment). In fact, in our Companies dimension, the Media industry also comprises these recreation and entertainment companies. The second top-ranked industry for this topic is Industrial Metals, which still has a relatively high relevance value. Although this industry initially seemed irrelevant for topic 137, after reading some of the documents retrieved for this topic, we discovered a group of news articles reporting the Japanese company Nippon Steel's diversification strategy in the amusement-park sector.

6. Conclusions

This paper introduces a new relevance model aimed at ranking the structured data (facts) and documents of a contextualized warehouse when the user establishes an analysis context (i.e., runs an IR query). The approach can be summarized as follows. First, we use language modeling formulas (Ponte & Croft, 1998) to rank the documents by the probability of emitting the query keywords from the respective language model. Then, we adapt relevance modeling techniques (Lavrenko & Croft, 2001) to estimate the relevance of the facts by the probability of observing their dimension values in the top-ranked documents.

We have evaluated the model with the Wall Street Journal (WSJ) TREC test subcollection and a fact database constructed from the metadata available in the documents. The results obtained are encouraging. The experiments show that our relevance model is able to clearly differentiate the facts that are directly related to the test topics from those that are not so relevant. We found the expected top-ranked fact in the first or the second position of the ranking for all of the 16 topics selected. A deeper study of the influence of the smoothing method in our approach remains to be done.

In the prototype of the contextualized warehouse presented in Pérez et al. (2007), a corporate warehouse with data from the world's major stock indices is contextualized with a repository of business articles, also selected from the WSJ TREC collection. The prototype involved a dataset of 1936 (Date, Market, Stock Index value) facts and 132 documents. Although we did not formally evaluate the relevance model of the prototype, we showed some analysis examples where the relevant articles explain the increases and decreases of the stock indices. Testing the performance of the contextualized warehouse analysis operations with larger datasets and studying query optimization techniques is also future work.

One of the current research lines in the field of IR is opinion retrieval (Eguchi & Lavrenko, 2006; Liu, Hu, & Cheng, 2005). These papers propose specific techniques for retrieving and classifying the opinions expressed in small text fragments (like the posts of a web forum). We are currently working on extending our retrieval model with opinion retrieval techniques in order to contextualize a traditional company's sales data warehouse with documents gathered from web forums, where the customers review the products/services of the company.

References

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern information retrieval. ACM Press/Addison-Wesley.

Codd, E. F. (1993). Providing OLAP to user-analysts: An IT mandate.

Danger, R., Berlanga, R., & Ruiz-Shulcloper, J. (2004). CRISOL: An approach for automatically populating semantic web from unstructured text collections. In Proceedings of the 15th international conference on database and expert systems applications (pp. 243–252).

Eguchi, K., & Lavrenko, V. (2006). Sentiment retrieval using generative models. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 345–354).

Harman, D. K. (1995). Overview of the third text retrieval conference (TREC-3). In D. K. Harman (Ed.), Overview of the third text retrieval conference (TREC-3) (pp. 1–19). NIST Special Publication 500-225.

Inmon, W. H. (2005). Building the data warehouse. John Wiley & Sons.

Lavrenko, V., Allan, J., DeGuzman, E., LaFlamme, D., Pollard, V., & Thomas, S. (2002). Relevance models for topic detection and tracking. In Proceedings of the second international conference on human language technology research (pp. 115–121). San Francisco, CA: Morgan Kaufmann Publishers Inc.

Lavrenko, V., & Croft, W. B. (2001). Relevance-based language models. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 120–127).

Lavrenko, V., Feng, S. L., & Manmatha, R. (2003). Statistical models for automatic video annotation and retrieval. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (pp. 17–21).

Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the web. In Proceedings of the 14th international conference on the world wide web (pp. 342–351).

Llidó, D. M., Berlanga, R., & Aramburu, M. J. (2001). Extracting temporal references to assign document event-time periods. In Proceedings of the 12th international conference on database and expert systems applications (pp. 62–71).

Losada, D. E., & Azzopardi, L. (2008). An analysis on document length retrieval trends in language modeling smoothing. Information Retrieval, 11(2), 109–138.

Pedersen, T. B., & Jensen, C. S. (2005). Multidimensional databases. In R. Zurawski (Ed.), The industrial information technology handbook (pp. 1–13). CRC Press.

Pérez, J. M. (2007). Contextualizing a data warehouse with documents. PhD thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat Jaume I de Castelló (Spain).

Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2007). R-cubes: OLAP cubes contextualized with documents. In Proceedings of the IEEE 23rd international conference on data engineering (pp. 1477–1478).

Pérez, J. M., Berlanga, R., Aramburu, M. J., & Pedersen, T. B. (2008). Contextualizing data warehouses with documents. Decision Support Systems, 45(1), 77–94.

Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 275–281). New York, NY: ACM Press.

Robertson, S. (1997). The probability ranking principle in IR. In Readings in information retrieval (pp. 281–286). Morgan Kaufmann Publishers Inc.

Robertson, S., & Jones, K. S. (1976). Relevance weighting of search terms. Journal of the American Society of Information Science, 27(3), 129–146.

Song, F., & Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the eighth international conference on information and knowledge management (pp. 316–321). New York, NY: ACM Press.

Juan Manuel Pérez obtained the B.S. degree in Computer Science in 2000, and the Ph.D. degree in 2007, both from Universitat Jaume I, Spain. Currently, he is an associate lecturer at this university. He is the author of a number of papers in international journals and conferences such as Decision Support Systems, IEEE Transactions on Knowledge and Data Engineering, DEXA, ECIR, ICDE, DOLAP, etc. His research interests are information retrieval, multidimensional databases, and web-based technologies.

Rafael Berlanga is an associate professor of Computer Science at Universitat Jaume I, Spain. He received the B.S. degree in Physics from Universidad de Valencia, and the Ph.D. degree in Computer Science in 1996 from the same university. He is the author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, among others, and numerous communications in international conferences such as DEXA, ECIR, CIARP, etc. His current research interests are knowledge bases, information retrieval, and temporal reasoning.

María José Aramburu is an associate professor of Computer Science at Universitat Jaume I, Spain. She obtained the B.S. degree in Computer Science from Universidad Politécnica de Valencia in 1991, and a Ph.D. from the School of Computer Science of the University of Birmingham (UK) in 1998. She is the author of several articles in international journals, such as Information Processing & Management, Concurrency: Practice and Experience, and Applied Intelligence, and numerous communications in international conferences such as DEXA, ECIR, etc. Her main research interests include document databases and their applications.