

A Hybrid Technique for English-Chinese Cross Language Information Retrieval

DONG ZHOU, University of Nottingham
MARK TRURAN, University of Teesside
TIM BRAILSFORD, University of Nottingham
and HELEN ASHMAN, University of South Australia

In this article we describe a hybrid technique for dictionary-based query translation suitable for English-Chinese cross language information retrieval. This technique marries a graph-based model for the resolution of candidate term ambiguity with a pattern-based method for the translation of out-of-vocabulary (OOV) terms. We evaluate the performance of this hybrid technique in an experiment using several NTCIR test collections. Experimental results indicate a substantial increase in retrieval effectiveness over various baseline systems incorporating machine- and dictionary-based translation.

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Storage and Retrieval - Information Storage and Retrieval; I.2.1 [Computing Methodologies]: Artificial Intelligence - Natural Language Processing

General Terms: Algorithms, Languages, Measurement, Theory

Additional Key Words and Phrases: Cross language information retrieval, disambiguation, graph-based analysis, patterns, unknown term translation

ACM Reference Format:

Zhou, D., Truran, M., Brailsford, T. and Ashman, H. 2008. A hybrid technique for English-Chinese cross language information retrieval. ACM Trans. Asian Lang. Inform. Process. 7, 2, Article 5 (June 2008), 35 pages. DOI = 10.1145/1362782.1362784. http://doi.acm.org/10.1145/1362782.1362784.

This work was partially funded by a scholarship from the University of Nottingham and the Hunan University.

Author's address: Dong Zhou, School of Computer Science, University of Nottingham, Jubilee Campus, Nottingham, NG8 1BB, UK; email: [email protected]

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]

© 2008 ACM 1530-0226/2008/06-ART5 $5.00 DOI: 10.1145/1362782.1362784. http://doi.acm.org/10.1145/1362782.1362784.

ACM Transactions on Asian Language Information Processing, Vol. 7, No. 2, Article 5, Pub. date: June 2008.


1. INTRODUCTION

The distinguishing hallmark of a cross language information retrieval (CLIR) engine is the linguistic disparity between the queries which are submitted and the documents which are retrieved. To resolve this disparity, all CLIR engines are required to incorporate some facility for language translation, an obvious requirement if query representations and document representations are to be meaningfully compared. In the past, this translation has often been effected using a machine readable bilingual dictionary [Adriani 2000; Ballesteros and Croft 1998; Jang et al. 1999; Maeda et al. 2000; Liu et al. 2005; Gao and Nie 2006].

Unfortunately, bilingual dictionaries have two major weaknesses. The first is an inherent tendency towards ambiguity. This problem stems from the choice of possible translations. A typical bilingual dictionary will provide a set of alternative translations for each term within any given query. Choosing the correct translation of each term is a difficult task, and one that can seriously impact the efficiency of any related retrieval functions. The use of co-occurrence information extracted from a representative document collection has gone some way to address this shortcoming [Adriani 2000; Ballesteros and Croft 1998; Jang et al. 1999; Maeda et al. 2000; Liu et al. 2005; Gao and Nie 2006], but the results published to date are still some distance from optimal.

The second major weakness peculiar to this type of translation is known as the coverage problem. This refers to the limited linguistic scope of bilingual dictionaries. Certain types of words are not commonly found in this type of resource, and it is these out-of-vocabulary (OOV) terms that will cause difficulties during automatic translation. A nonexhaustive list of word types that fall into this problem category is as follows:

- Compound words

- Proper names, such as the identifiers attached to persons or organizations

- Technical terms, including newly invented words and phrases from specialist disciplines

Previous attempts to solve the problem of unknown terms have tended to concentrate on complex statistical solutions, but so far progress has been hampered by the difficulties inherent in separating genuine OOV translations from noisy terms [Cheng et al. 2004; Zhang and Vines 2004; Zhang et al. 2005; Lu et al. 2007].

1.1 The Contributions of This Article

In this article we present a practical solution for each of the problems described above. First, we describe a novel technique for the resolution of ambiguity in dictionary-based query translation which considers co-occurrence information as a network susceptible to graph-based analysis (footnote 1). Second, we introduce a new approach to the translation of OOV terms which utilizes computationally inexpensive pattern-based processing of mixed language text to generate candidate translations. Finally, we present an extensive evaluation of a hybrid cross language retrieval system which incorporates these two techniques.

Footnote 1: We are certainly not the first researchers to apply a graph-based approach to the problem of disambiguation. However, to the best of our knowledge, we are the first to apply this technique to query translation (see further Section 2.1).

1.2 The Structure of the Article

The remainder of the article is organized as follows: Section 2 examines previous work related to query term disambiguation and unknown term translation, Section 3 describes a graph-based disambiguation algorithm, Section 4 examines a pattern-based approach to OOV term translation, Section 5 describes the hybridization of these complementary techniques within a single CLIR system, Sections 6 and 7 document the process we used to evaluate this system, and Section 8 concludes and speculates on work outstanding.

2. RELATED WORK

2.1 Candidate Term Ambiguity

Techniques addressing the problem of ambiguity in the context of CLIR have steadily increased in sophistication over time. In the early stages of CLIR development, researchers often used fairly arbitrary rules for the disambiguation of candidate translations (e.g., selecting the first matching entry in a bilingual dictionary as the final translation). More recently, highly developed techniques exploiting term co-occurrence statistics have been brought to bear on the problem [Adriani 2000; Ballesteros and Croft 1998; Jang et al. 1999; Maeda et al. 2000; Liu et al. 2005; Gao and Nie 2006].

The hypothesis grounding the use of term co-occurrence data in this context states that the correct translations of individual query terms will tend to co-occur as part of a sublanguage while incorrect translations will not. In other words, this approach should be able to determine the most likely translation for a given query by examining the pattern of term co-occurrence within some representative text collection (e.g., the World Wide Web [Maeda et al. 2000] or monolingual/unlinked corpora [Ballesteros and Croft 1998; Gao and Nie 2006]). However, there is a problem with this general approach. This problem relates to the mutually dependent relationship between each term within a multiple term query. Ideally, for each query term under consideration, we would like to choose the best translation that is consistent with the translations selected for all remaining query terms. However, this process of inter-term optimization has proved computationally complex for even the shortest of queries [Gao and Nie 2006]. A common workaround used by several researchers working on this particular problem [Adriani 2000; Ballesteros and Croft 1998; Liu et al. 2005; Gao and Nie 2006] involves use of the following greedy algorithm:

1. Given a source query $q \in Q$, for each word in $q$, acquire the set of all translation alternatives $T_i = \{t_{i,1}, t_{i,2}, \ldots, t_{i,n}\}$ from the translation resources.

2. For each set $T_i$:

(a) For each translation $t_{i,m} \in T_i$, define the similarity measurement between the translation word $t_{i,m}$ and another set $T_j$ ($T_j \neq T_i$) as the sum of the similarities between $t_{i,m}$ and each word in the set $T_j$:

$$sim(t_{i,m}, T_j) = \sum_{t_{j,n} \in T_j} sim(t_{i,m}, t_{j,n})$$

(b) Compute the coherence score for $t_{i,m}$ as

$$f(t_{i,m}) = \sum_{j \neq i} sim(t_{i,m}, T_j)$$

(c) Select the word $w$ in $T_i$ with the highest coherence score

$$w = \arg\max_{t_{i,m} \in T_i} f(t_{i,m})$$
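The greedy procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `greedy_select`, the toy candidate labels, and the similarity values are all hypothetical, and every candidate set is assumed to be non-empty. The similarity function is passed in as a parameter because the cited works use different measures.

```python
from typing import Callable, List

def greedy_select(candidates: List[List[str]],
                  sim: Callable[[str, str], float]) -> List[str]:
    """Pick one translation per query term by maximum coherence score."""
    selected = []
    for i, t_i in enumerate(candidates):
        def coherence(t: str) -> float:
            # f(t) = sum over every other set T_j of sim(t, T_j), where
            # sim(t, T_j) is itself a sum over the members of T_j.
            return sum(sim(t, other)
                       for j, t_j in enumerate(candidates) if j != i
                       for other in t_j)
        selected.append(max(t_i, key=coherence))
    return selected

# Hypothetical similarity table, looked up symmetrically.
_TOY = {("bank_fin", "loan"): 0.9, ("bank_river", "loan"): 0.1,
        ("bank_fin", "lend"): 0.6, ("bank_river", "lend"): 0.0}

def toy_sim(a: str, b: str) -> float:
    return _TOY.get((a, b), _TOY.get((b, a), 0.0))

print(greedy_select([["bank_river", "bank_fin"], ["loan", "lend"]], toy_sim))
```

Each term is resolved independently of the choices made for the other terms, which is the term-independence limitation that later work tries to address.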

For various reasons relating to term independence, the greedy algorithm described above is not ideal, and several attempts have been made to improve on it. For example, Sperer and Oard [2000] presented a modified Pirkola method [Pirkola 1998] called structured translation which used the co-occurrence measurements to assign term weights. In the same vein, Federico and Bertoldi [2002] demonstrated a technique which incorporated the generated co-occurrence information into a query translation Hidden Markov Model (HMM). Work described by Gao et al. [2002] refined the greedy algorithm by modifying the term similarity measurement applied in step 2(b). In this step, the similarity between two terms is usually calculated using the mutual information (MI) between two terms x and y as shown below:

$$sim(x, y) = MI(x, y) = P(x, y) \times \log\left(\frac{P(x, y)}{P(x) \times P(y)}\right)$$

where P(x) is the unigram probability for word x, and P(x, y) is the joint probability of words x and y co-occurring in a predefined text window. Gao et al. [2002] updated this measure by incorporating the distance factor D(x, y), defined as:

$$D(x, y) = e^{-\alpha \times (Dis(x, y) - 1)}$$

where α is the decay rate and Dis(x, y) is the average distance between x and y in the corpus. The resulting decaying co-occurrence model

$$sim(x, y) = MI(x, y) \times D(x, y)$$

was evaluated using the TREC-9 test collection [Voorhees and Harman 2000] and comfortably outperformed the basic MI model (with α = 0.8 yielding the best results). Three years later, Liu et al. [2005] revisited the use of mutual information and introduced a maximum coherence model which estimated the translation probabilities of query words from term co-occurrence statistics by maximizing the overall coherence of the corresponding query. Simultaneously, Monz and Dorr [2005] published an algorithm which estimated the translation probabilities using an iterative expectation-maximization algorithm based on the co-occurrence statistics between all translation candidates.
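The mutual information measure and its decaying variant described earlier in this section can be written out as a small Python sketch. The function names are hypothetical, the probabilities are assumed to have been estimated elsewhere, the natural logarithm is an assumption (the text does not state the base), and returning zero when two terms never co-occur is a convention chosen here for illustration.

```python
import math

def mutual_information(p_xy: float, p_x: float, p_y: float) -> float:
    """MI(x, y) = P(x, y) * log(P(x, y) / (P(x) * P(y)))."""
    if p_xy <= 0.0:
        return 0.0  # assumed convention: no co-occurrence, no evidence
    return p_xy * math.log(p_xy / (p_x * p_y))

def decay(avg_distance: float, alpha: float = 0.8) -> float:
    """D(x, y) = exp(-alpha * (Dis(x, y) - 1))."""
    return math.exp(-alpha * (avg_distance - 1.0))

def decayed_sim(p_xy: float, p_x: float, p_y: float,
                avg_distance: float, alpha: float = 0.8) -> float:
    """sim(x, y) = MI(x, y) * D(x, y)."""
    return mutual_information(p_xy, p_x, p_y) * decay(avg_distance, alpha)
```

With an average distance of 1 the decay factor equals 1, so adjacent terms keep their full MI score, and the score shrinks exponentially as the average distance grows.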



The work described in Section 3 of this article continues this general trend of research into the usefulness of mutual information by considering term co-occurrence as a graph susceptible to recursive analysis. This approach was inspired by the current popularity of techniques based on graphs induced by implicit relationships between documents or other linguistic items [Erkan and Radev 2004; Kurland and Lee 2005; Mihalcea and Tarau 2004; Zhou et al. 2007].

2.1.1 Word Sense Disambiguation. Our approach to candidate term ambiguity shares strong commonalities with research investigating the utility of graph-based algorithms for large-scale word sense disambiguation (WSD) [Mihalcea 2005]. In this particular field, each node within a graph structure represents one possible sense of a given word sequence, while the edges connecting the nodes correspond to semantic dependencies between those interpretations (e.g., antonymy, synonymy). The process of determining the correct sense for a given word sequence exploits these semantic dependencies to calculate the most "important" node, thereby resolving the ambiguous sequence (footnote 2).

The disambiguation algorithm described in this article, which selects the most appropriate translation for a given term using recursive graph-based analysis based on co-occurrence measures, could be viewed as a logical extension to this technique, but is actually quite distinct in practical terms. Our technique employs a number of graph centrality measurements which are driven by term co-occurrence measurements rather than semantic dependencies. Furthermore, unlike the majority of WSD systems, our algorithm is specifically designed for a bilingual text environment.

2.2 Out-of-Vocabulary Terms

The problem of unknown terms is fairly widespread in CLIR research and has attracted a considerable amount of time and study. Early solutions advocated the use of domain-specific bilingual dictionaries, which did deliver greater access to troublesome technical terms and specialized vocabularies, but were often fairly narrow in terms of scope [Pirkola 1998]. Later approaches to the problem adopted the technique of transliteration, which involves identifying similarities in the orthographic structures of two languages in an attempt to generate rules specifying how certain substrings written in one language are spelled in another [AbdulJaleel and Larkey 2003; Buckley et al. 2000; Kang and Kim 2000; Fujii and Ishikawa 2001; Qu et al. 2003; Virga and Khudanpur 2003]. Transliteration is essentially a string matching exercise [Pirkola et al. 2002, 2003], which obviously works best when the languages involved share a common alphabet. Transliteration between languages with dissimilar alphabets requires a process known as phonetic mapping. This involves generating rules (using linguistic analysis or statistical methods) representing the phonetic presentation of the languages involved, then manipulating these rules to resolve the OOV term. Using this technique, Qu et al. [2003] developed a translation system for Japanese names and words using a phonetic English dictionary, and AbdulJaleel and Larkey [2003] developed a generative statistical model for the translation of Arabic into English.

Footnote 2: See Brody et al. [2006] for alternative graph-based approaches which determine word sense using similarity-based algorithms dependent on context.

More recently, consideration of the coverage problem has naturally aligned itself with the World Wide Web as a potential repository of translation patterns [Cheng et al. 2004; Lu et al. 2002; Zhang and Vines 2004; Zhang et al. 2005; Cao et al. 2007]. Most of the work in this area is dependent upon a certain quirk of authorship. When new terms, foreign terms, or proper nouns are used in multilingual Web text, they are frequently accompanied by a courtesy translation in the vicinity of the original text. By specifically targeting these multilingual Web pages, then subjecting the text they contain to statistical analysis, it is often possible to extract these complementary translations [Cheng et al. 2004; Zhang and Vines 2004; Zhang et al. 2005; Cao et al. 2007]. However, there is a drawback. Techniques which address the coverage problem using a statistical analysis of mixed-language Web pages have proved to be fairly error prone. Studying this problem, Chen and Ma commented that the use of exclusively statistical methods in this particular context would inevitably result in low precision and low recall, with poor performance often related to the misidentification of low-frequency terms [Chen and Ma 2002]. Proposing a solution, the authors suggested a consideration of linguistic patterns alongside concurrence measures as the possible antidote.

Now, the use of linguistic and punctuative patterns to assist document processing is well established. Researchers working on the problems of natural language processing have frequently used linguistic patterns to improve ontology generation and knowledge discovery [Iwanska et al. 1999]. Previous experience has demonstrated that even a simple understanding of language can often yield useful information. A good example of this statement in action can be found in Hearst [1992], where the author used just two lexico-syntactic patterns to successfully acquire publication-related hyponyms from text corpora. Hearst used the pattern "such NounPhrase as {NounPhrase,}* {(or/and)} NounPhrase" to target sentences such as "works by such authors as Herrick". Application of this pattern to a text corpus automatically generated a list of hyponyms ("author", "Herrick"), providing a low-cost approach to knowledge discovery. Further examples illustrating the utility of syntactic and linguistic patterns can be found in the field of Web mining. For example, Liu et al. [2003] demonstrated a technique for mining a topic specific eBook from the Web using a pattern-based approach (see also Chen et al. [2005]), and various other authors have successfully extracted class and instance relationships using a similar technique [Cimiano et al. 2004, 2005; Etzioni et al. 2004]. Finally, a fairly contemporary application of patterns is described by Cao et al. [2007], who used a single parenthesis-based template in combination with various transliterative techniques to extract translation candidates.

Informed by the discussion above, we have developed a Web-based translation technique for unknown terms which combines traditional statistical methods with linguistic and punctuative pattern matching. This technique is described in full in Section 4.



Fig. 1. Example of a five term co-occurrence graph, taken from NTCIR-3 query set, no. 001 Title field: Exhibition, Art, Culture, Han, Dynasty.

3. CANDIDATE DISAMBIGUATION USING GRAPH THEORY

We propose that the co-occurrence of possible translation terms within a given corpus can be viewed as a graph. In such a graph each translation candidate of a source query term is represented by a single node. Edges drawn between these nodes are weighted according to a particular co-occurrence measurement. Figure 1 illustrates a graph view of the possible translation candidates for a sample five term query. In this diagram, translation candidates for the same query term have been grouped together, and only the candidates with the highest co-occurrence scorings have been connected.

Viewing the co-occurrence of possible translation terms within a given corpus in this way has one particular advantage. It converts simple concurrence statistics into a pattern susceptible to various analytical techniques pioneered in the field of information retrieval.

3.1 Graph Analysis Algorithm

The basic algorithm for selecting the best translation among a set of possible candidates is as follows:

1. Given a query $Q$ containing several terms $\{q_1, q_2, \ldots, q_n\}$, for each term $q_i$, $i \in [1, n]$, in $Q$, obtain the translation candidates $T(q_i) = \{t_{i,1}, t_{i,2}, \ldots, t_{i,m}\}$.

2. All possible translation candidates of the query terms are generated to form an undirected weighted graph $G = \langle V, W \rangle$, where $V$ is the set of vertices, each representing one translation candidate $t_{i,j}$ of a query term $q_i$, and $W$ is a complete set of weighting functions. Hence, every possible pairing of translation candidates has a nonnegative weight attribute, $w$, which indicates the probable strength of the link between the two translation candidates. The set of weights as a whole can be described as:

$$W : V \times V \longrightarrow \{w \in \mathbb{R} : w \geq 0\}$$

An individual weighting between two translation candidates $t_{i,j}$ and $t_{k,l}$ is given by the function:

$$w(t_{i,j} \longleftrightarrow t_{k,l})$$

(n.b. Each of these weights is calculated using co-occurrence statistics, but within this remit there are still two possible approaches, as discussed in Section 3.2.)

3. For each translation candidate $t_{i,j}$ in the graph, compute the Centrality Score $Cen(t_{i,j})$ (this process is discussed in more detail in Section 3.3).

4. The translation of a query term is then determined by selecting the translation candidate $t_{i,j}$ which produces the maximum Centrality Score in the corresponding set of translation candidates:

$$t_i = \arg\max_{t_{i,j} \in T(q_i)} Cen(t_{i,j})$$

3.2 Weighting Functions

In our algorithm, a weight w is added to each edge connecting a pair of nodes in the co-occurrence graph. This weight indicates the "strength" of the link between the nodes. Two different weighting functions have been developed for this task, which are called StrengthWeighting (SW) and FixedWeighting (FW) respectively. The SW function produces an undirected, weighted graph, while the FW function produces an unweighted alternative.

StrengthWeighting. If the similarity score between two terms is more than a certain threshold β, then the weight between the two terms is equivalent to the similarity score. Otherwise, the weight is set to zero.

$$w(t_{i,j} \longleftrightarrow t_{k,l}) = \begin{cases} sim(t_{i,j}, t_{k,l}) & \text{if } sim(t_{i,j}, t_{k,l}) > \beta \\ 0 & \text{otherwise} \end{cases}$$

FixedWeighting. If the similarity score between two terms is more than a certain threshold β, then the weight between the two terms is set to one. Otherwise, the weight is set to zero.

$$w(t_{i,j} \longleftrightarrow t_{k,l}) = \begin{cases} 1 & \text{if } sim(t_{i,j}, t_{k,l}) > \beta \\ 0 & \text{otherwise} \end{cases}$$



The actual value of β is determined empirically (see Section 6.7.2) and is used here to exclude word pairings with low-frequency counts [Manning and Schutze 1999].
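The two weighting functions can be sketched in Python as follows. The function names, the edge-dictionary representation, and the helper `build_graph` are hypothetical choices made for this illustration. Pairs of candidates belonging to the same query term are skipped, on the assumption (consistent with Section 3.3.2) that their similarity is treated as zero, and zero-weight edges are simply omitted.

```python
from typing import Callable, Dict, List, Tuple

def strength_weight(sim_score: float, beta: float) -> float:
    """SW: keep the similarity score itself when it exceeds beta."""
    return sim_score if sim_score > beta else 0.0

def fixed_weight(sim_score: float, beta: float) -> float:
    """FW: a unit weight when the similarity score exceeds beta."""
    return 1.0 if sim_score > beta else 0.0

def build_graph(candidates: List[List[str]],
                sim: Callable[[str, str], float],
                weight_fn: Callable[[float, float], float],
                beta: float) -> Dict[Tuple[str, str], float]:
    """Return {(u, v): w} for candidate pairs drawn from different terms."""
    edges: Dict[Tuple[str, str], float] = {}
    for i, t_i in enumerate(candidates):
        for t_j in candidates[i + 1:]:
            for u in t_i:
                for v in t_j:
                    w = weight_fn(sim(u, v), beta)
                    if w > 0.0:
                        edges[(u, v)] = w
    return edges
```

Passing `strength_weight` yields the weighted graph and passing `fixed_weight` yields the unweighted one, so the centrality code that follows can stay the same for both.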

3.3 Calculating the Centrality Score

Graph-based analysis is essentially a technique for deciding the importance of a single node using global information recursively drawn from the entire graph. In our formulation of this approach, which is tailored to candidate translation selection, the importance of a node within the graph is described as its centrality. The following subsections describe three different ways of calculating node centrality.

3.3.1 Calculating centrality using the indegree measure. The simplest way of assessing the centrality of a node within a co-occurrence graph involves calculating the number of edges incident upon that node (i.e., the indegree measure). In the context of an undirected graph, this measurement actually equates to a node's degree, since a connecting edge contributes equally to the degree of each connected node. Stated formally, for an undirected graph $G = (V, E)$, the centrality of a given node $V_i$ is defined as:

$$Cen(V_i) = Indegree(V_i) = \sum_{(V_i, V_j) \in E} w_{ij}$$

where $w_{ij}$ is the weight assigned to the edge connecting the nodes $V_i$ and $V_j$.

Now, as mentioned in §3.2, we use two different weighting functions to indicate the strength of the link between two nodes in the co-occurrence graph. Each of these functions will produce a different type of graph (i.e., the SW function will produce an undirected weighted graph and the FW function will produce an undirected unweighted graph). The precise type of graph determines the method used to calculate the indegree centrality of a node. In an undirected unweighted graph, the centrality of a node will be calculated by a simple count of the number of edges incident upon it, meaning the translation candidate for a given query term with the maximum indegree measure will be selected as the final translation term. By contrast, in an undirected weighted graph, the centrality of a node is calculated using node-node co-occurrence scores represented by edge weightings. Interestingly, this means that the result of determining the centrality of a node using this method will be the same as if we had employed the greedy algorithm discussed in Section 2.1, since the coherence measure (step 2(b) of the algorithm) is identical to the indegree of a node in an undirected weighted graph. This can be demonstrated fairly easily. If we calculate the MI similarity measure between two translation candidates x and y so that

$$sim(x, y) = \begin{cases} 1 & \text{if } MI(x, y) > 0 \\ 0 & \text{otherwise} \end{cases}$$

we will arrive at the same result as if we had calculated the indegree for these nodes within an undirected unweighted graph.
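As a rough illustration of the indegree measure, the Python sketch below sums the weights of the edges incident on each node and then picks, for each query term, the candidate with the largest score. The edge-dictionary format and both function names are assumptions made for this example, and ties are broken by candidate order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def degree_centrality(edges: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Cen(V_i) = sum of the weights of all edges incident on V_i."""
    cen: Dict[str, float] = defaultdict(float)
    for (u, v), w in edges.items():
        # The graph is undirected, so each edge counts for both endpoints.
        cen[u] += w
        cen[v] += w
    return dict(cen)

def select_translations(candidates: List[List[str]],
                        centrality: Dict[str, float]) -> List[str]:
    """For each query term, keep the candidate with the highest centrality."""
    return [max(t_i, key=lambda t: centrality.get(t, 0.0)) for t_i in candidates]
```

When every edge weight is 1 (the FW graph) the score reduces to a plain edge count, and when the weights are similarity scores (the SW graph) it matches the coherence score of the greedy algorithm, provided the threshold removes no pairs.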



3.3.2 Calculating centrality using the random walker model. The next scheme involves centrality calculations based on a random walk around the graph structure. Up to this point, when calculating node indegree, we have applied a totally democratic method whereby each edge is considered as one "vote" contributing to that node's centrality score. However, this scheme has a weakness which is exposed when several unwanted translations "vote" for each other, thereby unhelpfully raising their indegree scorings. Fortunately, this situation can be completely avoided by weighting the "votes" bestowed by each edge according to the centrality of the voting nodes. This approach, which is derived from the hyperlink analysis technique known as PageRank [Brin and Page 1998], is informed by an analysis of a stochastic procedure that involves a random walk around the graph. The rules for this "walk" are as follows: beginning at some arbitrary node A in the undirected graph, we move from node A to a randomly chosen node linked to A. This step is continually repeated. If, at any point, there is a lack of node outlinks making further traversal impossible, we jump to a randomly selected node in the graph and continue. As the random walk proceeds, certain nodes in the graph structure will be visited more often than others, intuitively leading us to the following conclusions:

(1) Nodes which are visited frequently during the random walk are more likely to be important than nodes visited less frequently.

(2) Nodes which are visited frequently during the random walk are more likely to be connected to other frequently visited nodes than nodes which are visited less frequently.

Here is a formal statement of a centrality measure that exploits the random walker model: for the given undirected graph $G = (V, E)$, where $V$ denotes a set of nodes and $E$ denotes a set of edges, for a given $V_i$ let $\{V_i\}_{IN}$ be the set of nodes that point to it and let $\{V_i\}_{OUT}$ be the set of nodes that $V_i$ points to; then the centrality score of $V_i$ is defined as follows:

$$Cen(V_i) = \frac{1 - d}{N} + d \times \sum_{V_j \in \{V_i\}_{IN}} \frac{w_{i,j}}{\sum_{V_k \in \{V_j\}_{OUT}} w_{j,k}} \, Cen(V_j)$$

where d is a damping factor which integrates the probability of jumping from one node to another at random (normally set to 0.85) and N is the total number of nodes in the graph. Starting with an arbitrary value assigned to every node in the graph, this algorithm is guaranteed to converge to within a given threshold (footnote 3).

Footnote 3: This is guaranteed because the Markov chain here is irreducible and aperiodic.

By using matrix notation and viewing the whole process as a Markov chain, we can reformulate these definitions. Let A be the similarity matrix where rows and columns represent the translation candidates and each entry denotes the similarity score between them. Note that the similarity scores between translation candidates for the same query term are initially set to zero. Let z be the centrality vector that corresponds to the stationary distribution of A, and let U be a square matrix with all elements being equal to 1/N. The equation above can be written in the matrix form as

$$z = [(1 - d)U + dA]^{T} z$$

The transition kernel [(1 − d)U + dA] of the resulting Markov chain is a mixture of two kernels U and A. A random walker on this Markov chain chooses one of the adjacent states of the current state with probability d, or jumps to any state in the graph, including the current state, with probability 1 − d.
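The random walker centrality can be approximated by simple power iteration, as in the Python sketch below. This is an illustrative implementation rather than the authors' code: the function name, the edge-dictionary input, the uniform starting vector, the stopping tolerance, and the iteration cap are all assumptions. Nodes with no edges pass their score uniformly to every node, mirroring the "jump to a randomly selected node" rule described above.

```python
from typing import Dict, List, Tuple

def random_walk_centrality(nodes: List[str],
                           edges: Dict[Tuple[str, str], float],
                           d: float = 0.85,
                           tol: float = 1e-8,
                           max_iter: int = 200) -> Dict[str, float]:
    n = len(nodes)
    # Undirected graph: store every edge in both directions.
    nbrs: Dict[str, Dict[str, float]] = {v: {} for v in nodes}
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    out_sum = {v: sum(nbrs[v].values()) for v in nodes}
    cen = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Score held by edgeless nodes is spread evenly over the graph.
        dangling = sum(cen[v] for v in nodes if out_sum[v] == 0.0)
        new = {}
        for v in nodes:
            # Each neighbour u passes on a share proportional to w(u, v).
            flow = sum(w / out_sum[u] * cen[u] for u, w in nbrs[v].items())
            new[v] = (1.0 - d) / n + d * (flow + dangling / n)
        delta = sum(abs(new[v] - cen[v]) for v in nodes)
        cen = new
        if delta < tol:
            break
    return cen
```

Because each step redistributes the existing scores, they continue to sum to one, so the result can be read as the share of time the walker spends at each candidate.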

3.3.3 Calculating centrality using hubs and authorities. A third method for calculating the centrality score of a node in an undirected graph borrows heavily from Kleinberg's HITS algorithm [Kleinberg 1999]. In this scheme, each translation candidate in the graph is assigned a hub score and an authority score. These scores have a recursive and mutually reinforcing relationship, so that:

(1) A strong authority is a node that is pointed at by a high number of nodes with strong hub scores.

(2) A strong hub is a node that points to a high number of nodes with strong authority scores.

Formally, our algorithm defines the authority score for an individual candidate term, $t$, over the whole collection of candidate terms, $T$, as:

$$Authority(t) = \sum_{t' \in T} w(t' \longrightarrow t) \times Hub(t')$$

Thus, the authority score for a candidate term $t$ is calculated as the sum, over all other terms $t'$ in the graph, of the weight on the edge connecting $t'$ to $t$ multiplied by the hub score of $t'$. In an exact mirror of this calculation, we determine the hub score for a candidate term $t$ as the sum, over all other terms $t'$ in the graph, of the weight on the edge connecting $t$ to $t'$ multiplied by the authority score of $t'$:

$$Hub(t) = \sum_{t' \in T} w(t \longrightarrow t') \times Authority(t')$$

These two equations are mutually recursive. However, as with the iterative HITS algorithm [Kleinberg 1999], it can be proven that these equations will converge to score functions Hub* and Authority* (which are nonnegative and not identically zero). When this algorithm has converged, those translation terms with the top hub scores or the top authority scores will be natural choices for providing final translation terms.
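The mutually recursive hub and authority updates can be run as an alternating iteration, sketched below in Python. The function names and the fixed iteration count are hypothetical, and the Euclidean normalisation after each update follows Kleinberg's original formulation rather than anything stated in this section. In an undirected graph the weight from t to t' equals the weight from t' to t, so both updates read from the same neighbour table.

```python
import math
from typing import Dict, List, Tuple

def _normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale the scores to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    return {v: (s / norm if norm > 0.0 else 0.0) for v, s in scores.items()}

def hubs_and_authorities(nodes: List[str],
                         edges: Dict[Tuple[str, str], float],
                         iterations: int = 50):
    nbrs: Dict[str, Dict[str, float]] = {v: {} for v in nodes}
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority(t) = sum over t' of w(t' -> t) * Hub(t')
        auth = _normalise({v: sum(w * hub[u] for u, w in nbrs[v].items())
                           for v in nodes})
        # Hub(t) = sum over t' of w(t -> t') * Authority(t')
        hub = _normalise({v: sum(w * auth[u] for u, w in nbrs[v].items())
                          for v in nodes})
    return hub, auth
```

Normalising after every update keeps the scores bounded without changing their ranking, which is all that matters when choosing the top candidate for each query term.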

This subsection concludes our theoretical discussion of the graph-based approach to candidate term ambiguity. In the next section we discuss another technique designed for CLIR which addresses the difficult resolution of OOV terms.



4. A PATTERN-BASED APPROACH TO OOV TERMS

A high-level summary of our approach to unknown terms is as follows:

- Detect the OOV terms in a query.

- Submit these terms to a search engine and cache the results.

- Apply a set of punctuative and linguistic patterns to the mixed language text of the cached Web pages.

- Extract one or more translation candidates.

- Select a final translation for the OOV using any suitable disambiguation strategy.

These stages are described in greater detail in the following subsections.

4.1 OOV Detection

The first stage in this process, which involves identifying the presence of OOV terms in a source language query, is surprisingly difficult. Any query term not found in a bilingual dictionary is, by definition, out of vocabulary. However, this is by no means the end of the matter. To illustrate this point, assume that we have a three-word query "open source download" which we need to translate. Each of these three query terms can no doubt be translated independently using a bilingual dictionary, as they are very common words. However, the resulting translation, pieced together term by term, is unlikely to provide a meaningful translation of the opening two-term phrase. To make matters worse, even when a phrase is recognized, that phrase may or may not be OOV itself.

Given the above, we decided that our pattern matching process had to begin with a robust phrase identification stage. Therefore, queries are first analyzed using a pair of noun phrase recognizers (footnote 4). If there are no phrases found in the query, then the query is translated term by term using a bilingual dictionary, with any term not successfully translated being labeled as an OOV term. If a phrase is identified in the query, and that phrase cannot be translated by the bilingual dictionary, we search the mixed language Web for possible translations (please refer to the next step for details). If possible translations are found, we treat the text string as an OOV phrase. If no possible translations are found, we decompose the phrase into single terms and attempt a word-by-word translation, again using the bilingual dictionary. Any words not successfully translated at this point are labeled as OOV terms. The pattern matching procedure from this point on is then identical whether we are trying to translate a single OOV term or an OOV phrase.
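The decision procedure described in this subsection can be summarised in a short Python sketch. Everything named here is hypothetical: `detect_oov`, the `phrases` argument (assumed to come from an external noun phrase recognizer), the `dictionary` mapping, and the `web_lookup` callback that stands in for the mixed-language Web search. As a simplification, a word that belongs to a handled phrase is not checked again on its own.

```python
from typing import Callable, Dict, List

def detect_oov(query_terms: List[str],
               phrases: List[List[str]],
               dictionary: Dict[str, List[str]],
               web_lookup: Callable[[str], List[str]]) -> List[str]:
    """Return the OOV units (whole phrases or single terms) in a query."""
    oov: List[str] = []
    covered = set()
    for phrase in phrases:
        text = " ".join(phrase)
        if text in dictionary:
            covered.update(phrase)   # the dictionary translates the phrase
        elif web_lookup(text):
            oov.append(text)         # Web evidence found: an OOV phrase
            covered.update(phrase)
        # Otherwise the phrase is decomposed and handled word by word below.
    for term in query_terms:
        if term not in covered and term not in dictionary:
            oov.append(term)
    return oov

# Toy dictionary; the glosses are illustrative only.
toy_dict = {"open": ["打开"], "source": ["来源"], "download": ["下载"]}
print(detect_oov(["open", "source", "download"], [["open", "source"]],
                 toy_dict, lambda phrase: ["开源"]))
```

In the example query, each word has a dictionary entry, yet the phrase "open source" is still reported as OOV because the phrase itself is missing from the dictionary and the stand-in Web lookup returns a candidate.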

4.2 Obtaining Mixed-Language Text

Footnote 4: Noun phrase recognizers obtained from http://nlp.stanford.edu/software/index.shtml and http://www.gate.ac.uk

In the next step of our pattern-based approach to OOV terms, the unknown term or phrase is submitted to a Web search engine. Our search is configured in such a way as to retrieve documents written in the target translation language. For example, if an unknown English term is to be translated into Chinese, then the search will be configured to only retrieve Chinese Web pages. Configuration of the search is performed using the advanced search criteria provided by most major search engines. The top 100 ranked titles and document surrogates are cached, and the stored text is parsed to remove HTML tags and other non-content symbols, thereby creating a raw text collection.

4.3 Generation of Patterns

The next stage of the process involves generating a set of punctuative and linguistic patterns that we can apply to the mixed language text. The literature on this subject suggests two different ways to generate these patterns. The first approach, as discussed by Hearst, is manual extraction; examples of the text collection are examined by hand for syntactic and linguistic relationships, and the patterns that are detected are recorded [Hearst 1992]. The second method, as exemplified by the work of Serban et al. [2005], is autonomous extraction: the text is exposed to some application capable of automatically detecting the requisite patterns. Either approach (or a combination of both) can be used to generate the patterns required for our OOV translation technique (see section 6.8 for the specific technique we employed in our English-Chinese cross-language experiment).

4.4 The Pattern Set

A typical pattern set will contain a mix of punctuative patterns and linguistic patterns. We will deal with each type of pattern in turn. Concrete examples of each pattern, taken from the English-Chinese cross language retrieval experiment discussed later in this paper (see section 6), are provided for illustration in Figure 2.

4.4.1 Punctuation Patterns. Punctuation patterns exploit the use of punctuation in the cached mixed-language Web pages. There are two different types of punctuation patterns:

Symmetric punctuation patterns. These patterns capture those translations of OOV terms delineated by brackets. This type of pattern includes all of the commonly used bracketing symbols, together with known parenthetic substitutes. Examples 1 and 2 in Figure 2 illustrate the symmetric pattern in an English-Chinese context.

Nonsymmetric punctuation patterns. These patterns capture translation pairs adjacent to single punctuation marks. This type of pattern includes all of the common nonparenthetic symbols that organize written text, like commas, colons, and full stops. Examples 3-5 in Figure 2 illustrate the nonsymmetric punctuation pattern in an English-Chinese context.
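To make the two types of punctuation pattern concrete, the following sketch expresses them as regular expressions. The bracket and separator sets, the use of the CJK Unified Ideographs range to recognize Chinese text, and the function names are our assumptions for illustration; they are not the patterns listed in Figure 2.

```python
import re

CJK = r"[\u4e00-\u9fff]+"        # a run of Chinese characters (assumed range)
OPEN = r"[(\[（【《「]"            # assumed set of opening brackets
CLOSE = r"[)\]）】》」]"           # assumed set of closing brackets
SEP = r"[,:;.，：；。、]"          # assumed set of single punctuation marks


def symmetric_candidates(text, term):
    """Candidates delineated by brackets: 'term (译文)' or '译文 (term)'."""
    t = re.escape(term)
    patterns = [
        rf"{t}\s*{OPEN}\s*({CJK})\s*{CLOSE}",
        rf"({CJK})\s*{OPEN}\s*{t}\s*{CLOSE}",
    ]
    found = []
    for pattern in patterns:
        found.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return found


def nonsymmetric_candidates(text, term):
    """Candidates adjacent to a single punctuation mark: 'term: 译文'."""
    t = re.escape(term)
    patterns = [rf"{t}\s*{SEP}\s*({CJK})", rf"({CJK})\s*{SEP}\s*{t}"]
    found = []
    for pattern in patterns:
        found.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return found
```

Because the Chinese run is matched greedily, a candidate may include surrounding words that are not part of the translation; this is one reason the linguistic patterns of section 4.4.2 are needed.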



Fig. 2. Examples of linguistic and punctuative patterns.

4.4.2 Linguistic Patterns. Linguistic patterns also fall into two general headings:

Eliminator Terms. These terms are similar to the words and phrases found in stop lists, commonly used in corpus linguistics. This means that eliminator terms have a low information value and, if extracted as part of the translation candidate, will almost always decrease the accuracy of the translation. However, one positive aspect of eliminator terms is that they frequently mark the boundaries between the translation pairing and the rest of the source document, and can therefore be used to delineate target translations. Example 4 in Figure 2 provides an illustration of an eliminator term in an English-Chinese pattern.

Discriminator Phrases. These phrases provide an implicit or explicit clue indicating the relationship between a term (or terms) in one language and a translation in another. Clues of this nature are particularly valuable to the OOV translation process. This type of pattern includes all commonly used explanatory phrases employed by authors writing mixed-language pages. Example 5 in Figure 2 provides an illustration of a discriminator phrase within an English-Chinese punctuation pattern.
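The role of the two linguistic pattern types can be illustrated with a short sketch. The eliminator terms and discriminator phrases used in the example calls are hypothetical stand-ins chosen for illustration, not the entries of our pattern set, and the function names are likewise assumptions.

```python
import re

CJK = r"[\u4e00-\u9fff]+"  # a run of Chinese characters (assumed range)


def trim_candidate(candidate, eliminators, side):
    """Use eliminator terms as boundaries of a translation candidate.

    side="left":  the candidate precedes the OOV term, so keep the text
                  after the last eliminator.
    side="right": the candidate follows the OOV term, so keep the text
                  before the first eliminator.
    """
    if side == "left":
        cut = max((candidate.rfind(e) + len(e) for e in eliminators if e in candidate),
                  default=0)
        return candidate[cut:]
    cut = min((candidate.find(e) for e in eliminators if e in candidate),
              default=len(candidate))
    return candidate[:cut]


def discriminator_candidates(text, term, phrases):
    """Extract the Chinese run that follows 'term <discriminator phrase>'."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    pattern = rf"{re.escape(term)}\s*(?:{alternatives})\s*({CJK})"
    return re.findall(pattern, text, flags=re.IGNORECASE)
```

For example, `trim_candidate("这是开放源码", ["是", "的"], "left")` discards the leading words up to and including the eliminator, leaving only the candidate adjacent to the OOV term.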

4.5 Applying the Pattern Set

In the next step the raw text collection described in section 4.2 is subjected to the pattern set in order to extract translation candidates. The output of this pattern matching technique is a set of likely translation candidates for the OOV term. Although a final translation for the OOV term can be extracted from this set using any reliable disambiguation strategy [Zhou et al. 2007], we chose to perform final candidate selection by joining our pattern-based OOV translation technique to the graph-based disambiguation protocol discussed in section 3, thereby creating the hybrid query translation technique discussed below.

5. A HYBRID QUERY TRANSLATION TECHNIQUE

Our hybrid query translation technique works by linking the two methods discussed above so that the undifferentiated output of the OOV pattern matching process (i.e., the set of all translation candidates derived using patterns) constitutes the input to the graph-based disambiguation algorithm. A summary of the process chain is as follows:

(1) First, the target query is translated using a naive bilingual dictionary, a process which creates a semi-translated query set. In this set one query term may have several possible translation candidates.

(2) Next, the OOV terms and phrases are passed to our pattern matcher, which adds additional translation candidates derived using linguistic/punctuative patterns to the query set.

(3) Next, this entire set of translation candidates is subjected to the graph-based disambiguation protocol described earlier in this article, using co-occurrence scores obtained from the target document corpus.

(4) Finally, the fully translated query is passed to the CLIR engine.

An overview of the hybrid translation technique is shown in Figure 3.
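The process chain can also be summarized as a short Python sketch. The three callables are placeholders for the components described in sections 3 and 4; their names and signatures are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def hybrid_translate(
    query_units: List[str],
    dictionary: Dict[str, List[str]],
    pattern_matcher: Callable[[str], List[str]],
    disambiguate: Callable[[Dict[str, List[str]]], Dict[str, str]],
) -> Dict[str, str]:
    """Steps (1)-(3) of the process chain; step (4) is left to the caller."""
    candidates: Dict[str, List[str]] = {}
    for unit in query_units:
        translations = dictionary.get(unit, [])   # (1) dictionary lookup
        if not translations:
            translations = pattern_matcher(unit)  # (2) OOV pattern matching
        candidates[unit] = list(translations)
    return disambiguate(candidates)               # (3) graph-based selection
```

The returned mapping of query units to selected translations is what would be passed to the CLIR engine in step (4).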

6. EVALUATION

In the following section we describe a series of cross language retrieval experiments designed to evaluate (both in isolation and in combination) the two translation techniques described above. Our evaluation will focus on the following thematically related questions:

(1) Is the proposed graph-based method for candidate term disambiguation an improvement over simpler techniques capable of resolving translation ambiguities?

(2) How accurate, when compared with manual translations, are the translations of OOV terms derived from pattern matching?

(3) Is the combination of these two techniques (in the form of a hybridized query translation method) effective when used in the context of an English-Chinese CLIR experiment?

6.1 Test Environment

The document collections and topic sets used in our experiment were provided by the 6th NTCIR workshop (2006/7). This workshop was organized into two separate stages. In STAGE1 participants were invited to submit findings related to bilingual ad hoc search tasks, using a document collection consisting of newspaper articles published between 2000 and 2001 (see Table I). STAGE2 provided an opportunity for cross-collection analysis using older collections (NTCIR-3, NTCIR-4, and NTCIR-5), and functioned as a check on the robustness of techniques submitted in STAGE1. STAGE1 reused the topic sets previously published in relation to NTCIR-3 and NTCIR-4, while STAGE2 used the original topic sets for the older collections. All of the documents in the various collections were written in traditional Chinese and used the BIG5 character encoding method.

Fig. 3. Overview of the hybrid translation technique.

6.2 Translation Resources

The English-Chinese bilingual dictionary used in our experiments was provided by the Linguistic Data Consortium.5 This dictionary contains exactly 110,834 English words with the accompanying Chinese translations. It was compiled using various resources including LDC-internal components and the World Wide Web, and provides a useful exemplar of the type of bilingual dictionaries commonly used in CLIR.

5 LDC: http://www.ldc.upenn.edu/

Table I. Overview of the NTCIR Test Collections

| Year | Workshop | Source | Number of Docs |
| --- | --- | --- | --- |
| 1998-1999 | NTCIR-3, 4 | China Times; China Times Express; Commercial Times; China Daily News; Central and Daily News | 132,172 |
| | | United Daily News | 249,203 |
| | | Total | 381,375 |
| 2000-2001 | NTCIR-5, 6 | United Daily News | 466,564 |
| | | United Express | 92,296 |
| | | Ming Hseng News | 169,739 |
| | | Economic Daily News | 172,847 |
| | | Total | 901,446 |

6.3 Text Preprocessing

The most strenuous part of the text preprocessing stage involved character encoding issues. All of the Chinese documents in our test collection were encoded using BIG5, which was designed to represent traditional Chinese characters. However, our bilingual dictionary was encoded using an alternative scheme, GB2312, which is used to represent simplified Chinese characters. To provide a unified translation and retrieval environment, we were forced to convert the encoding of all of the documents in the test collection and each entry in the bilingual dictionary to a third encoding, Unicode UTF-8. This was accomplished using an encoding converter written in the Java programming language.6 Following conversion, all of the Chinese documents were processed using a segmentation tool7 and the English queries were subjected to the Krovetz stemmer and 571 word stop list.8 Finally, the test collection was indexed using the Lemur toolkit.9
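For readers who wish to reproduce the conversion step, the re-encoding itself is straightforward. The sketch below uses Python's built-in codecs rather than the Java converter used in our experiments, and the function name is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a legacy Chinese encoding to UTF-8.

    Note: this changes only the byte representation. It does not map
    simplified characters to traditional ones or vice versa.
    """
    return raw.decode(source_encoding).encode("utf-8")


# A BIG5 document fragment and a GB2312 dictionary entry (sample strings).
document_bytes = "資訊檢索".encode("big5")
dictionary_bytes = "信息检索".encode("gb2312")

document_utf8 = to_utf8(document_bytes, "big5")
dictionary_utf8 = to_utf8(dictionary_bytes, "gb2312")
```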

6.4 Description of the Experimental Retrieval Systems

6 Available from http://www.mandarintools.com
7 Available from http://www.mandarintools.com
8 ftp://ftp.cs.cornell.edu/pub/smart/
9 http://www.lemurproject.org

In order to investigate the effectiveness of our translation techniques and to study the effect of combining them within a single process, we designed a set of eleven experimental retrieval systems. A description of these systems is as follows:

MONO (monolingual). This system retrieved documents from the test collection using manually translated versions of the Chinese queries provided by the various NTCIR organizing committees. The performance of a monolingual retrieval system such as this has always been considered an unreachable “upper bound” for CLIR, as the process of automatic translation is inherently noisy.

ALLTRANS (all translations). Here we retrieved documents from the test collection using all the translations provided by the bilingual dictionary for each query term.

FIRSTONE (first translations). This system retrieved documents from the test collection using only the first translation suggested for each query term by the bilingual dictionary. Due to the way in which bilingual dictionaries are usually constructed, the first translation for any word generally equates to the most frequent translation for that term according to the World Wide Web.

COM (basic co-occurrence). In this set-up, the translations for each query term were selected using the basic greedy co-occurrence algorithm described in section 2.1. We used the target document collection to calculate the co-occurrence scorings.

COMUW (unweighted co-occurrence). Identical to above, but here the translations were selected using the unweighted co-occurrence measure described in section 3.3.1 rather than the basic co-occurrence model.

DCOM (decaying co-occurrence). Here the translations for each query term were selected using the decaying co-occurrence algorithm described in section 2.1.

BFMT (BabelFish MT). Here we used the Systran-powered machine translation system known as Babelfish10 to translate the English queries into Chinese (for previous use see Kraaij [2001]). This system was included to provide a strong baseline for the remaining runs.

GW (weighted graph analysis). Here we retrieved documents from the document collection using query translations suggested by our analysis of a weighted co-occurrence graph (i.e., we used the SW weighting function). Edges of the graph were weighted using co-occurrence scores derived using the greedy algorithm in section 2.1. Within this framework there are still three permutations in choosing the suitable centrality scores (see section 6.6).

GUW (unweighted graph analysis). As above, we retrieved documents from the collection using query translations suggested by our analysis of the co-occurrence graph, only this time we used an unweighted graph (i.e., we used the FW weighting function).

10 http://babelfish.altavista.com/



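Table II. Overview of the 33 Experimental Retrieval Runs

| RunID | Weighted | Disambiguation (Co-Occurrence) | Disambiguation (Graph) | OOV Translation |
| --- | --- | --- | --- | --- |
| T-MONO | | | | |
| T-ALLTRANS | | | | |
| T-FIRSTONE | | | | |
| T-COM | ⋆ | ⋆ | | |
| T-COMUW | | ⋆ | | |
| T-DCOM | ⋆ | ⋆ | | |
| T-BFMT | | | | |
| T-GW | ⋆ | | ⋆ | |
| T-GUW | | | ⋆ | |
| T-GW+OOV | ⋆ | | ⋆ | ⋆ |
| T-GUW+OOV | | | ⋆ | ⋆ |
| D-MONO | | | | |
| D-ALLTRANS | | | | |
| D-FIRSTONE | | | | |
| D-COM | ⋆ | ⋆ | | |
| D-COMUW | | ⋆ | | |
| D-DCOM | ⋆ | ⋆ | | |
| D-BFMT | | | | |
| D-GW | ⋆ | | ⋆ | |
| D-GUW | | | ⋆ | |
| D-GW+OOV | ⋆ | | ⋆ | ⋆ |
| D-GUW+OOV | | | ⋆ | ⋆ |
| TD-MONO | | | | |
| TD-ALLTRANS | | | | |
| TD-FIRSTONE | | | | |
| TD-COM | ⋆ | ⋆ | | |
| TD-COMUW | | ⋆ | | |
| TD-DCOM | ⋆ | ⋆ | | |
| TD-BFMT | | | | |
| TD-GW | ⋆ | | ⋆ | |
| TD-GUW | | | ⋆ | |
| TD-GW+OOV | ⋆ | | ⋆ | ⋆ |
| TD-GUW+OOV | | | ⋆ | ⋆ |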

GW+OOV (weighted graph analysis with unknown term translation). As GW, except that the query terms and phrases that were not recognized (i.e., OOV) were first sent to the pattern matcher for translation and then passed to our disambiguation routine.

GUW+OOV (unweighted graph analysis with unknown term translation). As the preceding, only this time we used the unweighted scheme.

Within this framework of 11 retrieval systems there were further permutations related to the choice of topic fields. We labeled retrieval runs using the Title topic field as T-runs. Likewise, we labeled retrieval runs using the Description field as D-runs. Finally, we identified those runs which used a combination of both fields as TD-runs. In total, this yielded 33 individual retrieval runs (i.e., 11 systems × three field-based permutations), which are summarized in Table II.



6.5 Retrieval Function

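All of the retrieval systems ran translated queries against the document collections using the Lemur toolkit and the BM25 retrieval function. In this function, the relevance of a document $d$ to a given query $q$ is defined as:

$$\sum_{t \in q} w_t \cdot \frac{(k_1 + 1)\, tf_d(t)}{K + tf_d(t)} \cdot \frac{(k_3 + 1)\, tf_q(t)}{k_3 + tf_q(t)}$$

$$w_t = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}$$

$$K = k_1 \times \left( (1 - b) + b \times \frac{|d|}{\mathrm{avg}|d|} \right)$$

where $w_t$ is the Robertson/Sparck Jones weight of $t$, $tf_d(t)$ and $tf_q(t)$ are the frequencies of term $t$ in the document and in the query, $N$ is the number of documents in the collection, $df(t)$ is the number of documents containing $t$, $k_1$, $b$, and $k_3$ are parameters (set to 1.2, 0.75, and 7 respectively), $|d|$ represents document length and $\mathrm{avg}|d|$ stands for average document length.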
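The retrieval function can be written out directly. The following is a minimal sketch and not the Lemur implementation: it assumes tokenized documents and queries, a precomputed document-frequency table `df`, and the standard BM25 form in which the final factor uses the query term frequency. The function and argument names are hypothetical.

```python
import math
from collections import Counter


def bm25_score(query, doc, df, n_docs, avg_len, k1=1.2, b=0.75, k3=7.0):
    """Score one tokenized document against one tokenized query with BM25."""
    tf_d = Counter(doc)     # term frequencies in the document
    tf_q = Counter(query)   # term frequencies in the query
    K = k1 * ((1 - b) + b * len(doc) / avg_len)
    score = 0.0
    for term, qtf in tf_q.items():
        if tf_d[term] == 0 or term not in df:
            continue        # the term contributes nothing to this document
        w = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        doc_part = (k1 + 1) * tf_d[term] / (K + tf_d[term])
        query_part = (k3 + 1) * qtf / (k3 + qtf)
        score += w * doc_part * query_part
    return score
```

With a document of average length, a single matching term with frequency one in both query and document reduces the score to the Robertson/Sparck Jones weight alone, which is a convenient sanity check.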

6.6 Centrality Measures

To demonstrate the interchangeability of the various centrality measurements described in section 3.3, we varied the centrality calculations from collection to collection. The results using the NTCIR-6 and NTCIR-3 document collections respectively were achieved using the PageRank-derived random walker model (denoted as PR). A series of retrieval runs using the NTCIR-4 collection shows the retrieval performance when a centrality measurement based on node authority scoring was employed (denoted as AUTH). The experimental runs relating to the NTCIR-5 collection list the results obtained when node hub scorings are used to determine node centrality (denoted as HUB). A comparative analysis and overview of the performance of the various centrality measurements can be found in section 7.4.
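As a reminder of how a random walk centrality of this kind is computed, the sketch below applies a generic weighted PageRank power iteration to a small undirected co-occurrence graph. It is not the exact formulation of section 3.3: the damping factor of 0.85, the fixed iteration count, and the nested-dictionary graph representation are assumptions made for illustration.

```python
def weighted_pagerank(weights, damping=0.85, iterations=100):
    """PageRank over an undirected graph given as {node: {neighbour: weight}}.

    Each edge must be listed in both directions. Nodes without edges are
    not given special treatment in this sketch.
    """
    nodes = sorted(weights)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    strength = {node: sum(weights[node].values()) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Probability mass arriving from each neighbour, in proportion
            # to the weight of the connecting edge.
            incoming = sum(
                rank[m] * weights[m][node] / strength[m]
                for m in nodes
                if node in weights[m] and strength[m] > 0
            )
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank
```

In a translation graph, the candidate with the highest score among the alternatives for a given query term would be the one selected.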

6.7 Parameter Selection

One important part of the experiment involved establishing the optimal values for the various parameters discussed above. The following section describes how we selected values for α (a component of the decaying co-occurrence model) and β (an important element in our weighting functions).

6.7.1 Selecting the decaying rate. Previous work on this topic suggests that the decaying rate parameter (denoted α) in the decaying co-occurrence model can be set at any value between 0.2 and 1.0 with broadly similar results, although a setting of 0.8 will marginally outperform the other values in this range [Gao et al. 2002]. To verify these findings, we ran a number of DCOM runs against the NTCIR-6 test collection, with a spread of settings for the α parameter. As illustrated by Figure 4, the highest mean average precision scorings for the relaxed and rigid runs were obtained when α = 0.2, with a sharp drop in MAP thereafter. As a result of this test, we decided to set the value of α = 0.2 for all subsequent experiments.



Fig. 4. The impact of varying the decaying parameter α.

Fig. 5. The impact of varying the similarity threshold β.

6.7.2 Setting the similarity threshold. The similarity threshold β is used in our weighting functions to exclude word pairings with low co-occurrence scorings. We determined the correct value for this parameter by running a number of GW runs against the NTCIR-6 test collection, with a spread of settings for the β parameter. As illustrated by Figure 5, threshold values below zero had no effect on MAP scores in either the rigid or relaxed runs, and the best value for this parameter in terms of mean average precision was β = 0.0. As a result of this test, we decided to set the value of β = 0.0 for all subsequent experiments.



Fig. 6. The 1R-based pattern generation algorithm.

6.8 Generating the Pattern Set

As mentioned above, our OOV translation technique relies on the application of punctuative and linguistic patterns. These patterns can be generated in a number of different ways. For this particular experiment, we used a modified version of a simple machine learning algorithm known as 1R to infer the rudimentary patterns [Witten and Frank 2005]. This computationally inexpensive algorithm generates a one-level decision tree expressed in the form of a set of rules that all test one particular input attribute (translation pairs in our case). Input to the algorithm was a list of English names, locations, and technical terms accompanied by Chinese translations.11 The output of the algorithm was a set of pattern candidates. These candidates were sorted and verified by a bilingual individual with a background in linguistics. The total time expended on this verification stage was one hour. A summary of the modified 1R algorithm is provided in Figure 6. For the purposes of pattern acquisition, we used the Google search engine throughout.
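For orientation, the textbook form of 1R can be sketched as below. This is the standard algorithm, not the modified variant summarized in Figure 6, and the attribute names used in the example are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Standard 1R: choose the single attribute whose value-to-class rules
    make the fewest errors on the training data.

    rows   -- list of dicts mapping attribute name to value
    labels -- list of class labels, one per row
    Returns (attribute, rules, errors).
    """
    best = None
    for attr in rows[0]:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # One rule per value: predict the majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical training data: does a text fragment contain a translation pair?
rows = [
    {"separator": "bracket", "length": "short"},
    {"separator": "bracket", "length": "long"},
    {"separator": "comma", "length": "short"},
    {"separator": "comma", "length": "long"},
]
labels = ["pair", "pair", "noise", "noise"]
attribute, rules, errors = one_r(rows, labels)
```

In this toy example the rules all test the `separator` attribute, which is the sense in which 1R yields a one-level decision tree.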

7. RESULTS AND DISCUSSION

The following section is divided into three parts. In the first part, we rationalize the number of experimental systems under consideration using poor initial performance as grounds for exclusion. In the second part we evaluate the effectiveness of the remaining systems using the NTCIR-6 document collection. In the third part we extend this evaluation into the realm of cross-collection analysis, using earlier collections NTCIR-3, NTCIR-4, and NTCIR-5.

The relevance judgments used in this section were all supplied by the NTCIR organizing committee. These judgments provide two different thresholds of relevance: documents which are strictly relevant to a query (i.e., rigid relevance), and documents which are likely to be relevant to a query (i.e., relaxed relevance). As shown in the results below, we used both measures in our evaluation when reporting NTCIR-6 results, along with mean average precision (MAP), R-precision (R-Prec), and precision at ten documents (P@10). The results for cross-collection analysis will concentrate on rigid relevance criteria.
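For reference, the three effectiveness measures can be computed from a ranked result list and a set of relevant document identifiers as in the sketch below. The function names are hypothetical; MAP is the mean of `average_precision` over all topics.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(doc_id in relevant for doc_id in ranked[:r]) / r if r else 0.0


def precision_at(ranked, relevant, k=10):
    """Precision after the top k documents."""
    return sum(doc_id in relevant for doc_id in ranked[:k]) / k


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Rigid and relaxed scores differ only in which documents are placed in the `relevant` set.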

11http://www.bigyuwen.com/www/jstd/2005-12-27/1135633115d89918.html



Fig. 7. The decaying co-occurrence model vs. the basic co-occurrence model.

The Wilcoxon signed-rank test with 95% confidence level was also employed to test the statistical significance of the different runs. Bold figures in the results tables indicate a statistically significant difference in the retrieval performance of the labeled run.

7.1 Rationalization of Experimental Systems

In the previous section we introduced a large number of experimental retrieval systems. Early experimentation suggested that this rather exhaustive list could be shortened considerably by removing those systems which performed poorly when compared with their immediate siblings. Working on this premise, in this stage of the experiment we were able to remove four experimental systems from further consideration. The following subsections will provide a rationale for each deletion.

7.1.1 The decaying co-occurrence model (DCOM). In a somewhat surprising early result, the performance of retrieval runs using a decaying co-occurrence model (DCOM) was largely inferior to that of runs using the basic co-occurrence model (COM). The diagram in Figure 7, which compares mean average precision scorings obtained for a variety of title and description runs using the NTCIR-6 collection, indicates that COM outperforms DCOM in five out of six cases. Clearly, there is some disagreement here with the work of Gao et al. [2002], who observed an increase in retrieval effectiveness when a decaying factor was introduced to the calculation of mutual information. Although the precise reason for this disagreement is not known at this time, one possible explanation might relate to the various test collections under scrutiny. Gao’s work centered on the TREC test collections, while this study uses materials from an NTCIR corpus. Whatever the underlying reasons, this result led to our first performance-based deletion. We discontinued our experimentation with the DCOM retrieval system at this point.

7.1.2 The unweighted co-occurrence model (COMUW). In section 3.3.1 we demonstrated that the basic co-occurrence model was equivalent to the weighted indegree measure of centrality in a graph generated using co-occurrence data. We also proposed another measure of centrality using the unweighted indegree measure. We performed an early experiment testing the differences between these two measures, with a legitimate expectation that the weighted variant COM would outperform COMUW. Our expectations were broadly satisfied (see Figure 8), although we did record an unusually strong performance by COMUW when using longer NTCIR-6 queries. However, as this anomaly was not statistically significant (according to the Wilcoxon signed-rank test with 95% confidence level) and did not persist into cross-collection analysis, we eventually decided to exclude COMUW from further consideration.

Fig. 8. The basic co-occurrence model vs. the unweighted co-occurrence model.

7.1.3 The unweighted graph model (GUW, GUW+OOV). Our final exclusion related to those systems employing an unweighted graph. As revealed in Figure 9, retrieval runs employing a weighted graph easily outstripped the counterpart unweighted model, irrespective of the centrality measurement or the type of relevance criteria applied. This seems to confirm that edge weighting plays an important role when using graph-based analysis to select a final translation target. For this reason we decided to exclude all experimental systems that used an unweighted graph from the main results section.

7.2 Main Results (NTCIR-6)

Our main experimental results, which describe the performance of the remaining seven experimental systems on the NTCIR-6 document collection, are shown in Tables III and IV. As illustrated by the data, document retrieval with no disambiguation of the candidate translations (ALLTRANS) was the lowest performer in terms of mean average precision in all but one case (the D-runs under relaxed relevance, where FIRSTONE scored lower). This result was not surprising and merely confirms the need for an efficient process for resolving translation ambiguities. The English-Chinese dictionary provided by LDC contains a large number of translation alternatives, thereby creating greater scope for ambiguity and translation error. With that one exception, using the first translation offered by the bilingual dictionary (FIRSTONE) led to an improvement in retrieval effectiveness.



Fig. 9. The weighted graph model vs. the unweighted graph model.

When the final translation for each query term was selected using a basic co-occurrence model (COM), retrieval effectiveness always exceeded ALLTRANS and FIRSTONE.12 There was a noticeable improvement in retrieval effectiveness when processing longer queries (i.e., D-runs), but a far more modest increase for shorter queries (i.e., T-runs).

The results obtained with the machine translation system were somewhat equivocal. BFMT fared well against the basic co-occurrence model in the Title field and combined field runs, but actually scored worse than COM when the Description field was used.

The empirical results obtained using our graph-based method in isolation were very promising. Pleasingly, the weighted graph GW outperformed the basic co-occurrence model and the machine translation method in the majority of retrieval runs, with the best scores relating to longer queries. It is likely that this bias toward longer queries is an artifact of the graph-based disambiguation approach, with a higher number of query terms having a positive effect via the amount of global information available during processing.

12 This result is inconsistent with Liu et al. [2005], who observed the opposite trend in the context of a retrieval experiment using a TREC collection.



Table III. Results for Various Retrieval Runs Using NTCIR-6 (Relaxed Relevance)

| RunID | MAP | R-Prec | P@10 | %Mono | Improvement Over BFMT | Improvement Over COM |
| --- | --- | --- | --- | --- | --- | --- |
| T-MONO | 0.2497 | 0.2911 | 0.39 | 100.00% | N/A | N/A |
| T-ALLTRANS | 0.0789 | 0.0957 | 0.106 | 31.60% | N/A | N/A |
| T-FIRSTONE | 0.0953 | 0.1219 | 0.16 | 38.17% | N/A | N/A |
| T-COM | 0.1075 | 0.1288 | 0.18 | 43.05% | N/A | N/A |
| T-BFMT | 0.1142 | 0.1361 | 0.178 | 45.73% | N/A | 6.23% |
| T-GW | 0.1022 | 0.1295 | 0.174 | 40.93% | −10.51% | −4.93% |
| T-GW+OOV | 0.1562 | 0.1843 | 0.246 | 62.56% | 36.78% | 45.30% |
| D-MONO | 0.1995 | 0.2565 | 0.372 | 100.00% | N/A | N/A |
| D-ALLTRANS | 0.0697 | 0.0866 | 0.104 | 34.94% | N/A | N/A |
| D-FIRSTONE | 0.0589 | 0.0712 | 0.096 | 29.52% | N/A | N/A |
| D-COM | 0.0991 | 0.1189 | 0.178 | 49.67% | N/A | N/A |
| D-BFMT | 0.0971 | 0.1295 | 0.166 | 48.67% | N/A | −2.02% |
| D-GW | 0.1186 | 0.1481 | 0.194 | 59.45% | 22.14% | 19.68% |
| D-GW+OOV | 0.1712 | 0.1964 | 0.26 | 85.81% | 76.31% | 72.75% |
| TD-MONO | 0.2565 | 0.3059 | 0.444 | 100.00% | N/A | N/A |
| TD-ALLTRANS | 0.0879 | 0.1077 | 0.152 | 34.27% | N/A | N/A |
| TD-FIRSTONE | 0.0884 | 0.1086 | 0.166 | 34.46% | N/A | N/A |
| TD-COM | 0.1016 | 0.1322 | 0.182 | 39.61% | N/A | N/A |
| TD-BFMT | 0.1103 | 0.1395 | 0.212 | 43.00% | N/A | 8.56% |
| TD-GW | 0.1412 | 0.1704 | 0.232 | 55.05% | 28.01% | 38.98% |
| TD-GW+OOV | 0.1785 | 0.2098 | 0.268 | 69.59% | 61.83% | 75.69% |

The results obtained when our two translation techniques were combined were excellent. Runs employing the successful weighted co-occurrence graph in partnership with OOV pattern matching (GW+OOV) produced the highest nonmonolingual MAP score across the board, touching a creditable 88.59% of monolingual performance and recording statistically significant improvements over the COM and BFMT baseline systems in every single run. An illustration of these findings is supplied in Figure 10, which gathers together recall-precision plots for the four leading retrieval systems.
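The derived columns in Tables III and IV follow directly from the MAP scores. As a worked check, the short sketch below recomputes the %Mono and improvement figures for the T-GW+OOV run under relaxed relevance, using values copied from Table III. The function names are hypothetical.

```python
def percent_of_mono(run_map: float, mono_map: float) -> float:
    """MAP of a run expressed as a percentage of the monolingual MAP."""
    return round(100.0 * run_map / mono_map, 2)


def improvement_over(run_map: float, baseline_map: float) -> float:
    """Relative change in MAP with respect to a baseline, in percent."""
    return round(100.0 * (run_map - baseline_map) / baseline_map, 2)


# MAP values for the T-runs in Table III (relaxed relevance).
t_mono, t_com, t_bfmt, t_gw_oov = 0.2497, 0.1075, 0.1142, 0.1562

mono_share = percent_of_mono(t_gw_oov, t_mono)    # reported as 62.56%
over_bfmt = improvement_over(t_gw_oov, t_bfmt)    # reported as 36.78%
over_com = improvement_over(t_gw_oov, t_com)      # reported as 45.30%
```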

7.3 Cross Collection Analysis (NTCIR-3, NTCIR-4, NTCIR-5)

Table V shows the results obtained when we repeated all of the retrieval runs described above using document collections and topic sets released by NTCIR in previous years. As shown in the table, our hybrid translation technique (GW+OOV) continued to perform well, realizing between 60% and 80% of monolingual retrieval performance. While these statistics may not seem particularly impressive when placed alongside recent work using the TREC document collections (where cross-language retrieval performance in excess of monolingual performance has been reported [Gao and Nie 2006]), this type of intercollection comparison is ultimately misleading. In the context of the NTCIR English-Chinese document collections, the norm for monolingual performance is much lower than the TREC benchmark, currently within the range of 40% to 65% (see Table VI for a collection-by-collection comparison of the leading systems in NTCIR-6). At the time of writing, the NTCIR collections seem to offer a more challenging retrieval environment than their TREC counterparts, which are constructed in such a way that even a rudimentary disambiguation strategy like COM can deliver 90% of monolingual performance [Gao and Nie 2006]. Therefore, we are very pleased with the overall performance of our hybrid technique, which has achieved some of the highest scores with regard to monolingual performance ever recorded on the NTCIR collections without even resorting to query expansion (used by many of our fellow competitors [Kwok and Dinstl 2007; Wu et al. 2007]).

Table IV. Results for Various Retrieval Runs Using NTCIR-6 (Rigid Relevance)

| RunID | MAP | R-Prec | P@10 | %Mono | Improvement Over BFMT | Improvement Over COM |
| --- | --- | --- | --- | --- | --- | --- |
| T-MONO | 0.1772 | 0.2169 | 0.24 | 100.00% | N/A | N/A |
| T-ALLTRANS | 0.0474 | 0.0629 | 0.05 | 26.75% | N/A | N/A |
| T-FIRSTONE | 0.0756 | 0.0967 | 0.098 | 42.66% | N/A | N/A |
| T-COM | 0.0797 | 0.0943 | 0.108 | 44.98% | N/A | N/A |
| T-BFMT | 0.0827 | 0.0997 | 0.124 | 46.67% | N/A | 3.76% |
| T-GW | 0.0768 | 0.1021 | 0.1 | 43.34% | −7.13% | −3.64% |
| T-GW+OOV | 0.1224 | 0.1484 | 0.16 | 69.07% | 48.00% | 53.58% |
| D-MONO | 0.1455 | 0.1986 | 0.222 | 100.00% | N/A | N/A |
| D-ALLTRANS | 0.0452 | 0.0561 | 0.06 | 31.07% | N/A | N/A |
| D-FIRSTONE | 0.0492 | 0.0591 | 0.064 | 33.81% | N/A | N/A |
| D-COM | 0.0843 | 0.1017 | 0.13 | 57.94% | N/A | N/A |
| D-BFMT | 0.0731 | 0.0951 | 0.112 | 50.24% | N/A | −13.29% |
| D-GW | 0.0849 | 0.1095 | 0.124 | 58.35% | 16.14% | 0.71% |
| D-GW+OOV | 0.1289 | 0.1514 | 0.18 | 88.59% | 76.33% | 52.91% |
| TD-MONO | 0.1826 | 0.2394 | 0.26 | 100.00% | N/A | N/A |
| TD-ALLTRANS | 0.0605 | 0.0769 | 0.092 | 33.13% | N/A | N/A |
| TD-FIRSTONE | 0.074 | 0.0963 | 0.11 | 40.53% | N/A | N/A |
| TD-COM | 0.0817 | 0.105 | 0.122 | 44.74% | N/A | N/A |
| TD-BFMT | 0.0873 | 0.1031 | 0.136 | 47.81% | N/A | 6.85% |
| TD-GW | 0.0983 | 0.1299 | 0.138 | 53.83% | 12.60% | 20.32% |
| TD-GW+OOV | 0.1249 | 0.1542 | 0.17 | 68.40% | 43.07% | 52.88% |

Moving on, we now consider the performance of GW+OOV with respect to the non-monolingual runs. As illustrated by Table VII and Figure 11, GW+OOV exceeded the COM and BFMT baselines by a statistically significant margin in every single test run, irrespective of the centrality measurement we applied. We take this cross-collection performance as a whole to be favorable, indicating the stability of our hybrid query translation technique.

Scrutiny of the remaining results reveals a number of interesting anomalies. On several occasions, usually during retrieval runs addressing the Title field, the performance of ALLTRANS approached that of FIRSTONE, even exceeding it in the T-runs on the NTCIR-3 and NTCIR-5 document collections (see Table V). We believe that the influential factor at work in this case is the short length of the queries. This result merely confirms the intuitive assumption that ambiguity has a positive relationship with the number of query terms under consideration (i.e., retrieval runs using longer queries will tend to perform worse than shorter queries).

A second anomaly relates to the basic co-occurrence model, COM, which occasionally performs worse than ALLTRANS or FIRSTONE. Again, this artifact seems to be related to query length. It may be that short queries do not provide sufficient context for the successful operation of a technique based on simple term-term co-occurrence. Luckily, graph-based methods using more sophisticated measures of centrality seem to have no such limitations.

Fig. 10. Comparison of various retrieval runs using the NTCIR-6 collection.

7.4 Comparison of the Various Centrality Measures

Table VIII shows a side-by-side comparison of the various centrality measurements described in section 3.3 in the context of retrieval runs carried out on the NTCIR-6 document collection. As we anticipated, the random walk centrality measure outscored the measurements based on either hubs or authorities in the vast majority of cases, while the difference between the MAP scorings for centrality measure HUB and centrality measure AUTH was fairly marginal. This is probably related to the undirected nature of the co-occurrence graph, which unlike its directed cousin will tend towards node equality in this sort of analysis.

Table V. Results for Various Retrieval Runs Using NTCIR-3, 4, 5 Test Collections

| RunID | NTCIR-3 MAP | NTCIR-3 %Mono | NTCIR-4 MAP | NTCIR-4 %Mono | NTCIR-5 MAP | NTCIR-5 %Mono |
| --- | --- | --- | --- | --- | --- | --- |
| T-MONO | 0.1995 | N/A | 0.1728 | N/A | 0.2923 | N/A |
| T-ALLTRANS | 0.0648 | 32.48% | 0.0506 | 29.28% | 0.0882 | 30.17% |
| T-FIRSTONE | 0.0594 | 29.77% | 0.0551 | 31.89% | 0.0830 | 28.40% |
| T-COM | 0.0596 | 29.87% | 0.0755 | 43.69% | 0.1125 | 38.49% |
| T-BFMT | 0.0772 | 38.70% | 0.0869 | 50.29% | 0.1046 | 35.79% |
| T-GW | 0.0687 | 34.44% | 0.0781 | 45.20% | 0.1368 | 46.80% |
| T-GW+OOV | 0.1298 | 65.06% | 0.1301 | 75.29% | 0.1846 | 63.15% |
| D-MONO | 0.2510 | N/A | 0.1977 | N/A | 0.2976 | N/A |
| D-ALLTRANS | 0.0276 | 11.00% | 0.0390 | 19.73% | 0.0825 | 27.72% |
| D-FIRSTONE | 0.0531 | 21.16% | 0.0581 | 29.39% | 0.1080 | 36.29% |
| D-COM | 0.0718 | 28.61% | 0.0753 | 38.09% | 0.1285 | 43.18% |
| D-BFMT | 0.0930 | 37.05% | 0.0741 | 37.48% | 0.1010 | 33.94% |
| D-GW | 0.0936 | 37.29% | 0.0811 | 41.02% | 0.1155 | 38.81% |
| D-GW+OOV | 0.1542 | 61.43% | 0.1327 | 67.12% | 0.1842 | 61.90% |
| TD-MONO | 0.2111 | N/A | 0.2150 | N/A | 0.3182 | N/A |
| TD-ALLTRANS | 0.0494 | 23.40% | 0.0517 | 24.05% | 0.1193 | 37.49% |
| TD-FIRSTONE | 0.0674 | 31.93% | 0.0642 | 29.86% | 0.1221 | 38.37% |
| TD-COM | 0.0780 | 36.95% | 0.0858 | 39.91% | 0.1519 | 47.74% |
| TD-BFMT | 0.1272 | 60.20% | 0.0841 | 39.12% | 0.1399 | 43.97% |
| TD-GW | 0.0977 | 46.28% | 0.0977 | 45.44% | 0.1599 | 50.25% |
| TD-GW+OOV | 0.1414 | 66.98% | 0.1298 | 60.37% | 0.2310 | 72.60% |

Table VI. Cross Collection Results Submitted by Other NTCIR-6 Participants (D-runs/Rigid)

| GroupID | NTCIR-3 Mono | NTCIR-3 E-C | NTCIR-3 %Mono | NTCIR-4 Mono | NTCIR-4 E-C | NTCIR-4 %Mono | NTCIR-5 Mono | NTCIR-5 E-C | NTCIR-5 %Mono |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PIRCS | 0.294 | 0.175 | 59.52% | 0.218 | 0.142 | 65.14% | 0.374 | 0.242 | 64.71% |
| NCUTW | 0.191 | 0.072 | 37.70% | 0.12 | 0.063 | 52.50% | 0.275 | 0.179 | 65.09% |
| AINLP | 0.22 | 0.094 | 42.73% | 0.142 | 0.062 | 43.66% | 0.297 | 0.155 | 52.19% |
| RALI | 0.198 | 0.12 | 60.61% | 0.171 | 0.102 | 59.65% | 0.281 | 0.128 | 45.55% |
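The tendency of hub and authority scores to coincide on an undirected graph can be checked numerically. The sketch below runs a generic HITS power iteration on a small symmetric graph; the iteration count, the graph, and the function names are illustrative assumptions, and this is not the implementation used in our experiments. On a connected, non-bipartite undirected graph such as this one, both score vectors converge to the same principal eigenvector of the adjacency matrix.

```python
import math


def _normalise(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {node: v / norm for node, v in scores.items()}


def hits(adj, iterations=50):
    """Generic HITS on a graph given as {node: {neighbour: weight}}."""
    nodes = sorted(adj)
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at this node.
        auth = _normalise(
            {n: sum(hub[m] * adj[m].get(n, 0.0) for m in nodes) for n in nodes}
        )
        # Hub: sum of authority scores of the nodes this node points at.
        hub = _normalise(
            {n: sum(adj[n].get(m, 0.0) * auth[m] for m in nodes) for n in nodes}
        )
    return hub, auth


# A triangle a-b-c with one extra node d attached to c (every edge is
# listed in both directions, so the graph is undirected).
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
hub_scores, auth_scores = hits(graph)
```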

For confirmation of these findings we repeated this experiment using the NTCIR-5 document collection (see Table IX). However, the results we obtained failed to verify random walk as the undisputed centrality measurement of choice (i.e., retrieval runs using the HUB centrality measurement performed much more strongly on this set of documents than on the previous collection). A possible conclusion that could be drawn from the results summarized in these two tables is that more research with respect to the correct centrality measurement will be necessary in the future. It is likely that this process will involve trials of a metric which combines the more useful elements of the random walk method with consideration of graph hubs and authorities.



Table VII. Comparison of Various Retrieval Runs Over the NTCIR-3, 4, 5 Test Collections

RunID        NTCIR-3                NTCIR-4                NTCIR-5
             Impro.     Impro.      Impro.     Impro.      Impro.     Impro.
             Over COM   Over BFMT   Over COM   Over BFMT   Over COM   Over BFMT

T-BFMT       29.53%     N/A         15.10%     N/A         -7.02%     N/A
T-GW         15.27%     -11.01%     3.44%      -10.13%     21.60%     30.78%
T-GW+OOV     117.79%    68.13%      72.32%     49.71%      64.09%     76.48%

D-BFMT       29.53%     N/A         -1.59%     N/A         -21.40%    N/A
D-GW         30.36%     0.65%       7.70%      9.45%       -10.12%    14.36%
D-GW+OOV     114.76%    65.81%      76.23%     79.08%      43.35%     82.38%

TD-BFMT      63.08%     N/A         -1.98%     N/A         -7.90%     N/A
TD-GW        25.26%     -23.19%     13.87%     16.17%      5.27%      14.30%
TD-GW+OOV    81.28%     11.16%      51.28%     54.34%      52.07%     65.12%
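The figures in Table VII appear consistent with a relative MAP improvement computed from the scores in Table V. The following minimal sketch reproduces one cell under that assumption; the helper name `relative_improvement` and the dictionary layout are illustrative only.

```python
def relative_improvement(map_run, map_baseline):
    """Relative MAP improvement of a run over a baseline, in percent."""
    return 100.0 * (map_run - map_baseline) / map_baseline

# MAP values copied from the NTCIR-3 Title-run column of Table V.
ntcir3_title_map = {"T-COM": 0.0596, "T-BFMT": 0.0772, "T-GW+OOV": 0.1298}

over_com = round(
    relative_improvement(ntcir3_title_map["T-GW+OOV"], ntcir3_title_map["T-COM"]), 2
)
over_bfmt = round(
    relative_improvement(ntcir3_title_map["T-GW+OOV"], ntcir3_title_map["T-BFMT"]), 2
)
print(f"T-GW+OOV over COM: {over_com}%, over BFMT: {over_bfmt}%")
```

Under this reading, the T-GW+OOV row for NTCIR-3 (117.79% over COM, 68.13% over BFMT) follows directly from the corresponding MAP scores.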

Table VIII. Comparison of Alternative Centrality Measures Using NTCIR-6 Collection

RunID        Relax                       Rigid
             MAP      R-Prec   P@10      MAP      R-Prec   P@10

T-GW-AUTH    0.1014   0.1296   0.1840    0.0807   0.1049   0.1080
T-GW-HUB     0.0941   0.1210   0.1560    0.0703   0.0946   0.0860
T-GW-PR      0.1022   0.1295   0.1740    0.0768   0.1021   0.1000

D-GW-AUTH    0.0912   0.1190   0.1460    0.0720   0.0970   0.1040
D-GW-HUB     0.1038   0.1397   0.1840    0.0824   0.1071   0.1180
D-GW-PR      0.1186   0.1481   0.1940    0.0849   0.1095   0.1240

TD-GW-AUTH   0.1391   0.1668   0.2340    0.1057   0.1259   0.1420
TD-GW-HUB    0.1368   0.1717   0.2420    0.0998   0.1279   0.1440
TD-GW-PR     0.1412   0.1704   0.2320    0.0983   0.1299   0.1380

7.5 Translation Success Rate and Error Analysis

There were 22, 29, and 40 OOV terms in the NTCIR-3, NTCIR-4, and NTCIR-5 query sets respectively.¹³ Most of these terms were proper nouns or acronyms (see Table X for examples taken from NTCIR-5). Our pattern matching algorithm, in combination with the graph-based disambiguation protocol, successfully translated 18, 23, and 28 of the OOV terms, meaning our translations perfectly matched the manual translations provided by the various NTCIR committees. In our opinion, OOV translation success rates of 81.8%, 79.3%, and 70.0% in return for a meager expenditure of resources emphatically validate the use of linguistic patterns alongside traditional statistical analysis.
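The success rates quoted above are simply the ratio of correctly translated OOV terms to the total number of OOV terms per query set. A minimal sketch of that arithmetic follows; the variable and function names are illustrative.

```python
# OOV counts reported above: (total OOV terms, correctly translated OOV terms).
oov_counts = {
    "NTCIR-3": (22, 18),
    "NTCIR-4": (29, 23),
    "NTCIR-5": (40, 28),
}

def success_rate(total, translated):
    """Percentage of OOV terms whose translation matched the manual one."""
    return 100.0 * translated / total

rates = {
    name: round(success_rate(total, translated), 1)
    for name, (total, translated) in oov_counts.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```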

The OOV terms which were not successfully translated may have been out of date. Our method collects all translation candidates from the contemporaneous Web. The query terms we worked with are several years old. It could be that the persons, organizations, or acronyms which are referred to in those query sets are no longer as prominent on the Web as they once were. This would inevitably have a negative impact on our ability to generate appropriate translation candidates. Furthermore, extremely well-known English terms are sometimes used directly in Chinese Web pages (i.e., without an accompanying translation). This would obviously have a negative impact on our pattern-based translation technique.

¹³ We did not count the OOV terms in the NTCIR-6 query set as it reused query sets from previous years.

Fig. 11. Cross-collection analysis of selected experimental runs.

Table IX. Comparison of Alternative Centrality Measures Using NTCIR-5 Collection

RunID        Relax                       Rigid
             MAP      R-Prec   P@10      MAP      R-Prec   P@10

T-GW-AUTH    0.1150   0.1138   0.1600    0.1064   0.1079   0.1280
T-GW-HUB     0.1343   0.1505   0.2200    0.1368   0.1416   0.1680
T-GW-PR      0.1064   0.1093   0.1480    0.0961   0.0918   0.1160

D-GW-AUTH    0.1337   0.1512   0.1920    0.0900   0.1025   0.1340
D-GW-HUB     0.1259   0.1387   0.1720    0.1155   0.1316   0.1540
D-GW-PR      0.1158   0.1297   0.1920    0.1061   0.1180   0.1360

TD-GW-AUTH   0.1779   0.1874   0.2420    0.1563   0.1615   0.2040
TD-GW-HUB    0.1840   0.1980   0.2520    0.1599   0.1624   0.2080
TD-GW-PR     0.1723   0.1754   0.2560    0.1537   0.1674   0.1960

7.6 Encoding Issues

The last issue to be addressed in this section relates to character encoding. As discussed in Section 6.3, all of the document collections used in our experiments, together with the bilingual dictionary used for naive translation, were converted to UTF-8 to provide a unified translation and retrieval environment. However, this harmonization of character encoding sets, while necessary in



Table X. Description of OOV Terms in the NTCIR-5 Query Set

Category                      Count   Examples

Names of Individuals          18      Alberto Fujimori, Harry Potter
Acronyms and Abbreviations    9       G8, AOL
Names of Organisations        4       Warner, Taliban
Object Names                  3       Kirsk
Names of Places               3       Kosovo
Scientific Fields             3       Nanotechnology

Total                         40

Table XI. NTCIR-5 T-runs Monolingual Performance Using Different Character Encodings

Encoding   MAP(Relax)   MAP(Rigid)

UTF-8      0.3436       0.2923
BIG5       0.4294       0.3692
GB2312     0.4131       0.355

terms of experimental design, should be taken into consideration when evaluating our results. We conducted a final retrieval experiment to illustrate this problem. Table XI shows the negative effect of successive transcoding of the original NTCIR-5 document collection (BIG5 to GB2312 to UTF-8) on monolingual performance in Title field runs. As illustrated, each reencoding of the document set results in a significant drop in retrieval effectiveness. Clearly, this issue of character encoding represents a major challenge for the CLIR research community, and it is a problem which as yet lacks a standard, verified response.
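To put the figures in Table XI in proportion, the following minimal sketch computes the relative loss in MAP at each transcoding step, measured against the original BIG5 collection. The names `map_by_encoding` and `relative_drop` are illustrative.

```python
# MAP values copied from Table XI (NTCIR-5 monolingual Title runs).
map_by_encoding = {
    "BIG5": {"relax": 0.4294, "rigid": 0.3692},
    "GB2312": {"relax": 0.4131, "rigid": 0.355},
    "UTF-8": {"relax": 0.3436, "rigid": 0.2923},
}

def relative_drop(original, transcoded):
    """Relative loss in MAP with respect to the original encoding, in percent."""
    return 100.0 * (original - transcoded) / original

baseline = map_by_encoding["BIG5"]
drops = {
    encoding: {
        measure: round(relative_drop(baseline[measure], score), 2)
        for measure, score in scores.items()
    }
    for encoding, scores in map_by_encoding.items()
    if encoding != "BIG5"
}
for encoding, by_measure in drops.items():
    print(encoding, by_measure)
```

On these numbers the first conversion (BIG5 to GB2312) costs a little under 4% of MAP, while the fully transcoded UTF-8 collection loses roughly 20% relative to the BIG5 original.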

8. CONCLUSIONS

In this article we have described a hybrid technique for query translation which can be used for English-Chinese CLIR. This hybrid technique marries a graph-based model for the resolution of translation ambiguity with a pattern-based method for the translation of unknown terms. The combination of these two techniques performed well on the NTCIR-6 test collection, delivering statistically significant improvements over various baseline systems. The robustness of this hybridized approach to query translation was confirmed during extensive cross-collection analysis using earlier NTCIR document collections.

Use of the various NTCIR document collections during our experiments has led to some interesting observations. There seems to be a distinct difference between NTCIR collections and the TREC alternatives commonly used by researchers in this field. Cross-language retrieval performances in excess of monolingual performance have frequently been reported by individuals working with the TREC document collections. By contrast, the norm for cross-language retrieval systems using the NTCIR English-Chinese document collections is much lower (usually within the range of 40% to 65% of monolingual performance). Future work is currently being planned that will involve a side-by-side examination of the TREC and NTCIR document sets to investigate this inconsistency.



ACKNOWLEDGMENTS

Our thanks to the WebTech Group of the University of Nottingham for many useful discussions.


Received December 2007; revised February 2008; accepted March 2008