Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval
Doctorate Course: Web Information Retrieval
Speaker: Gaia Trecarichi
Outline
What is Multilingual Information Retrieval (MLIR)
Basic Approaches to MLIR
Xerox Experimental Approach
Resource Requirements for MLIR
Experimental Results
Conclusions and Future Extensions
Detailed Query Analysis
Sample Query Profile
Goal
The goal IS NOT to build a fully functional MLIR system (too much time and resources needed)
The goal IS to explore the most important factors in making MLIR effective
5 Definitions for MLIR
1. IR in any language other than English
2. IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language
3. IR on a monolingual document collection which can be queried in multiple languages
4. IR on a multilingual document collection, where queries can retrieve documents in multiple languages
5. IR on multilingual documents, i.e. more than one language can be present in the individual documents
Basic Approaches to MLIR
IR systems rank documents according to statistical similarity measures based on the co-occurrence of terms in queries and documents
MLIR therefore needs a mechanism for query or document translation
• Query translation is easier but doesn't provide much context
• Document translation could be better but is costly (time, storage resources)
Techniques for the problem of interlingual term correspondence
• Term Vector Translation
• Text Translation
• Latent Semantic Coindexing
Text Translation
High-end approach to MLIR (NLP and text generation techniques)
Direct mapping of the query from the source language into one or more target languages by using an MT system
Direct resolution of ambiguity by using structural information from the source language text
PRO
• Extensive body of research on MT
• Commercial products available
CONS
• Low performance of current MT systems [Radwan, 1994]
Term Vector Translation
Direct mapping of each word in the query written in the source language into all of its possible definitions in the target languages
Uses transfer dictionaries or a parallel aligned corpus for the direct mapping
Vector space models can be used as retrieval strategies
Issues related to term weighting strategies
• Should each term be weighted according to the number of translations?
• Should more common translations be weighted proportionally higher?
• What resources do we use to obtain this information?
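The first weighting question can be made concrete with a minimal sketch. It is not the method from the talk: the dictionary entries are toy values echoing the sample query shown later in the slides, and the function name and the `normalize` flag are hypothetical. With normalization, a source term with n translations gives each of them a weight of 1/n; without it, every translation counts fully.

```python
from collections import defaultdict

# Hypothetical toy French -> English transfer dictionary (illustrative entries only).
TRANSFER_DICT = {
    "intention": ["intention", "benefit"],
    "constitution": ["formation", "settlement", "constitution"],
    "usa": ["USA"],
}


def translate_term_vector(source_terms, dictionary, normalize=True):
    """Map each source term to all of its translations.

    With normalize=True, a source term with n translations contributes
    1/n to each of them, so ambiguous terms do not dominate the query.
    With normalize=False, every translation gets weight 1.0.
    """
    weights = defaultdict(float)
    for term in source_terms:
        translations = dictionary.get(term.lower(), [])
        if not translations:
            continue  # untranslatable term: dropped in this sketch
        share = 1.0 / len(translations) if normalize else 1.0
        for target in translations:
            weights[target] += share
    return dict(weights)


query = ["intention", "constitution", "USA"]
weighted = translate_term_vector(query, TRANSFER_DICT)
unweighted = translate_term_vector(query, TRANSFER_DICT, normalize=False)
```

In the normalized vector each source term carries a total weight of 1.0, so the unambiguous "USA" outweighs each of the three candidate translations of "constitution". Weighting common translations higher would need translation frequencies, which a plain dictionary does not supply.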
Latent Semantic Coindexing
Indirect derivation of the query translation by using a training corpus
Uses Singular Value Decomposition of a parallel document collection to obtain term vector representations
Term vector representations are comparable across all the languages of the collection (documents are represented as language-independent numerical vectors)
A query can retrieve a relevant document even if they have no words in common
Creates a reduced-dimension semantic space in which related terms are near each other
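The decomposition can be written out in the usual textbook LSI notation. The symbols are assumptions introduced here, not taken from the slides: X is the term-by-document matrix of the training collection, k is the number of retained dimensions, q is a query term vector in any of the collection's languages, and d-hat is a document vector already mapped into the reduced space.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\mathrm{sim}(\hat{q}, \hat{d}) = \frac{\hat{q} \cdot \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Because the query and the documents are compared in the k-dimensional space rather than term by term, a match no longer requires shared words.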
LSI vs Standard Vector Model
Standard Vector Model
• Treats words as if they are independent
• Represents documents as linear combinations of orthogonal terms
LSI
• Term-term inter-relationships are automatically modeled and used to improve retrieval by numerically analysing existing texts (no need for external dictionaries, thesauri or knowledge bases)
• Represents terms as continuous values on each of the k orthogonal indexing dimensions
Resource Requirements
Support for the character set of each language is needed
Facilities for automatic language recognition
Morphological analyzer (PoS recognition, stemming algorithms, inflectional analyzers)
• Crucial to find term entries in bilingual dictionaries
• Ex: the German word Weingärtnergenossenschaften is analyzed as the feminine plural noun Wein#Gärtner#Genosse(n)#schaft
Resources for query translation
• Machine translation system
• Transfer dictionaries
• Parallel texts and/or monolingual domain-specific corpora
Resources for Query Translation
MT system
• For direct query translation
Transfer dictionaries (bilingual thesauri)
• For direct term vector translation
• Extracted from bilingual general dictionaries, which include lots of "noise" vocabulary
Parallel texts
• To extract relationships between terms for term vector translation or to get indirect query translation (e.g. LSI)
Domain-specific monolingual corpora
• Source of terminology to be used when parallel texts are not available
Transfer Dictionaries vs Parallel Texts
Transfer Dictionaries
• Conversion from bilingual dictionaries is a non-trivial effort
• Provide broad but shallow coverage of the language
• Translation probabilities are not available
• Most technical terminology is missing
Parallel Corpora
• Needed in large quantity to train statistical models of great sophistication
• Generate term translation vectors with probabilities [Brown, 1993]
• Provide narrow but deep coverage (probabilities are domain specific)
Xerox Experimental Approach 1: Evaluation in Multilingual IR
Uses queries with known relevance judgments
Start with queries, documents, and relevance judgments in a single language
The queries are translated into another language by human translators
The translated queries are retranslated by the MLIR system
Results are compared to those of the original queries to get a good sense of the relative performance of the MLIR system
Xerox Experimental Approach 2: Experimental Setting
Translated French queries and English documents
Conversion of an online bilingual French => English dictionary to a word-based transfer dictionary suitable for text retrieval
TIPSTER text collection and queries 51-100 from the TREC experiments [Harman, 1995]
Term vector translation model
Bilingual transfer dictionary used to generate the model
Short version of the queries (average length of 7 words)
Xerox Experimental Approach 3: MLIR Process
1. The query is morphologically analyzed and each term is replaced by its inflectional root
2. Each root is looked up in the bilingual transfer dictionary, and a translated query is built by concatenating all term translations
3. The translated query is sent to a traditional monolingual IR system
Notes
• Specialized term weighting and ambiguity resolution in translation are ignored
• The vector space model is used to measure similarity between the query and each document
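The three steps can be sketched end to end. This is a toy illustration under stated assumptions, not the Xerox system: a lookup table stands in for the morphological analyzer, the dictionary entries and documents are invented, and similarity is a cosine over raw term counts rather than whatever weighting the real IR engine used.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the real resources: a root table instead of a
# morphological analyzer, and a tiny word-based transfer dictionary.
ROOTS = {"amendements": "amendement", "constitutions": "constitution"}
TRANSFER_DICT = {
    "amendement": ["amendment", "enrichment"],
    "constitution": ["formation", "settlement", "constitution"],
}


def analyze(query):
    """Step 1: replace each term by its inflectional root (table lookup here)."""
    return [ROOTS.get(tok, tok) for tok in query.lower().split()]


def translate(roots):
    """Step 2: concatenate all translations of every root, unweighted."""
    translated = []
    for root in roots:
        translated.extend(TRANSFER_DICT.get(root, []))
    return translated


def cosine(a, b):
    """Cosine similarity between two bags of words (raw term counts)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents):
    """Step 3: rank target-language documents against the translated query."""
    translated = translate(analyze(query))
    scored = [(cosine(translated, doc.lower().split()), doc) for doc in documents]
    return sorted(scored, reverse=True)


docs = [
    "the first amendment to the constitution",
    "soil enrichment for the garden",
    "weather report for tuesday",
]
ranking = retrieve("amendements constitutions", docs)
```

The gardening document receives a non-zero score only because "enrichment" was added as an extraneous translation, which is the kind of ambiguity the process deliberately leaves unresolved.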
Experimental Results
Comparing the original English queries to three retranslations generated by different versions of the transfer dictionary (TD)
Three transfer dictionary versions: automatic word-based, manual word-based and manual multi-word transfer dictionary
Average precision at 5, 10, 15 and 20 documents retrieved for the original English queries and the translations given by the different TDs
Version                                   Average precision
Original English                          0.393
Automatic word-based transfer dictionary  0.235
Manual word-based transfer dictionary     0.269
Manual multi-word transfer dictionary     0.357
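One plausible reading of the measure is precision at each cutoff, averaged over the four cutoffs (and, in the experiment, over queries). The sketch below assumes that reading; the ranking and the relevant set are made up, and only the four scores at the end come from the slide.

```python
def precision_at(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def mean_precision(ranked_ids, relevant_ids, cutoffs=(5, 10, 15, 20)):
    """Average of precision at each cutoff for a single query."""
    values = [precision_at(ranked_ids, relevant_ids, k) for k in cutoffs]
    return sum(values) / len(values)


# Hypothetical ranking of 20 document ids and a made-up relevant set.
ranked = list(range(1, 21))
relevant = {1, 2, 4, 7, 12, 18}
score = mean_precision(ranked, relevant)

# Scores reported on the slide, relative to the original English queries.
reported = {
    "automatic word-based": 0.235,
    "manual word-based": 0.269,
    "manual multi-word": 0.357,
}
relative = {name: value / 0.393 for name, value in reported.items()}
```

Dividing the reported scores by the original-query score of 0.393 gives roughly 60%, 68% and 91% of monolingual performance for the three dictionaries.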
Detailed Query Analysis 1
Comparison of the performance of the translated (Tr) and original (Orig) English queries. Values given are the number of queries in each category
Performance          Automatic word-based  Manual word-based  Manual multi-word
Tr > Orig            1                     3                  4
Tr ~ Orig            19                    22                 26
Tr < Orig            22                    17                 12
  0.0 < Tr < Orig    10                    9                  9
  Tr = 0.0           12                    8                  3
Performance improves as more manual effort is applied to the dictionary construction process
Some queries perform much better in their translated versions
Detailed Query Analysis 2: Detailed Failure Analysis
Carried out on the worst 17 queries when using the word-based dictionary
9 queries lost information as a result of the failure to translate multi-word expressions correctly, 8 had problems due to ambiguity in translation (i.e. extraneous definitions added to the query), and 4 suffered from a loss in retranslation (meaning decays with repeated translations); a query can show more than one problem
Recognizing and translating multi-word expressions is crucial to success in MLIR (in contrast to monolingual IR)
Individual components of phrases often have very different meanings in translation, so the entire sense of the phrase is often lost
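A small sketch shows why phrase entries matter. The French example and both dictionaries are hypothetical illustrations, not data from the experiment, and greedy longest-match lookup is just one simple way to prefer a multi-word entry over its components.

```python
# Hypothetical dictionary entries; keys are tuples of source-language roots.
WORD_DICT = {
    ("pomme",): ["apple"],
    ("de",): ["of", "from"],
    ("terre",): ["earth", "ground", "land"],
}
PHRASE_DICT = {("pomme", "de", "terre"): ["potato"]}


def translate(tokens, dictionary, max_len=3):
    """Greedy longest-match lookup: try the longest phrase first at each position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                out.extend(dictionary[key])
                i += n
                break
        else:
            i += 1  # no entry at all: skip the token
    return out


tokens = ["pomme", "de", "terre"]
word_based = translate(tokens, WORD_DICT)
multi_word = translate(tokens, {**WORD_DICT, **PHRASE_DICT})
```

The word-based lookup yields six loosely related words and never produces "potato", while the multi-word dictionary returns the single correct term.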
Sample Query Profile 1
English: original intent or interpretation of amendments to the U.S. Constitution
French: l'intention première ou une interpretation d'un amendement de la constitution des USA
Term vector retranslation
• intention - intention benefit
• premier - first initial bottom early front top leading basic primary original
• interpretation - interpretation
• amendement - amendment enrichment enriching agent
• constitution - formation settlement constitution
• USA - USA
Sample Query Profile 2
Version    Precision  Reasons for decay
Orig Eng   0.54
LR         0.34       intent => intention, U.S. => USA
TA1        0.19       constitution, amendement
TA2        0.10       original, intention
Trans Eng  0.05
The decay in performance of query 76 from the original English (Orig Eng) to the translated English (Trans Eng) due to translation ambiguity (TA) and loss in retranslation (LR)
Conclusions
Two primary sources of error in the current MLIR system: missing translations of multi-word expressions and unresolved ambiguity in word-based translation
Additional loss-in-retranslation errors stem from the experimental design and cannot be avoided (i.e. the ambiguity introduced by the human translator)
Future Extensions
Improving automatically generated transfer dictionaries
Extracting multi-word expressions (MWE): gathering terminology lists from various specialized domains, performing terminology extraction from corpora
Resolving ambiguity (using target language texts, term weighting strategies, user interactive tools)
Using models other than the vector space model (e.g. weighted Boolean model)
THANK YOU!