Information Retrieval and Web Search
Cross Language Information Retrieval
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300
Some of the slides are from a course taught by Doug Oard at U. Maryland
The General Problem
Find documents written in any language
  – Using queries expressed in a single language
Why Do Cross-Language IR?
• When users can read several languages
  – Eliminates multiple queries
  – Query in most fluent language
• Monolingual users can also benefit
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images
Source: Michael Lesk, How Much Information is there in the World?
Supply Side: Internet Hosts
Language   Hosts        Domains
English    33,878,764   .com .net .edu .us .mil .uk .ca .au .org .gov .nz .ie
Japanese    1,686,534   .jp
German      1,684,396   .de .at .ch
French        653,916   .fr .be
Dutch         564,129   .nl
Finnish       546,244   .fi
Spanish       473,422   .es .mx .ar .cl .co .uy
Chinese       458,509   .tw .hk .sg .cn
Swedish       431,809   .se
Source: Network Wizards Jan 99 Internet Domain Survey
Guess – What will be the most widely used language on the Web in 2010?
Demand Side: Number of Speakers
Language      Speakers
Chinese       885,000,000
English       450,000,000
Hindi-Urdu    333,000,000
Spanish       266,000,000
Portuguese    175,000,000
Bengali       162,000,000
Russian       153,000,000
Arabic        150,000,000
Japanese      126,000,000
French        122,000,000
Source: http://www.g11n.com/faq.html
Search Technology
[Diagram. Boxes: Language Identification; English Feature Assignment; Chinese Feature Assignment (documents); Chinese Feature Assignment (Chinese Query); Cross-Language Matching; Monolingual Chinese Matching. Ranked outputs: 3: 0.91, 4: 0.57, 5: 0.36 and 1: 0.72, 2: 0.48]
Language Identification
• Can be specified using metadata
  – Included in HTTP and HTML
• Can be determined using word-scale features
  – Which dictionary gets the most hits?
• Can be determined using subword features
  – Letter n-grams, for example
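The subword approach above can be sketched in a few lines. This is a minimal illustration, not a production identifier: it builds letter trigram counts per language and picks the closest profile by cosine similarity. The function names, the choice of trigrams and cosine, and the tiny training sentences are all assumptions made for the example; a usable profile would need far more text per language.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of overlapping character n-grams, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_profiles(samples, n=3):
    """Map each language label to the n-gram counts of its sample text."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_language(text, profiles, n=3):
    """Return the label whose n-gram profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Tiny hypothetical training samples, for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = build_profiles(SAMPLES)

print(identify_language("the dog sleeps in the house", PROFILES))
```

The word-scale alternative on the slide (which dictionary gets the most hits) has the same shape: replace `char_ngrams` with a whitespace tokenizer and each profile with a word list.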
Design Decisions
• What to index?
  – Free text or controlled vocabulary
• What to translate?
  – Queries or documents
• Where to get translation knowledge?
Query Vector Translation
[Diagram: Chinese Query Features -> Query (Vector) Translation -> Monolingual English Matching <- English Document Features. Ranked output: 3: 0.91, 4: 0.57, 5: 0.36]
Document Vector Translation
[Diagram: English Document Features -> Document (Vector) Translation -> Monolingual Chinese Matching <- Chinese Query Features. Ranked output: 3: 0.91, 4: 0.57, 5: 0.36]
Matching Interlingual Representations
[Diagram: Chinese Query Features -> Query “Folding In” -> Interlingual Matching <- Document “Folding In” <- English Document Features. Ranked output: 3: 0.91, 4: 0.57, 5: 0.36]
Query vs. Document Translation
• Query translation
  – Very efficient for short queries
    • Not as big an advantage for relevance feedback
  – Hard to resolve ambiguous query terms
• Document translation
  – May be needed by the selection interface
    • And supports adaptive filtering well
  – Slow, but only need to do it once per document
    • Poor scale-up to large numbers of languages
Cross-Language Text Retrieval
[Diagram: taxonomy of approaches. Node labels, by row: Query Translation, Document Translation; Text Translation, Vector Translation; Controlled Vocabulary, Free Text; Knowledge-based, Corpus-based; Ontology-based, Dictionary-based, Thesaurus-based; Parallel, Comparable; Term-aligned, Sentence-aligned, Document-aligned, Unaligned]
Translation Knowledge
• A lexicon
  – e.g., extract term list from a bilingual dictionary
• Corpora
  – Parallel or comparable, linked or unlinked
• Algorithmic
  – e.g., transliteration rules, cognate matching
• The user
Types of Lexicons
• Ontology
  – Representation of concepts and relationships
• Thesaurus
  – Ontology specialized for retrieval
• Bilingual lexicon
  – Ontology specialized for machine translation
• Bilingual dictionary
  – Ontology specialized for human translation
Multilingual Thesauri
• Adapt the knowledge structure
  – Cultural differences influence indexing choices
• Use language-independent descriptors
  – Matched to a unique term in each language
• Three construction techniques
  – Build it from scratch
  – Translate an existing thesaurus
  – Merge monolingual thesauri
Machine Readable Dictionaries
• Based on printed bilingual dictionaries
  – Becoming widely available
• Used to produce bilingual term lists
  – Cross-language term mappings are accessible
    • Sometimes listed in order of most common usage
  – Some knowledge structure is also present
    • Hard to extract and represent automatically
• The challenge is to pick the right translation
Unconstrained Query Translation
• Replace each word with every translation
  – Typically 5-10 translations per word
• About 50% of monolingual effectiveness
  – Ambiguity is a serious problem
  – Example: Fly (English)
    • 8 word senses (e.g., to fly a flag)
    • 13 Spanish translations (enarbolar, ondear, …)
    • 38 English retranslations (hoist, brandish, lift…)
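As a rough sketch of what "replace each word with every translation" means in practice, the function below expands a query through a bilingual term list. The term list contents, the function name, and the fallback of keeping unknown words unchanged are assumptions for illustration; only enarbolar and ondear come from the slide.

```python
# Hypothetical English-to-Spanish term list; entries are illustrative only.
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear", "mosca"],
    "flag": ["bandera", "pabellón"],
}


def translate_query_unconstrained(query, term_list):
    """Replace each query word with every listed translation.

    Words missing from the term list are kept unchanged, which is one
    possible fallback for names and other out-of-vocabulary terms.
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


print(translate_query_unconstrained("fly flag", TERM_LIST))
```

Even this two-word query grows to six target terms, several of them from the wrong sense of "fly", which is the ambiguity problem the slide describes.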
Exploiting Part-of-Speech Tags
• Constrain translations by part of speech
  – Noun, verb, adjective, …
  – Effective taggers are available
• Works well when queries are full sentences
  – Short queries provide little basis for tagging
• Constrained matching can hurt monolingual IR
  – Nouns in queries often match verbs in documents
  – This is why stemming usually improves performance
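One way to picture the part-of-speech constraint: key the term list by (word, tag) and look up only the translations that match the query word's tag. The sketch below assumes the query has already been tagged by some external tagger; the tag names, the term list, and the back-off to all translations when the tag is unknown are assumptions, not something the slides specify.

```python
# Hypothetical term list keyed by (word, part of speech); entries are illustrative.
POS_TERM_LIST = {
    ("fly", "VERB"): ["volar", "ondear", "enarbolar"],
    ("fly", "NOUN"): ["mosca"],
    ("flag", "NOUN"): ["bandera"],
}


def translate_tagged_query(tagged_query, pos_term_list):
    """Translate (word, tag) pairs, keeping only translations with a matching tag."""
    translated = []
    for word, pos in tagged_query:
        key = word.lower()
        candidates = pos_term_list.get((key, pos))
        if candidates is None:
            # Back off to every translation of the word when the tag is not listed.
            candidates = [
                target
                for (source, _), targets in pos_term_list.items()
                if source == key
                for target in targets
            ]
        # Keep the original word if the term list has no entry at all.
        translated.extend(candidates or [word])
    return translated


print(translate_tagged_query([("fly", "NOUN")], POS_TERM_LIST))
```

With the tag available, "fly" as a noun yields one translation instead of four; with a short, untaggable query the function falls back to the unconstrained behavior.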
Phrase Indexing
• Improves retrieval effectiveness two ways
  – Phrases are less ambiguous than single words
  – Idiomatic phrases translate as a single concept
• Three ways to identify phrases
  – Semantic (e.g., appears in a dictionary)
  – Syntactic (e.g., parse as a noun phrase)
  – Cooccurrence (words found together often)
• Semantic phrase results are impressive
Types of Bilingual Corpora
• Parallel corpora: translation-equivalent pairs
  – Document pairs
  – Sentence pairs
  – Term pairs
• Comparable corpora
  – Content-equivalent document pairs
  – E.g. newspaper articles in different languages, on the same day (for the same event)
• Unaligned corpora
  – Content from the same domain
Pseudo-Relevance Feedback
• Enter query terms in French
• Find top French documents in parallel corpus
• Construct a query from English translations
• Perform a monolingual free text search
[Diagram: French Query Terms -> French Text Retrieval System (over the Parallel Corpus) -> Top ranked French Documents -> English Translations -> Alta Vista -> English Web Pages]
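The four steps above can be sketched as a small pipeline. This is only an illustration under simple assumptions: the parallel corpus is a list of (French, English) text pairs, French documents are ranked by raw term overlap, and the English query is built from the most frequent non-stopword terms of the paired English sides. The corpus, the stopword list, and all function names are hypothetical; the final monolingual search is left to whatever English engine is available.

```python
from collections import Counter

# Hypothetical three-pair parallel corpus (French side, English side).
PARALLEL_CORPUS = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("le chat noir chasse la souris", "the black cat chases the mouse"),
    ("la bourse de paris est en hausse", "the paris stock market is rising"),
]
ENGLISH_STOPWORDS = {"the", "on", "is", "a", "of"}


def overlap_score(query_terms, text):
    """Count how many tokens of the text are query terms."""
    return sum(1 for token in text.split() if token in query_terms)


def english_query_from_french(french_terms, corpus, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top French documents."""
    query_terms = set(french_terms)
    ranked = sorted(
        corpus,
        key=lambda pair: overlap_score(query_terms, pair[0]),
        reverse=True,
    )
    counts = Counter()
    for french, english in ranked[:top_docs]:
        if overlap_score(query_terms, french) == 0:
            continue  # ignore documents that do not match the query at all
        counts.update(t for t in english.split() if t not in ENGLISH_STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]


print(english_query_from_french(["chat"], PARALLEL_CORPUS))
```

The returned term list would then be submitted as an ordinary English free text query.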
Learning From Document Pairs
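• Count how often each term occurs in each pair
  – Treat each pair as a single document
[Table: term counts per document pair. Rows: Doc 1 to Doc 5. Columns: English terms E1 to E5 and Spanish terms S1 to S4. Nonzero counts by row; column positions not preserved]
Doc 1: 4 2 2 1
Doc 2: 8 4 4 2
Doc 3: 2 2 2 1
Doc 4: 2 1 2 1
Doc 5: 4 1 2 1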
Similarity-Based Dictionaries
• Automatically developed from aligned documents
  – Terms E1 and E3 are used in similar ways
    • Terms E1 & S1 (or E3 & S4) are even more similar
• For each term, find most similar in other language
  – Retain only the top few (5 or so)
• Performs as well as dictionary-based techniques
  – Evaluated on a comparable corpus of news stories
    • Stories were automatically linked based on date and subject
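A minimal sketch of how such a dictionary could be derived: represent each term by its counts across the aligned document pairs, compare terms across languages, and keep the top few matches. Cosine similarity is used here as a stand-in, since the slides do not name the similarity measure, and the count vectors below are made up for the example rather than taken from the slide's table.

```python
import math

# Hypothetical counts: each list holds a term's frequency in four aligned
# document pairs (each pair treated as a single document).
ENGLISH_VECTORS = {"E1": [4, 8, 0, 0], "E2": [0, 0, 2, 1]}
SPANISH_VECTORS = {"S1": [2, 4, 0, 0], "S2": [0, 0, 2, 2], "S3": [1, 0, 0, 3]}


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_dictionary(source_vectors, target_vectors, top_n=5):
    """Map each source term to its top_n most similar target terms."""
    dictionary = {}
    for term, vector in source_vectors.items():
        scored = [
            (other, cosine(vector, target))
            for other, target in target_vectors.items()
        ]
        # Drop terms that never co-occur, then keep the best matches first.
        scored = [(other, score) for other, score in scored if score > 0]
        scored.sort(key=lambda item: item[1], reverse=True)
        dictionary[term] = scored[:top_n]
    return dictionary


for term, matches in similarity_dictionary(ENGLISH_VECTORS, SPANISH_VECTORS).items():
    print(term, [(other, round(score, 2)) for other, score in matches])
```

In this toy data E1 and S1 rise and fall together across the document pairs, so S1 comes out as the best candidate translation for E1.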
Latent Semantic Indexing
• Designed for better monolingual effectiveness
  – Works well across languages too
    • Cross-language is just a type of term choice variation
• Produces short dense document vectors
  – Better than long sparse ones for adaptive filtering
    • Training data needs grow with dimensionality
  – Not as good for retrieval efficiency
    • Always 300 multiplications, even for short queries
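For reference, the usual truncated-SVD formulation of LSI is sketched below. The symbols are notation introduced here, not taken from the slides: $X$ is the term-by-document matrix (for cross-language use, one plausible setup has terms from both languages as rows and merged document pairs as columns), $k$ is the number of retained dimensions, $q$ is a query's term vector, and $\hat{d}$ is a document's $k$-dimensional vector. Reading the slide's "300" as $k = 300$ is an assumption.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(\hat{q}, \hat{d}) =
  \frac{\hat{q}^{\top} \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Because $\hat{q}$ and $\hat{d}$ are dense $k$-dimensional vectors, each query-document inner product costs $k$ multiplications no matter how few terms the original query had, which is the efficiency concern in the last bullet.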
Sentence-Aligned Parallel Corpora
• Easily constructed from aligned documents
  – Match pattern of relative sentence lengths
• Not yet used directly for effective retrieval
  – But all experiments have included domain shift
• Good first step for term alignment
  – Sentences define a natural context
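The idea of matching patterns of relative sentence lengths can be sketched with a small dynamic program. This is a deliberately simplified stand-in, not a published alignment algorithm: the cost of each alignment step is the relative difference in character length, plus a flat penalty (an arbitrary 0.2 here) for anything other than a one-to-one match. The function name, the penalty, and the set of allowed moves are assumptions for illustration.

```python
def align_by_length(source, target, penalty=0.2):
    """Align two sentence lists by minimizing relative length mismatch.

    Returns a list of (source_sentences, target_sentences) groups.
    """
    src_len = [len(s) for s in source]
    tgt_len = [len(t) for t in target]
    # Allowed steps: 1-1 match, deletion, insertion, 2-1 merge, 1-2 split.
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    inf = float("inf")
    n, m = len(source), len(target)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                a, b = sum(src_len[i:ni]), sum(tgt_len[j:nj])
                step = abs(a - b) / max(a, b, 1)
                if (di, dj) != (1, 1):
                    step += penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    groups.reverse()
    return groups


pairs = align_by_length(["a" * 10, "b" * 10, "c" * 4], ["x" * 20, "y" * 4])
for src, tgt in pairs:
    print(len(src), "source sentence(s) <->", len(tgt), "target sentence(s)")
```

In the usage example the two short source sentences are merged against the one long target sentence because that grouping gives the smallest length mismatch.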
Cooccurrence-Based Translation
• Align terms using cooccurrence statistics
  – How often does a term pair occur in sentence pairs?
    • Weighted by relative position in the sentences
  – Retain term pairs that occur unusually often
• Useful for query translation
  – Excellent results when the domain is the same
• Also practical for document translation
  – Term usage reinforces good translations
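A minimal sketch of cooccurrence-based term alignment follows. It counts how often each source/target term pair appears in the same sentence pair and keeps pairs whose Dice coefficient is high. The Dice coefficient and the 0.9 threshold are choices made for this example as one way to express "occurs unusually often"; the position weighting mentioned on the slide is omitted, and the sentence pairs are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned English/Spanish pairs.
SENTENCE_PAIRS = [
    ("the cat sleeps", "el gato duerme"),
    ("the dog sleeps", "el perro duerme"),
    ("the cat eats", "el gato come"),
]


def dice_alignments(sentence_pairs, threshold=0.9):
    """Return term pairs whose Dice coefficient meets the threshold."""
    source_freq, target_freq, pair_freq = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        source_terms, target_terms = set(source.split()), set(target.split())
        source_freq.update(source_terms)
        target_freq.update(target_terms)
        pair_freq.update(product(source_terms, target_terms))
    alignments = {}
    for (s, t), joint in pair_freq.items():
        # Dice: 2 * joint count over the sum of the individual counts.
        dice = 2 * joint / (source_freq[s] + target_freq[t])
        if dice >= threshold:
            alignments[(s, t)] = dice
    return alignments


for (s, t), score in sorted(dice_alignments(SENTENCE_PAIRS).items()):
    print(s, "->", t, round(score, 2))
```

On realistic data, pairs seen only once would need a minimum-count filter as well, since a single chance cooccurrence also scores 1.0.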
Exploiting Unaligned Corpora
• Documents about the same set of subjects
  – No known relationship between document pairs
  – Easily available in many applications
• Two approaches
  – Use a dictionary for rough translation
    • But refine it using the unaligned bilingual corpus
  – Use a dictionary to find alignments in the corpus
    • Then extract translation knowledge from the alignments
CLIR Evaluation Resources
• Electronic texts
  – Text Retrieval Conference (E, F, G, I)
  – Topic Detection and Tracking (E, C)
• Document images
  – No evaluation programs yet
• Recorded speech
  – Topic Detection and Tracking (E, C)
• Sign language
  – No evaluation programs yet
• CLEF Evaluation
  – http://clef.iei.pi.cnr.it:2002/