Cross-Language Name Search
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search
Name Search
• Searching people directories by name.
[Screenshots: Facebook Friend Search; Outlook Address Book Search]
Cross-Language Name Search
• Searching people directories by name across languages.
[Screenshots: query in Russian; query in Hebrew]
Challenges
• Script and phonetic differences
• Large Directories
  – Millions of names
• Multi-word Names and Partial Matches
• Spelling Variations
Naive Approach
• Transliterate and Search
  – Rashid רשיד
• Limitations
  – Slow, as it involves the intermediate step of transliteration generation.
  – Machine Transliteration is not perfect
    • Transliteration errors affect search results
• Is Transliteration Generation necessary?
Our Approach
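[Figure: names in different scripts (רשיד, אנטוני, Rashid) are mapped to a language-independent geometric representation. Similarity is the cosine of the angle between the resulting vectors: cos θ ≈ 1 for matching names, cos θ ≈ −1 for mismatched names.]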
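Search Overview
[Figure: the directory names (Aaron, Bharat, Cecile, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar) and the query name רשיד are compared by geometric distance; results come from a geometric nearest-neighbor search.]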
What is the advantage?
• Can scale to reasonably large name directories
  – Compact geometric representation
    • 50-dimensional space
    • 6 M names
• Search is effective and efficient
  – Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
  – ~1 s per query for searching 6 M names
  – >20% improvement in MRR over Transliterate-and-Search
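The search step can be illustrated with a minimal sketch. It assumes names have already been projected into the common space, and it uses an exhaustive cosine scan as a stand-in for the ANN library cited on the slide. The 3-dimensional vectors, the `directory` dictionary, and the function names are made up for illustration and are not from the talk.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_names(query_vec, directory, k=3):
    """Return the k directory names most similar to query_vec.

    Exhaustive scan; a real system would use an approximate
    nearest-neighbor index instead.
    """
    scored = [(cosine(query_vec, vec), name) for name, vec in directory.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical projected vectors (the slides use 50 dimensions, not 3).
directory = {
    "Rashid": [0.9, 0.1, 0.3],
    "Rashmi": [0.8, 0.3, 0.1],
    "Albert": [-0.7, 0.6, 0.2],
    "Kumar": [0.1, -0.9, 0.4],
}
query = [0.88, 0.12, 0.31]  # hypothetical projection of a non-English query
print(nearest_names(query, directory, k=2))
```

An exhaustive scan is linear in the directory size, which is why the slides rely on ANN for a 6 M name directory.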
What is the challenge?
• Language/Script Independent Representation
  – Learning common geometric feature space from training data
• Multi-Word Names and Partial Matches
  – Maximum Weighted Bipartite Matching
Previous Work
• Language Independent Representation
  – Canonical Correlation Analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation, 2004.
• Transliteration Equivalence
  – Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL 2006.
  – Learning better transliterations. J. Pasternack and D. Roth, CIKM 2009.
  – Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR 2010.
Common Feature Space
[Figure: training data of parallel names (Aaron, Bharat, Rick, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar, each paired with its Kannada spelling) is used to learn a space in which parallel names have similar vectors.]
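Common Feature Space
Feature Vectors
English name "Rashid" (character bigrams, ^ = start of name, $ = end of name):
  ^R  Ra  as  sh  hi  id  d$  ic  …
  1   1   1   1   1   1   1   0   …
[The slide shows the corresponding binary bigram vector for the Kannada spelling of the same name (1 1 1 1 1 1 0 …); the Kannada bigrams were garbled in extraction.]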
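The bigram features on this slide can be reproduced with a short sketch. It assumes that `^` and `$` mark the start and end of the name, as the slide suggests, and that one "character" is one Unicode code point (for Kannada, a real system would more likely segment by grapheme cluster). The function names and the small `vocab` list are illustrative only.

```python
def char_bigrams(name):
    """Character bigrams of a name, with ^ and $ as boundary markers."""
    padded = "^" + name + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def feature_vector(name, vocabulary):
    """Binary vector: 1 if the bigram occurs in the name, else 0."""
    present = set(char_bigrams(name))
    return [1 if gram in present else 0 for gram in vocabulary]

# Vocabulary taken from the bigram columns shown on the slide.
vocab = ["^R", "Ra", "as", "sh", "hi", "id", "d$", "ic"]
print(char_bigrams("Rashid"))
print(feature_vector("Rashid", vocab))
```

In practice the vocabulary would be every bigram seen in the training names of a language, so each language gets its own (sparse, high-dimensional) feature space before the common space is learned.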
Learning Common Feature Space
Canonical Correlation Analysis (Hotelling, 1936)
[Figure: parallel training names (1) Aaron / אהרן, (2) Bernard / ברנאר, (3) David / דוד, (4) William / ויליאם. The points 1 to 4 are scattered differently in the two language-specific feature spaces; after Canonical Correlation Analysis, the two points of each pair lie close together in the common space.]
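For reference, the standard CCA objective can be written out as follows. This is the textbook formulation, not a transcription of the slides; the symbols x_i, y_i, a, b, A, B and d are notation introduced here, with d = 50 matching the dimensionality quoted earlier in the talk.

```latex
% x_i \in \mathbb{R}^p, y_i \in \mathbb{R}^q: mean-centered bigram feature
% vectors of the i-th parallel name pair, i = 1, \dots, N.
\[
C_{xx} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad
C_{yy} = \frac{1}{N}\sum_{i=1}^{N} y_i y_i^{\top}, \qquad
C_{xy} = \frac{1}{N}\sum_{i=1}^{N} x_i y_i^{\top} = C_{yx}^{\top}
\]
% CCA seeks directions a, b whose projections are maximally correlated:
\[
(a^{*}, b^{*}) = \arg\max_{a,\,b}\;
\frac{a^{\top} C_{xy}\, b}
     {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}}
\]
% The maximizers solve an eigenvalue problem, with \rho the correlation:
\[
C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, a = \rho^{2} a, \qquad
b \propto C_{yy}^{-1} C_{yx}\, a
\]
% Stacking the top d directions into A \in \mathbb{R}^{p \times d} and
% B \in \mathbb{R}^{q \times d} gives the projections into the common space:
\[
u = A^{\top} x, \qquad v = B^{\top} y
\]
```

With sparse bigram features the covariance matrices are usually regularized (for example by adding a small multiple of the identity) before inversion; whether the authors do so is not stated on the slides.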
Multi-Word Names
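[Figure: the two-word query קלי מליסה is matched word by word against the three-word directory name "Melissa Jane Kelly"; the two matched word pairs have similarities 0.97 and 0.91.]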
Score = Maximum Weighted Matching / (m – n + 1)
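The scoring rule can be sketched as follows, assuming m and n are the word counts of the longer and shorter name (the slide does not define them). The matching is found by brute force over permutations, which is adequate for names of a few words. Only the similarities 0.97 and 0.91 come from the slide; the other entries of `sim`, and the assignment of those two values to particular word pairs, are made up for illustration.

```python
from itertools import permutations

def max_weight_matching(sim):
    """Best total similarity when each row word is matched to a distinct
    column word. sim[i][j] is the similarity between word i of the shorter
    name and word j of the longer name. Brute force over assignments."""
    n, m = len(sim), len(sim[0])
    if n > m:
        raise ValueError("rows must belong to the shorter name")
    best = 0.0
    for cols in permutations(range(m), n):
        total = sum(sim[i][cols[i]] for i in range(n))
        best = max(best, total)
    return best

def name_score(sim):
    """Score = maximum weighted matching / (m - n + 1)."""
    n, m = len(sim), len(sim[0])
    return max_weight_matching(sim) / (m - n + 1)

# Rows: the two query words; columns: Melissa, Jane, Kelly.
# Off-match values are hypothetical.
sim = [
    [0.10, 0.05, 0.91],
    [0.97, 0.20, 0.08],
]
print(round(name_score(sim), 2))  # (0.91 + 0.97) / (3 - 2 + 1) = 0.94
```

The denominator penalizes a difference in word counts, so a partial match against a longer directory name scores lower than an exact match of the same words.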
Experimental Setup
• Name Directory:
  – English Wikipedia Titles
  – 6 Million Titles, 2 Million Unique Words
• Query Languages:
  – Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
  – 1000 multi-word names in each language
• Baseline:
  – State-of-the-art Machine Transliteration (NEWS 2009)
Experimental Results
[Figure: MRR scale from 0 (very bad) to 1 (perfect), with markers for the competitor and GEOM-SEARCH.]

Algorithm      Russian  Kannada  Tamil  Hindi
TRANS-SEARCH   0.47     0.52     0.29   0.49
GEOM-SEARCH    0.56     0.69     0.49   0.69
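For readers unfamiliar with the metric, Mean Reciprocal Rank can be computed as in this minimal sketch. It assumes each query has exactly one correct directory entry and that a query whose correct entry is not retrieved contributes 0; the function name and the sample ranks are illustrative and not taken from the experiments.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over queries.

    ranks: for each query, the 1-based rank of the correct name in the
    result list, or None if the correct name was not retrieved.
    """
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# Four hypothetical queries: correct name at rank 1, rank 2, missing, rank 4.
print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

An MRR of 1 means the correct name is always ranked first, and 0 means it is never retrieved.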
Conclusions
• Pros
  – Data driven: easy to include new languages.
  – Not training data hungry: a few thousand parallel names suffice.
  – Bridge languages are useful: the feature space for (P,Q) can be learnt using only data in (P,R) and (Q,R) (Khapra and Udupa, AAAI 2010)
  – Fast search: ~1 s for a 6 M names directory
  – Applications:
    • Cross-Language Wikipedia Search
    • Spelling Correction of Personal Names
Thank you!