Cross-Language Name Search Raghavendra Udupa Microsoft Research India Mitesh Khapra IIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User Experience of Wikipedia Using Cross- Language Name Search


Cross-Language Name Search

Raghavendra Udupa, Microsoft Research India

Mitesh Khapra, IIT Bombay

NAACL-HLT 2010, June 3, 2010

Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search


Name Search

• Searching people directories by name.

Examples: Facebook Friend Search, Outlook Address Book Search


Cross-Language Name Search

• Searching people directories by name across languages.

Examples: query in Russian, query in Hebrew


Challenges

• Script and phonetic differences

• Large Directories
  – Millions of names

• Multi-word Names and Partial Matches

• Spelling Variations


Naive Approach

• Transliterate and Search
  – Rashid רשיד
• Limitations
  – Slow, as it involves the intermediate step of transliteration generation.
  – Machine transliteration is not perfect.
    • Transliteration errors affect search results.
• Is transliteration generation necessary?


Our Approach

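[Figure: names (רשיד, אנטוני, Rashid) are mapped to a language-independent geometric representation, in which similarity is measured by the cosine of the angle between vectors. Values shown: cos θ ≈ 1 and cos θ ≈ −1.]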
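The similarity measure named on this slide can be sketched in a few lines. This is only an illustration: the three-dimensional vectors below are made up, and the assignment of names to vectors is hypothetical (the slides do not give actual coordinates).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only (not taken from the slides).
rashid_en = [0.8, 0.1, -0.3]     # hypothetical vector for "Rashid"
rashid_he = [0.79, 0.12, -0.31]  # hypothetical vector for the Hebrew spelling
other_he = [-0.8, -0.1, 0.3]     # hypothetical vector for an unrelated name

print(cosine_similarity(rashid_en, rashid_he))  # close to 1
print(cosine_similarity(rashid_en, other_he))   # close to -1
```

Vectors pointing in nearly the same direction score near 1, and vectors pointing in opposite directions score near −1, which is the contrast the figure draws.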


Search Overview

Names: Aaron, Bharat, Cecile, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar

Query: רשיד

Geometric Distance

Geometric Nearest Neighbor Search
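A minimal sketch of the search step, assuming every directory name and the query have already been mapped to vectors. It uses an exact brute-force scan rather than an approximate nearest-neighbor index, and the two-dimensional vectors and the function name `nearest_names` are invented for illustration.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_names(query_vec, directory, k=3):
    """Return the k directory names most similar to query_vec (exact scan)."""
    scored = ((cosine(query_vec, vec), name) for name, vec in directory.items())
    return [name for _, name in heapq.nlargest(k, scored)]

# Made-up 2-dimensional vectors; a real system would use learned projections.
directory = {
    "Aaron": [0.9, 0.1],
    "Rashmi": [0.3, 0.8],
    "Rashid": [0.2, 0.9],
    "Kumar": [-0.7, 0.4],
}
query = [0.21, 0.88]  # hypothetical vector for the Hebrew query

print(nearest_names(query, directory, k=2))
```

A linear scan like this costs time proportional to the directory size per query, so for millions of names an approximate index is the more practical choice.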


What is the advantage?

• Can scale to reasonably large name directories
  – Compact geometric representation
    • 50-dimensional space
    • 6 M names
• Search is effective and efficient
  – Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
  – ~1 s per query for searching 6 M names
• >20% improvement in MRR over Transliterate-and-Search


What is the challenge?

• Language/Script-Independent Representation
  – Learning a common geometric feature space from training data
• Multi-Word Names and Partial Matches
  – Maximum Weighted Bipartite Matching


Previous Work

• Language-Independent Representation
  – (2004) Canonical Correlation Analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation 2004.
• Transliteration Equivalence
  – (2006) Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL 2006.
  – (2009) Learning better transliterations. J. Pasternack and D. Roth, CIKM 2009.
  – (2010) Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR 2010.


Common Feature Space

[Figure: training data of parallel names. The English names (Aaron, Bharat, Rick, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar) and their Kannada counterparts are mapped to similar vectors in a common feature space.]


Feature Vectors

Rashid:
Feature   ^R  Ra  as  sh  hi  id  d$  ic  …
Value      1   1   1   1   1   1   1   0  …

ರಶೀದ್:
Feature   ^ರ  ರಶ  ಶೀ  ೀದ  ದ್  ್$  ಆಲ  …
Value      1   1   1   1   1   1   0  …
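The feature vectors on this slide look like binary indicators over character bigrams, with `^` and `$` marking the start and end of the name. A short sketch under that reading follows; the function names and the tiny vocabulary are illustrative, not taken from the authors' system.

```python
def char_bigrams(name):
    """Character bigrams of a name, with ^ and $ as boundary markers."""
    padded = "^" + name + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def binary_vector(name, vocabulary):
    """1 where the vocabulary bigram occurs in the name, 0 otherwise."""
    present = set(char_bigrams(name))
    return [1 if feature in present else 0 for feature in vocabulary]

# Tiny illustrative vocabulary; a real one would cover all training bigrams.
vocab = ["^R", "Ra", "as", "sh", "hi", "id", "d$", "ic"]

print(char_bigrams("Rashid"))          # ['^R', 'Ra', 'as', 'sh', 'hi', 'id', 'd$']
print(binary_vector("Rashid", vocab))  # [1, 1, 1, 1, 1, 1, 1, 0]
```

Each language gets its own bigram vocabulary, so the English and Kannada vectors live in different spaces until a common feature space is learned.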


Learning Common Feature Space

Canonical Correlation Analysis


Canonical Correlation Analysis

[Figure: four parallel name pairs, shown as numbered points 1 to 4 for each language.]

(1) Aaron / אהרן
(2) Bernard / ברנאר
(3) David / דוד
(4) William / ויליאם


Learning Common Feature Space

Canonical Correlation Analysis (Hotelling, 1936)
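For reference, the textbook CCA objective can be written as below. This comes from the general CCA literature rather than from the slides, and the symbols are assumptions introduced here: $x_i$ and $y_i$ are the feature vectors of the $i$-th parallel name pair in the two languages, $a$ and $b$ are projection directions, and $C_{xx}$, $C_{yy}$, $C_{xy}$ are the within-language and cross-language covariance matrices.

```latex
\rho = \max_{a,\, b} \;
\frac{a^{\top} C_{xy}\, b}
     {\sqrt{a^{\top} C_{xx}\, a}\; \sqrt{b^{\top} C_{yy}\, b}},
\qquad
C_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})^{\top}
```

Taking the top $d$ pairs of directions gives $d$-dimensional projections in which parallel names are maximally correlated.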


Multi-Word Names

קלי מליסה

Melissa Jane Kelly

Word-pair similarities shown: 0.97, 0.91

Score = Maximum Weighted Matching / (m – n + 1)
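The score above can be sketched as follows, assuming the word-to-word similarities are already available as a matrix. Several things here are assumptions rather than the authors' method: the matching is found by brute force (fine only for names of a few words), negative similarities are clamped to zero so a word can stay effectively unmatched, and the denominator uses |m − n| so that it is positive whichever name is longer. Only the weights 0.97 and 0.91 come from the slide. Which word pair carries which weight, and all other matrix entries, are made up.

```python
from itertools import permutations

def max_weight_matching(sim):
    """Best total weight of a one-to-one word matching (brute force)."""
    if not sim or not sim[0]:
        return 0.0
    if len(sim) > len(sim[0]):
        # Transpose so the shorter name indexes the rows.
        sim = [list(col) for col in zip(*sim)]
    rows, cols = len(sim), len(sim[0])
    best = 0.0
    for assignment in permutations(range(cols), rows):
        # Clamp negatives: a negative edge never helps a matching.
        total = sum(max(sim[i][j], 0.0) for i, j in enumerate(assignment))
        best = max(best, total)
    return best

def multiword_score(sim):
    """Matching weight divided by (|m - n| + 1), for an m x n matrix."""
    if not sim or not sim[0]:
        return 0.0
    m, n = len(sim), len(sim[0])
    return max_weight_matching(sim) / (abs(m - n) + 1)

# Rows: the two Hebrew query words; columns: Melissa, Jane, Kelly.
# Only 0.97 and 0.91 appear on the slide; the rest is invented.
sim = [
    [0.10, 0.05, 0.91],
    [0.97, 0.20, 0.15],
]
print(multiword_score(sim))  # (0.91 + 0.97) / (|2 - 3| + 1)
```

The length-difference term in the denominator penalizes partial matches, so a two-word query matching a three-word name scores lower than an exact two-word match with the same word similarities.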


Experimental Setup

• Name Directory:
  – English Wikipedia titles
  – 6 million titles, 2 million unique words
• Query Languages:
  – Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
  – 1000 multi-word names in each language
• Baseline:
  – State-of-the-art Machine Transliteration (NEWS 2009)


Experimental Results

MRR scale: 0 (very bad) to 1 (perfect)

Systems compared: competitor (TRANS-SEARCH) and GEOM-SEARCH

Algorithm      Russian  Kannada  Tamil  Hindi
TRANS-SEARCH   0.47     0.52     0.29   0.49
GEOM-SEARCH    0.56     0.69     0.49   0.69
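For readers unfamiliar with the metric, mean reciprocal rank (MRR) averages 1/rank of the correct answer over all queries. The sketch below uses the standard definition and assumes that a query whose correct name is not retrieved contributes 0; the sample ranks are invented.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over queries; None means the correct name was not found."""
    if not ranks:
        return 0.0
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# Invented ranks of the correct name for four queries (1-based).
example_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(example_ranks))  # (1 + 1/2 + 0 + 1/4) / 4 = 0.4375
```

An MRR of 1 means the correct name is always ranked first, and values near 0 mean it is rarely retrieved or ranked very low.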


Conclusions

• Pros
  – Data driven: easy to include new languages.
  – Not training-data hungry: a few thousand parallel names suffice.
  – Bridge languages are useful: the feature space for (P,Q) can be learnt using only data in (P,R) and (Q,R) (Khapra and Udupa, AAAI 2010).
  – Fast search: ~1 s for a 6 M names directory.
  – Applications:
    • Cross-Language Wikipedia Search
    • Spelling Correction of Personal Names


Thank you!