
Page 1:

Semantic Relatedness for All (Languages): A Comparative Analysis of Multilingual Semantic Relatedness using Machine Translation

Andre Freitas, Siamak Barzegar, Juliano Efson Sales, Siegfried Handschuh and Brian Davis

21st November 2016 @EKAW

Page 2:

Definition of DSMs (Distributional Semantic Models):

- Distributional semantics is built upon the assumption that the context surrounding a given word in a text provides important information about its meaning (see the sketch after this list).

- Applications range from question answering systems to semantic search and text entailment.
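To make the assumption concrete, the sketch below scores relatedness as the cosine between word vectors; the vectors are made-up placeholders, not values from any trained model.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: close to 1.0 for strongly related words, near 0 for unrelated ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors standing in for embeddings learned from a corpus.
vectors = {
    "car":     np.array([0.9, 0.1, 0.0, 0.2]),
    "vehicle": np.array([0.8, 0.2, 0.1, 0.3]),
    "banana":  np.array([0.0, 0.9, 0.7, 0.1]),
}

print(cosine(vectors["car"], vectors["vehicle"]))  # high score: words seen in similar contexts
print(cosine(vectors["car"], vectors["banana"]))   # low score: words seen in different contexts
```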


Page 3:

Motivation:

● Distributional semantic models are strongly dependent on the size and the quality of the reference corpora.

● High-quality texts containing large-scale commonsense information, such as Wikipedia, are mainly available in English.


Page 4:

Goal:

Multilingual Distributional Semantics Aspect:

- Finding a solution which maximizes suitability for multilingual scenarios.
- How to transport DSMs to other languages with a lower volume of corpus data.


Page 5:

Research Questions

How do different distributional semantic models perform:
- In different languages
- With corpora of different sizes
- In computing semantic similarity and relatedness tasks

What is the role of the machine translation approach:
- To support the construction of better distributional vectors
- For computing semantic similarity and relatedness measures

Page 6:

Proposed Approach

The localised datasets for each language underwent linguistic quality assurance by a well-known localisation company.

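Later slides describe the approach as translating word pairs into English and scoring them with an English distributional model. The sketch below illustrates that pipeline with hypothetical stand-ins: `translate_to_english` and the toy vector table are placeholders, not the actual translation service or trained DSM used in the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical English word vectors; in the paper this role is played by a DSM
# (e.g. W2V) trained on the English Wikipedia.
english_vectors = {
    "car":        np.array([0.9, 0.1, 0.2]),
    "automobile": np.array([0.8, 0.2, 0.3]),
}

# Hypothetical stand-in for a translation service (Google Translate or Bing in the paper).
def translate_to_english(word, source_lang):
    toy_dictionary = {("de", "wagen"): "car", ("de", "automobil"): "automobile"}
    return toy_dictionary[(source_lang, word)]

def mt_relatedness(w1, w2, source_lang):
    # Translate both words into English, then score the pair in the English model.
    e1 = translate_to_english(w1, source_lang)
    e2 = translate_to_english(w2, source_lang)
    return cosine(english_vectors[e1], english_vectors[e2])

print(mt_relatedness("wagen", "automobil", "de"))
```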

Page 7:

Research Methodology

● The experimental set-up consists of the instantiation of four distributional semantic models: Explicit Semantic Analysis (ESA), Latent Semantic Analysis (LSA), Word2Vec (W2V) and Global Vectors (GloVe).

● In 11 different languages - English, German, French, Italian, Spanish, Portuguese, Dutch, Russian, Swedish, Arabic and Farsi - using Wikipedia (January 2015) as a corpus.

● For the experiment, the vector dimensions for LSA, W2V and GloVe were set to 300.

● Each distributional model was evaluated on the task of computing semantic similarity and relatedness measures for each word pair using three human-annotated gold standards: Miller & Charles (MC), Rubenstein & Goodenough (RG) and WordSimilarity-353 (WS-353); a sketch of this evaluation step follows this list.

● Two automatic machine translation approaches were evaluated: the Google Translate Service and the Microsoft Bing Translation Service.
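A minimal sketch of the evaluation step, assuming a list of word pairs with human ratings and some model scoring function; the entries and scores below are illustrative placeholders in the style of MC, not the actual MC/RG/WS-353 data.

```python
from scipy.stats import spearmanr

# Illustrative gold-standard entries (word1, word2, human rating).
gold = [
    ("car", "automobile", 3.92),
    ("journey", "voyage", 3.84),
    ("noon", "string", 0.08),
]

# Hypothetical model scorer; in the experiment this would be ESA, LSA, W2V or GloVe.
def model_score(w1, w2):
    toy_scores = {
        ("car", "automobile"): 0.85,
        ("journey", "voyage"): 0.78,
        ("noon", "string"): 0.05,
    }
    return toy_scores[(w1, w2)]

human_ratings = [rating for _, _, rating in gold]
model_ratings = [model_score(w1, w2) for w1, w2, _ in gold]

# Spearman rank correlation between the model scores and the human judgements.
rho, _ = spearmanr(human_ratings, model_ratings)
print(rho)
```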


Page 8:

Evaluation

● Question 1: Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

● Question 2: Which DSMs or MT-DSMs work best for the set of analyzed languages?


[Results tables: Language-specific Models vs. Machine Translation Models]

Page 9:

Evaluation:

● Question 1: Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?
Machine translation to English consistently performs better for all languages, with the exception of German, which presents equivalent results for the language-specific models. The MT approach provides an average improvement of 16.7%.

● Question 2: Which DSMs or MT-DSMs work best for the set of analyzed languages?
MT-W2V consistently performs as the best model for all datasets and languages, with the exception of German, in which the difference between MT-W2V and language-specific W2V is not significant.


Page 10:

Evaluation (Cont.):

● Question 3: What is the quality of state-of-the-art machine translation approaches for word pairs?


Average accuracy of the machine translation approach: 59%.
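One simple way to measure that quality is exact-match accuracy against reference translations, sketched below with made-up pairs; the paper's actual matching criterion and data may differ.

```python
# Illustrative (machine translation, reference translation) pairs for localised words;
# made-up examples, not the evaluation data from the paper.
translations = [
    ("car", "car"),
    ("travel", "journey"),
    ("midday", "noon"),
]

# Accuracy here = fraction of words whose machine translation matches the reference exactly.
correct = sum(1 for mt, ref in translations if mt == ref)
accuracy = correct / len(translations)
print(f"{accuracy:.0%}")  # 33% for this toy list
```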

Page 11:

Conclusions:

- Machine translation yields a significant improvement (an average of 16.7% in Spearman correlation).

- W2V showed consistently better results.

- The average accuracy of the machine translation approach is 59%.

- The combination of W2V (English) and machine translation is the best configuration at this point for the construction of multilingual distributional semantic models.

Page 12:

Question / Fragen / سوال

Answer / Antworten / جواب

?