
LTI @ Carnegie Mellon University

Multilingual and Multimodal word representation

Kazuya Kawakami


Contents

2

Multilingual Word Representation

Multimodal Word Representation


Contents

3

Multilingual Word Representation

Multimodal Word Representation


Word Representation

• Word representations encode word semantics in a vector space.
- They capture syntactic and semantic similarity as geometric locality.
- Analogy: King - Man + Woman ≈ Queen

4

http://www.socher.org/ Mikolov et al. 2013

[Figures: embedding visualizations for astronomy terms and for color terms]
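To make the analogy property concrete, here is a tiny self-contained sketch; the 3-dimensional vectors are invented for illustration and are not from any trained embedding.

```python
# Toy illustration of King - Man + Woman ≈ Queen via cosine similarity;
# the vectors below are made up for illustration only.
import numpy as np

emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.7, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # -> "queen" with these toy vectors
```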


Word Representation

• Word representations are useful for downstream tasks.
- Use pre-trained word embeddings to initialize parameters.
- This works well for various types of tasks.

5

POS Tagging (Collobert et al.), Parsing (Dyer et al.)

Machine Translation


Constructing Word Representation

• Distributed word representations are learned by:
- collecting co-occurrence statistics.
- predicting the surrounding context.

6

[Diagram: a lookup table of type-level embeddings, one row per vocabulary item (e.g. dog, cat, apple, orange, work, study, shopping, you, hard, </s>, ...)]
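As a concrete example of the second strategy (predicting the surrounding context), here is a minimal skip-gram training sketch; it assumes the gensim library is available, and the three-sentence corpus is a toy stand-in for real text.

```python
# Learn a lookup table of type-level embeddings with skip-gram (gensim).
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["i", "work", "hard", "and", "study"],
    ["she", "ate", "an", "apple", "and", "an", "orange"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)   # sg=1 selects skip-gram

vec = model.wv["dog"]                  # one row of the type-level lookup table
print(model.wv.most_similar("dog"))    # nearest neighbours in the learned space
```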


Improving Word Representation

• Multilingual shared representations (Faruqui et al., 2014; Gouws et al., 2014)
- A good representation captures both the monolingual profile and the translation profile.

7

[Diagram: EN monolingual space and FR monolingual space mapped into a shared space]


Improving Word Representation

• Multilingual shared representation with CCA (Faruqui et al., 2014).
- Post-processing with alignment data.

8
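A rough sketch of the CCA idea, assuming scikit-learn; the embedding matrices and the aligned translation dictionary below are random placeholders rather than real data.

```python
# Project EN and FR embeddings of aligned translation pairs into a shared
# space with CCA, in the spirit of the post-processing step above.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_en = rng.normal(size=(1000, 300))   # EN vectors of dictionary entries (placeholder)
Y_fr = rng.normal(size=(1000, 300))   # FR vectors of their translations (placeholder)

cca = CCA(n_components=100, max_iter=500)
cca.fit(X_en, Y_fr)

# Any EN / FR vector can now be mapped into the shared space.
en_shared, fr_shared = cca.transform(X_en, Y_fr)
```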


Constructing Bilingual Representation

• Multilingual shared representation with joint modeling (Gouws et al., 2014)
- Joint modeling of the monolingual and bilingual objectives without word alignments.

9


Problem of Type Level Word Representation

• Type-level embeddings are not context sensitive.

10

I went to the bank to deposit my paycheck.

I went to the river bank to eat some lunch.


Token Level Representation

• To build a context-sensitive model, we need a function of the word and its context.
- What kind of supervisory signal is needed to tune it?

11

[Diagram: token-level representation = semantic mask (a function of the context) × type-level word representation, illustrated with "green plant"]


Multilingual Information

• Translations disambiguate senses.

12

I went to the bank to deposit my paycheck.

I went to the river bank to eat some lunch.

Je suis allée à la banque pour déposer mon chèque de paie.

Je suis allé sur la rive pour le déjeuner.


Token Level Representation

• To build a context-sensitive model, we need a function of the word and its context.
- What kind of supervisory signal is needed to tune it?

13

[Diagram: the same masking scheme, with the French translation of the word in context ("plante") as the supervisory signal]


• Assuming that multiple senses are encoded in a single vector:
- We need to learn to mask out or scale each element.
- The mask vector needs to be sensitive to the context.

What kind of function is needed?

14

[Diagram: token-level representation = semantic mask (a function of the context word) × type-level word representation]
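One way to read the diagram: the token-level vector is the type-level vector scaled element-wise by a context-dependent sigmoid mask. A minimal sketch of that idea follows; the weights and vectors are random placeholders, not the model presented later.

```python
# Token-level representation = sigmoid(f(context)) ⊙ type-level embedding.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 300
rng = np.random.default_rng(0)
W_ctx = rng.normal(scale=0.1, size=(d, d))   # maps a context vector to mask logits

type_vec = rng.normal(size=d)       # e.g. the type-level vector for "plant"
context_vec = rng.normal(size=d)    # e.g. an encoding of "green ... grows"

mask = sigmoid(W_ctx @ context_vec)  # values in (0, 1): keep or suppress dimensions
token_vec = mask * type_vec          # context-sensitive token representation
```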


Model

• The input gate of the LSTM serves to control the input, possibly masking out unnecessary elements.

15

(Excerpt from the ICLR 2016 workshop paper)

This is done by performing a softmax over the target vocabulary with the representation of the word in context, $h_t$, as defined in the previous section. That is, we compute

$$u = R h_t + b'$$

$$p(f \mid e_t, c) = \frac{\exp(u_f)}{\sum_{f' \in F} \exp(u_{f'})},$$

where the parameters $R$ and $b'$ define the projection of the source word-with-context representation $h_t$ onto the target vocabulary $F$.

To obtain pairs of words in context and their lexical translations into a second language, we use unsupervised word alignment techniques (Dyer et al., 2013) to obtain high-precision word alignments from a parallel corpus. While modeling alignments as latent variables, or using a soft attention mechanism, would be a reasonable alternative, word alignment is fast and allows the proposed training objective to be easily scaled to large corpora.

Figure 2 illustrates the pre-training architecture.

[Figure 2: Description of the cross-lingual pre-training model. A word in context (e.g. "<s> The plant grows </s>") is encoded into $h_t$, which is projected onto its French lexical translations (e.g. "plante" vs. "entreprise").]

3.2 PARAMETER LEARNING

The model parameters $W$ and $b$, as well as the word projection parameters $V_e$, are first pre-trained with the objective function:

$$\mathcal{L} = -\sum_{(f,e)} \log p(f \mid e, c)$$

That is, we wish to find the parameters that maximize the lexical translation log probability over the whole parallel corpus of lexical translations ($f$) of a source word ($e$) in context ($c$).

When we want to transfer the model to another supervised task to predict a label $s \in S$ for a word $e$ in context $c$, the final values of the $W$ and $b$ parameters are transferred to formulate a similar model that predicts the label $s$. Using the transformation matrix $S \in \mathbb{R}^{|S| \times d_h}$ and the biases $b'' \in \mathbb{R}^{|S|}$, we may define the label probability as

$$u' = S h_t + b''$$

$$p(s \mid e_t, c) = \frac{\exp(u'_s)}{\sum_{s' \in S} \exp(u'_{s'})}.$$

The model is trained by maximizing the log likelihood of the observed labels in the task:

$$\mathcal{L}' = -\sum_{(s,e)} \log p(s \mid e, c) \qquad (1)$$
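To make the objective concrete, here is a toy NumPy sketch of a single term of the pre-training loss; the dimensions, parameter values, and the index of the aligned French word are illustrative assumptions.

```python
# One term of L = -Σ log p(f | e, c): project the word-in-context vector h_t
# onto the FR vocabulary and take the negative log-probability of the aligned
# translation.
import numpy as np

def log_softmax(u):
    u = u - u.max()
    return u - np.log(np.exp(u).sum())

d_h, vocab_F = 300, 50000                 # hidden size, FR vocabulary size (illustrative)
rng = np.random.default_rng(0)
R = rng.normal(scale=0.01, size=(vocab_F, d_h))
b = np.zeros(vocab_F)

h_t = rng.normal(size=d_h)                # word-in-context representation from the encoder
gold_f = 123                              # index of the aligned FR word, e.g. "plante"

u = R @ h_t + b
nll = -log_softmax(u)[gold_f]             # this pair's contribution to L
```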


$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \odot \tanh(c_t)$$

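For reference, a NumPy sketch of the LSTM cell written above, with the peephole terms $W_{ci}$, $W_{cf}$, $W_{co}$ treated as diagonal (stored as vectors), which is a common simplification; all sizes and initial values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the peephole LSTM; returns (h_t, c_t)."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    g_t = np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    c_t = f_t * c_prev + i_t * g_t
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_x = d_h = 300  # illustrative sizes
rng = np.random.default_rng(0)
shapes = {"Wxi": (d_h, d_x), "Whi": (d_h, d_h), "Wci": (d_h,), "bi": (d_h,),
          "Wxf": (d_h, d_x), "Whf": (d_h, d_h), "Wcf": (d_h,), "bf": (d_h,),
          "Wxc": (d_h, d_x), "Whc": (d_h, d_h), "bc": (d_h,),
          "Wxo": (d_h, d_x), "Who": (d_h, d_h), "Wco": (d_h,), "bo": (d_h,)}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, p)
```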


Multilingual Pre-Training

• We pre-trained our model on EN-FR / DE / CS / FI parallel corpora.

• The pre-trained model achieved the following dev perplexities.

16

Dev Perplexity

FR 3.80

DE 6.59

CS 6.30

FI 19.25


Evaluation & Result

• We evaluated our model on various types of tasks.

17

Supersense Tagging: model is fine-tuned with large supervised data.
Low Resource Machine Translation: transfer learning scenario; model is fine-tuned with small supervised data.
Lexical Substitution / Word Sense Disambiguation: unsupervised scenario; model is just trained on the parallel corpus.


Experiment & Result

• We evaluated our model on various types of tasks.

18

Supersense Tagging: model is fine-tuned with large supervised data.
Low Resource Machine Translation: transfer learning scenario; model is fine-tuned with small supervised data.
Lexical Substitution / Word Sense Disambiguation: unsupervised scenario; model is just trained on the parallel corpus.


Supersense

• A supersense is a generalization of NE tags to common nouns.
- 26 noun supersenses and 15 verb supersenses.

19


• A sequence labeling task that requires disambiguating senses.

• The SemCor and SemEval-3 datasets were used. SemCor: 250,000 words of text from the Brown Corpus and a novel.

Supersense Tagging

20

[Example supersense labels: artifact, creation]


Result

• MLP and LSTM achieved state-of-the-art results.
- Cross-lingual pre-training provides further improvements across all languages.

21


Experiment & Result

• We evaluated our model on various types of tasks.

22

Supersense Tagging: model is fine-tuned with large supervised data.
Low Resource Machine Translation: transfer learning scenario; model is fine-tuned with small supervised data.
Lexical Substitution / Word Sense Disambiguation: unsupervised scenario; model is just trained on the parallel corpus.


Experiment

• Machine translation for low-resource languages.
- How transferable are pre-trained features to other languages?

• Malagasy and Urdu translation datasets.

• Fine-tune the network with small supervision.

• Train a baseline MT model with cdec.
- Add translation probability and log translation probability as features.
- Optimize the parameters of the MT model with MIRA.

23

Malagasy / Urdu


Result

• The four additional features show marginal improvements across all language pairs.
- The scores are averages of 10 runs, to account for randomness.

24


Experiment & Result

• We evaluated our model on various types of tasks.

25

Supersense Tagging: model is fine-tuned with large supervised data.
Low Resource Machine Translation: transfer learning scenario; model is fine-tuned with small supervised data.
Lexical Substitution / Word Sense Disambiguation: unsupervised scenario; model is just trained on the parallel corpus.


Lexical substitution

• Identifying meaning-preserving substitutes for a target word in a given context.
- Purely evaluates the quality of token representations in an unsupervised setting.

• SemEval-2007 Shared Task, Task 10.
- The most widely used benchmark for the evaluation of lexical substitution.
- 10 sentences extracted from a web corpus for each of 201 target words.

26

This will help the younger generation to know and understand better its neighbors living just on the other ||| side ||| of the Mediterranean .

Gattlinburg , Tennessee is a touristy town to stay in on the north ||| side ||| of the park .

Gold substitutes (annotation counts): boundary (2), fringe (1), position (1), edge (1)

Gold substitutes (annotation counts): bank (1), flank (1), divide (1), shore (1), edge (1)


Baselines

• Skip-gram word vectors trained only on the EN side of the EN-FR parallel corpus.
- 50 million tokens.

• Dependency-based word embeddings.
- word2vec with dependency contexts (Levy et al., 2014).
- Trained on a 2-billion-token web corpus.
- Various context-sensitive models have been proposed (Levy et al., 2015).

27


Result

• Our model outperformed the baseline model and the model trained on the larger dataset.
- Improved the precision score by 2 points for the best and best-mode evaluations.
- The improvement is consistent across all languages.

28


Evaluation & Result

• We evaluated our model on various types of tasks.

29

Supersense Tagging: model is fine-tuned with large supervised data.
Low Resource Machine Translation: transfer learning scenario; model is fine-tuned with small supervised data.
Lexical Substitution / Word Sense Disambiguation: unsupervised scenario; model is just trained on the parallel corpus.


Analysis

30

They built a large plant to manufacture automobiles. → usine, installation, plante, centrale

Let's plant flowers in the garden. → plantes, planter, végétal, cultiver


Analysis

• Effect of cross-lingual pre-training.
- Pre-training provides a good initialization for both tasks.

31

[Plots: Supersense Tagging / Low-Resource MT (MG)]


Contents

32

Multilingual Word Representation

Multimodal Word Representation


Multimodal Word Representation

• Multimodal word representations have been investigated intensively.
- Image caption generation with attention (Xu et al., 2015).
- Image generation from captions (Mansimov et al., 2016).

33

A group of people sitting on a boat in the water.


Multimodal Word Representation

• Multimodal word representations have been investigated intensively.
- Image caption generation with attention (Xu et al., 2015).
- Image generation from captions (Mansimov et al., 2016).

34


Color

• We focused on a specific type of visual modality: color.

35


Word-Color Association

• Word-Color Association has been investigated in cognitive science and psychology.

• Stroop effect (Stroop, 1935): "Which is red?"

36


Word-Color Association

• Word-Color Association has been investigated in cognitive science and psychology.

• Stroop effect (Stroop, 1935): "Which is red?"

37

Red Blue


Word-Color Association

• Word-Color Association has been investigated in cognitive science and psychology.

• Stroop effect (Stroop, 1935): "Which is red?"

• Brain information processing (del Prado Martin et al., 2006; Simmons et al., 2007)
- Usually the brain uses different regions to perceive different modalities.
- Visual cortex: vision.
- Broca's area: language.
- Color words activate the same brain regions as when color is perceived.

• Word-color associations are related to polarity (Mohammad, 2014).

38

Red Blue


Word-Color Association

39

color = F (word)


Dataset

• We collected 776k samples from the design website COLORLovers.
- Users give names to RGB values.
- The dataset contains 776,364 pairs with 581,483 unique names.

40


Color Regression Model

• Character Level LSTM

• Embedding Dimension: 300d, Hidden Dimension: 300d.

• Loss Function

41

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \odot \tanh(c_t)$$

$$h = [\overrightarrow{h_t}; \overrightarrow{c_t}]$$

$$\hat{y} = \sigma(W h + b), \quad W \in \mathbb{R}^{3 \times 600},\; b \in \mathbb{R}^{3}$$

$$L = \lVert y - \hat{y} \rVert^2$$
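A compact PyTorch sketch of the character-level color regression model above: 300-d character embeddings, a 300-d LSTM, the concatenation of final hidden and cell states fed through $W \in \mathbb{R}^{3 \times 600}$ with a sigmoid, and a squared-error loss. The layer sizes follow the slide; the character vocabulary size, batch, and targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ColorRegressor(nn.Module):
    def __init__(self, n_chars, emb_dim=300, hid_dim=300):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, 3)          # W in R^{3x600}: [h_T; c_T] -> color

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.embed(char_ids)
        _, (h_n, c_n) = self.lstm(x)                  # final hidden and cell states
        h = torch.cat([h_n[-1], c_n[-1]], dim=-1)
        return torch.sigmoid(self.out(h))             # predicted color, scaled to (0, 1)

model = ColorRegressor(n_chars=128)                   # 128 = toy character vocabulary
loss_fn = nn.MSELoss()                                # L = ||y - y_hat||^2
pred = model(torch.randint(0, 128, (4, 12)))          # a toy batch of 4 color names
loss = loss_fn(pred, torch.rand(4, 3))                # toy targets in place of Lab values
```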


Color Space

• Lab space is suitable for gradient-based learning.

42

RGB Space:
- Discrete space
- Red, Green, Blue (255, 255, 255)
- Addition and subtraction are valid

Lab Space:
- Continuous space
- L, a, b
- Euclidean distance tracks human perception

L represents lightness, a the position between red/magenta and green, and b the position between yellow and blue.
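A small sketch of why Lab suits learning with a Euclidean loss: nearby colors in Lab are also perceptually similar. It assumes scikit-image for the RGB-to-Lab conversion; the example colors are arbitrary.

```python
import numpy as np
from skimage.color import rgb2lab

def to_lab(rgb255):
    """Convert an (R, G, B) triple in 0-255 to a Lab triple."""
    rgb = np.asarray(rgb255, dtype=float).reshape(1, 1, 3) / 255.0
    return rgb2lab(rgb).reshape(3)

red, dark_red, green = to_lab((255, 0, 0)), to_lab((200, 0, 0)), to_lab((0, 255, 0))
# Similar colors end up close in Lab, dissimilar ones far apart.
print(np.linalg.norm(red - dark_red), np.linalg.norm(red - green))
```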


Dataset

• We collected 776k samples from the design website COLORLovers.
- Users give names to RGB values.
- The dataset contains 776,364 pairs with 581,483 unique names.

• The evaluation color names (with no overlap with the training data) come from:
- The R language: 66 unique name-color pairs.
- The paint company Sherwin-Williams: 956 unique name-color pairs.

43


Result

• LSTM with memory, BiLSTM, and deep LSTM performed equally well.
- Significant improvement over the unigram and bigram models.

44

MSE on each dataset.


Color Turing Test

• Human evaluation.
- Which color is better described by the term "red"?
- 100 Turkers who are not color-blind.

• Humans selected the model's prediction for the R and Paint datasets (the less noisy data).

45

[Figures: example questionnaire item; example color-blindness test]


Analysis: Character

• Character-by-character predictions show how our model reads words.

46


DEMO

47

http://colorlab.us


Analysis: Learned Representation

• 2D PCA plot of the learned representations.

48


• Not all words have an association with color.
- What types of words have a strong association with color?

• We colorized all words in WordNet (117,659 unique words).
- The strength of color association is defined as the distance from the mean color, beige.

Analysis: Word Color Association

49

[Plot: mean distance from the mean prediction for each POS and for noun/verb supersenses]


Analysis: Text Analysis

• What type of text is colorful?
- Wikipedia: 2,769 randomly selected Wikipedia articles.
- News: 6,000 articles from Reuters News.
- Poems: 3,300 English poems from 1800 to today.
- Beer reviews: 6,000 reviews from beeradvocate.com.
- Recipes: 256 recipes that list ingredients and steps.

• Colorfulness is measured as the mean distance from RGB (128, 128, 128).
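A minimal sketch of that score: average, over the words of a text, the distance between each word's predicted color and neutral gray. `predict_color` is a hypothetical stand-in for the trained character-level model.

```python
import numpy as np

GRAY = np.array([128.0, 128.0, 128.0])

def colorfulness(text, predict_color):
    """predict_color(word) -> np.array([R, G, B]) in 0-255 (hypothetical model)."""
    words = text.lower().split()
    dists = [np.linalg.norm(predict_color(w) - GRAY) for w in words]
    return float(np.mean(dists)) if dists else 0.0

# e.g. colorfulness("ruby grapefruit ale with a deep copper hue", model_predict)
```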

50


Analysis: Vision

51

• What happens if we use the color of a word to find objects in images?
- For noun chunks, we first predict the color of the word.
- Take images whose captions contain the noun phrase.
- Compute the pixel-wise distance from the color of the word.
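A rough sketch of that pixel-wise lookup: given a predicted color for a noun phrase, compute each pixel's distance to it, producing a heatmap whose low values mark candidate regions. It assumes scikit-image, and `predict_color` is again a hypothetical stand-in for the trained model.

```python
import numpy as np
from skimage.color import rgb2lab

def color_heatmap(image_rgb01, phrase_color_lab):
    """image_rgb01: HxWx3 float image in [0, 1]; returns an HxW distance map."""
    lab = rgb2lab(image_rgb01)
    return np.linalg.norm(lab - np.asarray(phrase_color_lab), axis=-1)

# e.g. dist = color_heatmap(img, predict_color("a yellow bag"))
# Pixels with small distances are candidates for the object named by the phrase.
```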


Analysis: Vision

52

Oranges


Analysis: Vision

53

Bananas


Analysis: Vision

54

A yellow bag


Analysis: Vision

55

The grass


Acknowledgement

56

Chris Dyer, Noah Smith, Bryan Routledge


Thank you very much!