Deep Semantic Representation Learning (Deep Semantic Representations)

Deep Semantic Representation Learning — Danushka Bollegala, Associate Professor, University of Liverpool, UK


TRANSCRIPT

Page 1: Deep Semantic Representation Learning (Deep Semantic Representations)

Deep Semantic Representation Learning — Danushka Bollegala

Associate Professor, University of Liverpool, UK

Page 2: Deep Semantic Representation Learning (Deep Semantic Representations)

A word does not carry meaning by itself; its meaning is determined solely by the words that appear around it.

J. R. Firth 1957

Image credit: www.odlt.org

“You shall know a word by the company it keeps”

Page 3: Deep Semantic Representation Learning (Deep Semantic Representations)

Quiz

• X can be carried around, lets you communicate with other people, lets you browse the web, and is convenient. Which of the following is X?

• Dog

• Airplane

• iPhone

• Banana

Page 4: Deep Semantic Representation Learning (Deep Semantic Representations)

But is that really true? • After all, dictionaries define the meanings of words, don't they?

• A dictionary, too, explains the meaning of a word by describing its relations to other words.

• Given a huge corpus, we can build a word's meaning representation just by collecting its surrounding words, which is good news for NLP practitioners.

• A practical approach to meaning representation.

• It has been applied successfully to many tasks, so as a meaning representation it is (quantitatively) valid.

• Does the meaning of a word depend on the task?

• For which tasks does it work well, and for which does it fail?

Page 5: Deep Semantic Representation Learning (Deep Semantic Representations)

Approaches to constructing meaning representations • Distributional semantic representations

• Distributional Semantic Representations

• A word x is represented by the distribution of its co-occurrence frequencies with all the words that appear around it in a corpus.

• High-dimensional, sparse.

• The classical approach.

• Distributed semantic representations

• Distributed Semantic Representations

• The meaning of a word x is represented as a combination/mixture of a small number (10–1000) of dimensions/distributions/clusters.

• Low-dimensional, dense.

• Recently popular thanks to the deep learning / representation learning boom.

Page 6: Deep Semantic Representation Learning (Deep Semantic Representations)

Approaches to building meaning representations (the same comparison of distributional vs. distributed semantic representations as on Page 5, shown again).

Page 7: Deep Semantic Representation Learning (Deep Semantic Representations)

Constructing a distributional semantic representation

• Build a meaning representation for the word "apple" (リンゴ).

• S1 = Apples are red.

• S2 = Apples are delicious.

• S3 = Aomori Prefecture is famous as a production area for apples.

Page 8: Deep Semantic Representation Learning (Deep Semantic Representations)

Constructing a distributional semantic representation

• Build a meaning representation for the word "apple" (リンゴ).

• S1 = Apples are red.

• S2 = Red apples are delicious.

• S3 = Aomori Prefecture is famous as a production area for apples.

apple = [(red, 2), (delicious, 1), (Aomori, 1), (production area, 1), (famous, 1)]

Page 9: Deep Semantic Representation Learning (Deep Semantic Representations)

Application: measuring semantic similarity

• We want to measure the semantic similarity between "apple" and "mandarin orange" (みかん).

• First, build a meaning representation for "mandarin orange".

• S4 = Mandarin oranges are orange-coloured.

• S5 = Mandarin oranges are delicious.

• S6 = Hyogo Prefecture is famous as a production area for mandarin oranges.

mandarin orange = [(orange-coloured, 1), (delicious, 1), (Hyogo, 1), (production area, 1), (famous, 1)]

Page 10: Deep Semantic Representation Learning (Deep Semantic Representations)

"Apple" and "mandarin orange"

apple = [(red, 2), (delicious, 1), (Aomori, 1), (production area, 1), (famous, 1)]

mandarin orange = [(orange-coloured, 1), (delicious, 1), (Hyogo, 1), (production area, 1), (famous, 1)]

Both words share context words such as "delicious", "production area", and "famous", so we can say that "apple" and "mandarin orange" are fairly similar in meaning.

For a quantitative comparison, we can measure the overlap between the two sets of context words. Jaccard coefficient = |apple AND mandarin| / |apple OR mandarin|. |apple AND mandarin| = |{delicious, production area, famous}| = 3. |apple OR mandarin| = |{red, delicious, Aomori, production area, famous, orange-coloured, Hyogo}| = 7. sim(apple, mandarin) = 3/7 ≈ 0.4286.
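Below is a minimal Python sketch of this construction and the Jaccard comparison. The tokenised sentences are my own simplification of S1–S6 to content words; they are illustrative, not part of the slides.

from collections import Counter

def context_counts(sentences, target):
    """Count how often each other word co-occurs with `target` in the same
    sentence (a toy sentence-level distributional representation)."""
    counts = Counter()
    for tokens in sentences:
        if target in tokens:
            counts.update(w for w in tokens if w != target)
    return counts

def jaccard(a, b):
    """Set-overlap similarity between two context-word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# S1-S3 and S4-S6 reduced to content words.
apple_sents = [["apple", "red"],
               ["red", "apple", "delicious"],
               ["Aomori", "apple", "production-area", "famous"]]
mandarin_sents = [["mandarin", "orange-coloured"],
                  ["mandarin", "delicious"],
                  ["Hyogo", "mandarin", "production-area", "famous"]]

apple = context_counts(apple_sents, "apple")          # Counter({'red': 2, 'delicious': 1, ...})
mandarin = context_counts(mandarin_sents, "mandarin")
print(jaccard(apple, mandarin))                       # 3/7 ≈ 0.4286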

Page 11: Deep Semantic Representation Learning (Deep Semantic Representations)

Many detailed design choices • What to choose as the context

• The whole sentence (sentence-level co-occurrences)

• The n words before and after (proximity window)

• Words in a dependency relation (dependencies)

• Weight each co-occurrence by its distance within the context (see the sketch after this list).

• The farther away a co-occurrence is, the more its weight is discounted.

• And so on.
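A small sketch of the proximity-window variant with distance-based weighting; the window size of 2 and the 1/distance weighting are illustrative assumptions, not choices prescribed by the slides.

from collections import defaultdict

def window_cooccurrences(tokens, window=2):
    """Collect weighted co-occurrence counts within +/- `window` tokens,
    down-weighting a context word by its distance from the target."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1.0 / abs(i - j)  # closer words weigh more
    return counts

counts = window_cooccurrences(["apples", "are", "red", "and", "delicious"])
print(dict(counts["apples"]))   # {'are': 1.0, 'red': 0.5}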

Page 12: Deep Semantic Representation Learning (Deep Semantic Representations)

Approaches to building meaning representations (the distributional vs. distributed comparison slide is shown once more before moving on to distributed representations).

Page 13: Deep Semantic Representation Learning (Deep Semantic Representations)

Local representations vs. distributed representations

#2 The need for distributed representations — Clustering:

• Clustering, nearest neighbours, RBF SVMs, local non-parametric density estimation & prediction, decision trees, etc.

• Parameters for each distinguishable region.

• The number of distinguishable regions is linear in the number of parameters.

#2 The need for distributed representations — Multi-Clustering:

• Factor models, PCA, RBMs, neural nets, sparse coding, deep learning, etc.

• Each parameter influences many regions, not just local neighbours.

• The number of distinguishable regions grows almost exponentially with the number of parameters.

• GENERALIZE NON-LOCALLY TO NEVER-SEEN REGIONS.

(Figure: an input space cut by three partitions C1, C2, C3.)

With a local representation, only a handful of neighbouring points are involved when deciding the label of a point.

With a distributed representation, three partitions define eight regions (a 2^n representational capacity for n partitions; a tiny numeric sketch follows below).

slide credit: Yoshua Bengio
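A tiny sketch of the 2^n argument above (my own illustration, not from the slide): n partitions assign each input a binary code, so n parameters can distinguish up to 2^n regions, whereas a purely local model needs parameters for every region.

import numpy as np

rng = np.random.default_rng(0)
n_partitions, dim = 3, 3
hyperplanes = rng.normal(size=(n_partitions, dim))   # 3 random partitions of R^3

def region_code(x):
    """Binary code of the region that point x falls into (one bit per partition)."""
    return tuple((hyperplanes @ x > 0).astype(int))

points = rng.normal(size=(5000, dim))
print(len({region_code(p) for p in points}))          # up to 2**3 = 8 distinct regions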

Page 14: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた ("I had miso soup and rice")

Page 15: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた

Morphological analysis: 私 | は | みそ汁 | と | ご飯 | を | 頂いた

Each word is assigned two d-dimensional vectors.

When a word x is the target whose representation is being learned, its vector is called the target vector v(x), shown in red.

A context word c appearing around x is represented by the context vector v(c), shown in blue.

Page 16: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた

Morphological analysis: 私 | は | みそ汁 | と | ご飯 | を | 頂いた

v(x) v(c)

For example, consider the problem of predicting whether ご飯 (rice) appears in the neighbourhood of みそ汁 (miso soup).

Page 17: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた

Morphological analysis: 私 | は | みそ汁 | と | ? | を | 頂いた

v(x) v(c)

Letting c = ご飯 (rice) and c' = ケーキ (cake), we want to learn v(x), v(c), and v(c') that reflect the "meaning" that the combination (x = みそ汁, c = ご飯) is more plausible Japanese than (x = みそ汁, c' = ケーキ).

Page 18: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた

Morphological analysis: 私 | は | みそ汁 | と | ? | を | 頂いた

v(x) v(c)

Proposal 1: define this plausibility as the inner product of the two vectors: score(x, c) = v(x)ᵀv(c).

Page 19: Deep Semantic Representation Learning (Deep Semantic Representations)

The skip-gram model

私はみそ汁とご飯を頂いた

Morphological analysis: 私 | は | みそ汁 | と | ? | を | 頂いた

v(x) v(c)

Proposal 2: however, the inner product takes values in (-∞, +∞) and is not normalised, which is inconvenient. Dividing by the scores of all context words c' turns it into a probability.

Page 20: Deep Semantic Representation Learning (Deep Semantic Representations)

Log-bilinear • log-bilinear model

The probability that c appears in the neighbourhood of x:

p(c|x) = exp(v(x)ᵀv(c)) / Σ_{c'∈V} exp(v(x)ᵀv(c'))

Numerator: how readily x and c co-occur.

Denominator: how readily x co-occurs with every word c' in the vocabulary V.

[Mnih+Hinton ICML’07]
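A minimal NumPy sketch of this scoring and normalisation. The vocabulary and the random vectors below are toy placeholders (real skip-gram vectors would be learned, and word2vec avoids the full softmax in practice):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["私", "は", "みそ汁", "と", "ご飯", "を", "頂いた", "ケーキ"]
d = 8  # embedding dimensionality

# Two vectors per word: target vectors v(x) and context vectors v(c).
target = {w: rng.normal(size=d) for w in vocab}   # v(x), the "red" vectors
context = {w: rng.normal(size=d) for w in vocab}  # v(c), the "blue" vectors

def score(x, c):
    """Proposal 1: plausibility of (x, c) as an inner product."""
    return target[x] @ context[c]

def p_context_given_target(c, x):
    """Proposal 2 / log-bilinear model: softmax over all context words."""
    scores = np.array([score(x, cp) for cp in vocab])
    probs = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return probs[vocab.index(c)]

# With trained vectors we would expect p(ご飯 | みそ汁) > p(ケーキ | みそ汁);
# here the vectors are random, so the printed numbers are only illustrative.
print(p_context_given_target("ご飯", "みそ汁"))
print(p_context_given_target("ケーキ", "みそ汁"))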

Page 21: Deep Semantic Representation Learning (Deep Semantic Representations)

What is so remarkable? • If we visualise the word vectors learned by skip-gram in two dimensions:

• v(king) - v(man) + v(woman) ≃ v(queen) (a small retrieval sketch follows at the end of this page)

[Scatter plot: "Country and Capital Vectors Projected by PCA" — countries (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal) and their capitals (Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon) appear with roughly parallel country→capital offsets.]

Figure 2: Two-dimensional PCA projection of the 1000-dimensional skip-gram vectors of countries and their capital cities. The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during the training we did not provide any supervised information about what a capital city means.

(Accompanying excerpt from the paper:) Negative sampling replaces every log P(w_O | w_I) term in the skip-gram objective; the task becomes distinguishing the target word w_O from k draws from a noise distribution P_n(w) using logistic regression. Values of k in the range 5–20 are useful for small training datasets, while for large datasets k can be as small as 2–5. Unlike NCE, negative sampling uses only samples, not the numerical probabilities of the noise distribution. The unigram distribution raised to the 3/4 power, U(w)^(3/4)/Z, significantly outperformed the unigram and uniform distributions as P_n(w). To counter the imbalance between rare and frequent words (e.g. "in", "the", "a"), each word w_i in the training set is discarded with probability

P(w_i) = 1 − sqrt(t / f(w_i)),   (5)

where f(w_i) is the word's frequency and t is a chosen threshold.
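Here is the retrieval sketch referred to above: the analogy is answered by nearest-neighbour search under cosine similarity. The 2-D hand-made vectors are purely illustrative stand-ins for learned skip-gram vectors:

import numpy as np

def most_similar(query, vectors, exclude=(), topn=3):
    """Rank words by cosine similarity of their vectors to `query`."""
    q = query / np.linalg.norm(query)
    sims = [(w, float(v @ q / np.linalg.norm(v)))
            for w, v in vectors.items() if w not in exclude]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:topn]

# Hand-made 2-D toy vectors (dimensions roughly "royalty" and "gender");
# real skip-gram vectors would be learned from a corpus.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]   # ≈ v(queen)
print(most_similar(query, vectors, exclude={"king", "man", "woman"}))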

Page 22: Deep Semantic Representation Learning (Deep Semantic Representations)

Our research results

Page 23: Deep Semantic Representation Learning (Deep Semantic Representations)

The meaning of a word is not unique • The same word can express different meanings depending on the situation in which it is used.

• A 軽い ("light") laptop PC (+) vs. a 軽い ("frivolous") man/woman (−).

• We therefore have to learn multiple meaning representations for the same word. [Neelakantan+ EMNLP-14]

• We have to accurately predict the sense that is commonly used in a given field (domain).

• Domain adaptation of meaning representations. [Bollegala+ ACL-15]

Page 24: Deep Semantic Representation Learning (Deep Semantic Representations)

Pivots • Words that have a similar meaning across different domains (semantically invariant words / semantic invariants).

• Examples: 値段 (price), 形 (shape), 安い (cheap), 高い (expensive); (excellent, cheap, digital).

• For pivots, we want the representations learned in each domain to end up close to each other.

• For the remaining (non-pivot) words, we want them to become able to predict the pivots in their own domain.

• Intuition: the pivots act as a bridge that pulls the different domains closer together.

Page 25: Deep Semantic Representation Learning (Deep Semantic Representations)

Loss function • The loss is measured with a ranked hinge loss. [Collobert + Weston ICML'08]

• Using the pivots that occur in a review d, adjust the representations so that the prediction score of a non-pivot that occurs in d becomes higher than that of a non-pivot that does not occur in d.

(Excerpt from [Bollegala+ ACL-15], Section 3.2:) A pivot c is represented in the source and target domains respectively by vectors c_S ∈ R^n and c_T ∈ R^n; a source-specific non-pivot w is represented by w_S, and a target-specific non-pivot by w_T. Following the bag-of-features model, a document d is the set of pivots and non-pivots it contains, and (c, w) ∈ d denotes the co-occurrence of a pivot c and a non-pivot w within a 10-token window. Word representations are learned by maximising the accuracy of predicting the non-pivots w that occur in the local context of a pivot c. The hinge loss for predicting a non-pivot w in a source document d ∈ D_S that co-occurs with a pivot c is

L(C_S, W_S) = Σ_{d∈D_S} Σ_{(c,w)∈d} Σ_{w*∼p(w)} max(0, 1 − c_Sᵀw_S + c_Sᵀw*_S).   (1)

Here w*_S is the source-domain representation of a non-pivot w* that does not occur in d; Eq. 1 requires that a non-pivot w co-occurring with c in d receive a higher ranking score (its inner product with c_S) than a non-pivot w* that does not occur in d. The w* are sampled from the marginal distribution of non-pivots p(w), raised to the 3/4 power and renormalised; k = 5 negative samples per co-occurrence was found to be an acceptable trade-off. Likewise, the loss for predicting non-pivots using pivots in the target domain is

L(C_T, W_T) = Σ_{d∈D_T} Σ_{(c,w)∈d} Σ_{w*∼p(w)} max(0, 1 − c_Tᵀw_T + c_Tᵀw*_T).   (2)

c_S: the representation of the pivot c in the source domain. w_S, w*_S: the representations of the non-pivots w and w* in the source domain, with w ∈ d and w* ∉ d.
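A minimal sketch of this ranked hinge loss for a single (pivot, observed non-pivot, sampled non-pivot) triple, together with the 3/4-power negative-sampling distribution; the dimensionality and counts are illustrative choices:

import numpy as np

rng = np.random.default_rng(1)

def ranked_hinge_loss(c, w, w_neg):
    """max(0, 1 - c·w + c·w*): zero once the observed non-pivot w outranks
    the absent non-pivot w* by a margin of 1."""
    return max(0.0, 1.0 - c @ w + c @ w_neg)

# sampling w* from the non-pivot marginal raised to the 3/4 power
counts = np.array([120.0, 40.0, 7.0, 3.0])   # toy corpus counts of 4 non-pivots
p = counts ** 0.75
p /= p.sum()
neg_id = rng.choice(len(counts), p=p)

c, w, w_neg = rng.normal(scale=0.1, size=(3, 50))
print(neg_id, ranked_hinge_loss(c, w, w_neg))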

Page 26: Deep Semantic Representation Learning (Deep Semantic Representations)

The overall loss function

L(C_S, W_S) = Σ_{d∈D_S} Σ_{(c,w)∈d} Σ_{w*∼p(w)} max(0, 1 − c_Sᵀw_S + c_Sᵀw*_S)   (1)

L(C_T, W_T) = Σ_{d∈D_T} Σ_{(c,w)∈d} Σ_{w*∼p(w)} max(0, 1 − c_Tᵀw_T + c_Tᵀw*_T)   (2)

(Excerpt from [Bollegala+ ACL-15], Sections 3.2–3.4:) Here w* denotes non-pivots that do not occur in d, randomly sampled from p(w). The source and target losses could be used on their own to learn the two sets of representations independently; however, pivots are by definition common to both domains, and this property is used to relate the source and target representations via a pivot regulariser

R(C_S, C_T) = (1/2) Σ_{i=1}^{K} ||c_S^(i) − c_T^(i)||²,   (3)

where ||x|| is the L2 norm of x and c^(i) is the i-th of the K pivots. The non-pivot representations in the two domains are linked through this regulariser because the non-pivots in each domain are predicted from that domain's pivot representations, which Eq. 3 in turn ties together. The overall objective that is minimised is the sum of the two losses regularised with coefficient λ:

L(C_S, W_S) + L(C_T, W_T) + λ R(C_S, C_T).   (4)

Training: the sub-gradients of Eq. 4 with respect to each parameter are

∂L/∂w_S = 0 if c_Sᵀ(w_S − w*_S) ≥ 1, and −c_S otherwise;   (5)
∂L/∂w*_S = 0 if c_Sᵀ(w_S − w*_S) ≥ 1, and c_S otherwise;   (6)
∂L/∂w_T = 0 if c_Tᵀ(w_T − w*_T) ≥ 1, and −c_T otherwise;   (7)
∂L/∂w*_T = 0 if c_Tᵀ(w_T − w*_T) ≥ 1, and c_T otherwise;   (8)
∂L/∂c_S = λ(c_S − c_T) if c_Sᵀ(w_S − w*_S) ≥ 1, and w*_S − w_S + λ(c_S − c_T) otherwise;   (9)
∂L/∂c_T = λ(c_T − c_S) if c_Tᵀ(w_T − w*_T) ≥ 1, and w*_T − w_T + λ(c_T − c_S) otherwise.   (10)

Mini-batch stochastic gradient descent with a batch size of 50 is used, with AdaGrad (Duchi et al., 2011) to schedule the learning rate; all representations are initialised with n-dimensional random vectors drawn from a zero-mean, unit-variance Gaussian. Although Eq. 4 is not jointly convex in all four representation sets, it is convex in the representation of a particular feature when all the others are held fixed, and training converged within 100 epochs in all cases. The rank-based hinge loss is inspired by Collobert et al. (2011) but uses a single, computationally efficient layer; like the skip-gram model, the method predicts context (non-pivot) occurrences within a fixed window of a target (pivot) word, but it scores co-occurrences directly by the inner product instead of a softmax, avoiding expensive normalisation over the entire vocabulary. Pivot selection: documents in D_S and D_T are tokenised and lemmatised with the Stanford CoreNLP toolkit (http://nlp.stanford.edu/software/corenlp.shtml), and unigrams and bigrams are extracted as features. (A small numeric sketch of one such update follows at the end of this page.)

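A compact sketch of one stochastic update on the source side of Eq. 4, following the sub-gradients in Eqs. 5, 6 and 9. The dimensionality, learning rate and λ are illustrative choices; the paper uses mini-batches of 50 with AdaGrad rather than this plain update:

import numpy as np

rng = np.random.default_rng(0)
n, lr, lam = 300, 0.05, 1.0

# one pivot (source/target copies) and two source non-pivots (observed w, sampled w*)
c_S, c_T = rng.normal(scale=0.1, size=(2, n))
w_S, w_neg_S = rng.normal(scale=0.1, size=(2, n))

def sgd_step_source(c_S, c_T, w_S, w_neg_S):
    """One update for a source co-occurrence (c, w) with negative sample w*."""
    violated = c_S @ (w_S - w_neg_S) < 1.0                      # hinge is active
    grad_c = lam * (c_S - c_T) + (w_neg_S - w_S if violated else 0.0)   # Eq. 9
    grad_w = -c_S if violated else 0.0                          # Eq. 5
    grad_w_neg = c_S if violated else 0.0                       # Eq. 6
    return (c_S - lr * grad_c, w_S - lr * grad_w, w_neg_S - lr * grad_w_neg)

c_S, w_S, w_neg_S = sgd_step_source(c_S, c_T, w_S, w_neg_S)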

Page 27: Deep Semantic Representation Learning (Deep Semantic Representations)

[Bar charts (Figure 1 of the paper): classification accuracies for each source→target domain pair — E→B, D→B, K→B; B→E, D→E, K→E; B→D, E→D, K→D; B→K, E→K, D→K — comparing NA, GloVe, SFA, SCL, CS, and the Proposed method.]

Figure 1: Accuracies obtained by different methods for each source-target pair in cross-domain sentiment classification.

(Excerpt from the experiments:) All methods use L2-regularised logistic regression as the binary sentiment classifier, with regularisation coefficients tuned on a validation set; SFA, SCL and CS use the same 500 pivots selected with NPMI, and the dimensionality n is set to 300 for both GloVe and the proposed method. The proposed method reports the highest classification accuracy in all 12 domain pairs; its improvements over NA, GloVe and CS are statistically significant and comparable with SFA and SCL. The improvement over CS shows the importance of predicting word representations instead of counting, and the improvement over GloVe shows that independently learning representations for the source and target domains is inadequate: the correspondences between the domains expressed by the pivots must be used to learn the representations jointly. Accuracy increases with dimensionality and saturates around 200–600 dimensions; beyond that, high-dimensional representations overfit and become unstable (Figure 2: Accuracy vs. dimensionality of the representation). Although the learned representations are not specific to sentiment classification, outperforming SFA and SCL in all domain pairs suggests wider applicability to other domain adaptation tasks. Code and data: http://tinyurl.com/njnk9g8

Using the right meaning representation for each domain improves sentiment classification performance!

Page 28: Deep Semantic Representation Learning (Deep Semantic Representations)

Learning representations of relations between words • How can we represent the relation that holds between two words? [Bollegala+ AAAI-15]

• If a word can be represented by a vector, then the relation between two words should be representable by a matrix.

• This "relation matrix" can be interpreted as selecting, from each word's meaning representation, only the attributes that contribute to the relation between the two words.

king   queen

0 1 1 0 1
1 0 1 0 1
1 1 1 0 1
0 0 0 0 0
1 1 1 0 1
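A toy sketch of this "attribute selector" reading of the relation matrix: a bilinear score x(king)ᵀ G x(queen) in which a diagonal 0/1 matrix keeps only the attributes relevant to the relation. The five attribute dimensions and the matrix below are my own illustration, not values from the paper:

import numpy as np

# toy 5-dimensional attribute vectors, e.g. [royal, person, female, fruit, ruler]
x_king  = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
x_queen = np.array([1.0, 1.0, 1.0, 0.0, 1.0])

# a relation matrix that selects the attributes (royal, person, ruler)
# relevant to the relation between the two words and ignores the rest
G = np.diag([1.0, 1.0, 0.0, 0.0, 1.0])

score = x_king @ G @ x_queen    # bilinear score x(u)^T G(l) x(v)
print(score)                    # 3.0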

Page 29: Deep Semantic Representation Learning (Deep Semantic Representations)

Learning method

(Excerpt from [Bollegala+ AAAI-15], "Learning Word Representations from Relational Graphs":) A relational graph G(V, E) is a directed, labelled, weighted graph in which the vertices V are the words in the vocabulary and the edges E are the co-occurrences between word pairs and patterns; a pattern is a two-argument predicate expressing a semantic relation between two words. An edge e ∈ E connecting u, v ∈ V is a tuple (u, v, l(e), w(e)), where l(e) is the pattern label that co-occurs with u and v in some context and w(e) is the co-occurrence strength between l(e) and the pair (u, v), which can be computed with an association measure such as positive pointwise mutual information (PPMI).

Figure 1: A relational graph between three words — ostrich →("X is a large Y", 0.8)→ bird, penguin →("X is a Y", 0.7)→ bird, and ostrich →("both X and Y are flightless", 0.5)→ penguin. For example, observing the context "ostrich is a large bird that lives in Africa" in a corpus yields the lexical pattern "X is a large Y" between ostrich and bird.

Given a relational graph, d-dimensional vectors are learned for each vertex. Two words u and v are represented by vectors x(u), x(v) ∈ R^d and a label l by a matrix G(l) ∈ R^(d×d), and the optimal word representations x̂(u) and pattern representations Ĝ(l) are the solution of the squared-loss minimisation

argmin_{x(u)∈R^d, G(l)∈R^(d×d)} (1/2) Σ_{(u,v,l,w)∈E} (x(u)ᵀ G(l) x(v) − w)².   (1)

Eq. 1 is jointly non-convex in the word and pattern representations, but if G(l) is positive semidefinite and one of the variables is held fixed, it becomes convex in the other, so Alternating Least Squares (ALS) can be used; stochastic gradient descent updates are derived from the squared loss E(e) of a single edge e = (u, v, l, w).


Labels in Eq. 1: G(l) is the relation matrix, the squared term is the squared error, x(u) and x(v) are the word representation vectors, and w is the co-occurrence strength of (u, v, l).
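A sketch of how the relational graph of Figure 1 could be represented in code and how the objective in Eq. 1 is evaluated. The edge weights mirror the figure; the random initial vectors and the 0.1·I relation matrices are illustrative assumptions (the PPMI weighting itself is sketched at the end of this document):

import numpy as np

rng = np.random.default_rng(0)
d = 10

# edges (u, v, pattern l, weight w): the relational graph of Figure 1
edges = [
    ("ostrich", "bird",    "X is a large Y",              0.8),
    ("penguin", "bird",    "X is a Y",                    0.7),
    ("ostrich", "penguin", "both X and Y are flightless", 0.5),
]

words = {w for u, v, _, _ in edges for w in (u, v)}
patterns = {l for _, _, l, _ in edges}
x = {w: rng.normal(scale=0.1, size=d) for w in words}   # word vectors x(u)
G = {l: 0.1 * np.eye(d) for l in patterns}              # relation matrices G(l)

def objective(x, G, edges):
    """Eq. 1: (1/2) * sum over edges of (x(u)^T G(l) x(v) - w)^2."""
    return 0.5 * sum((x[u] @ G[l] @ x[v] - w) ** 2 for u, v, l, w in edges)

print(objective(x, G, edges))   # loss before any training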

Page 30: Deep Semantic Representation Learning (Deep Semantic Representations)

Optimisation


• The objective function is non-convex in the variables x(u), G(l), and x(v) jointly.

• However, if any two of these variables are held fixed, the objective becomes convex in the remaining one (provided that G(l) is a positive definite matrix).

• Therefore we can take the partial derivative of the objective with respect to each variable and optimise it with stochastic gradient descent (sketched below).

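A minimal sketch of the SGD step for a single edge (u, v, l, w) of the relational graph. The gradients follow from differentiating the squared error 0.5·(x(u)ᵀG(l)x(v) − w)²; the dimensionality, scales and learning rate are illustrative choices, and the paper alternates updates (ALS-style) rather than optimising everything in one pass:

import numpy as np

def sgd_update(x_u, x_v, G_l, w, lr=0.05):
    """One SGD step on the squared error 0.5 * (x(u)^T G(l) x(v) - w)^2."""
    r = x_u @ G_l @ x_v - w                  # residual
    grad_xu = r * (G_l @ x_v)
    grad_xv = r * (G_l.T @ x_u)
    grad_G  = r * np.outer(x_u, x_v)
    return x_u - lr * grad_xu, x_v - lr * grad_xv, G_l - lr * grad_G

rng = np.random.default_rng(0)
d = 10
x_u, x_v = rng.normal(scale=0.3, size=(2, d))
G_l = 0.1 * np.eye(d)
for _ in range(500):                         # fit one edge with target weight 0.8
    x_u, x_v, G_l = sgd_update(x_u, x_v, G_l, w=0.8)
print(x_u @ G_l @ x_v)                       # close to 0.8 after fitting this single edge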

Page 31: Deep Semantic Representation Learning (Deep Semantic Representations)

Performance on analogy prediction

Method      capital-common  capital-world  city-in-state  family (gender)  currency  overall
SVD+LEX     11.43           5.43           0              9.52             0         3.84
SVD+POS     4.57            9.06           0              29.05            0         6.57
SVD+DEP     5.88            3.02           0              0                0         1.11
CBOW        8.49            5.26           4.95           47.82            2.37      10.58
skip-gram   9.15            9.34           5.97           67.98            5.29      14.86
GloVe       4.24            4.93           4.35           65.41            0         11.89
Prop+LEX    22.87           31.42          15.83          61.19            25.0      26.61
Prop+POS    22.55           30.82          14.98          60.48            20.0      25.35
Prop+DEP    20.92           31.40          15.27          56.19            20.0      24.68

Page 32: Deep Semantic Representation Learning (Deep Semantic Representations)

Deriving relations from words

• v(king) - v(man) should be representing the relation between king and man; otherwise analogy problems could not be solved (relational similarity could not be measured).

• If so, taking the difference between the representation vectors of words linked by a particular relation should let us build a representation of that relation. [Bollegala+ IJCAI-15]

(Excerpt from [Bollegala+ IJCAI-15], Section 3:) The local context in which two words co-occur provides useful information about the semantic relations between them; for example, from "Ostrich is a large bird that primarily lives in Africa" we can infer that the relation IS-A-LARGE holds between ostrich and bird. Lexical patterns (unigrams and bigrams extracted from the midfix, i.e. the tokens appearing between the two words) are used as features representing the semantic relation, and the strength of association between a word pair (u, v) and a pattern p is measured by PPMI:

f(p, u, v) = max(0, log( g(p, u, v) g(*, *, *) / ( g(p, *, *) g(*, u, v) ) )),   (1)

where g(p, u, v) is the number of co-occurrences between p and (u, v) and * denotes summation over the corresponding slot. A pattern p is represented by the set R(p) of word pairs with positive association and its norm |R(p)|:

R(p) = {(u, v) | f(p, u, v) > 0},   (2)
|R(p)| = Σ_{(u,v)∈R(p)} f(p, u, v).   (3)

Each word x is represented by a vector x ∈ R^d. Because the difference between two word vectors closely approximates the relation between the words (e.g. v(king) − v(queen) is similar to v(man) − v(woman)), a pattern p is represented by the weighted sum of differences over all word pairs that co-occur with it:

p = (1/|R(p)|) Σ_{(u,v)∈R(p)} f(p, u, v) (u − v).   (4)

For example (Figure 1), if (lion, cat) co-occurs with p1 = "large Ys such as Xs" and (ostrich, bird) with p2 = "X is a huge Y", and there are no other co-occurrences, then p1 = x1 − x2 and p2 = x3 − x4, and the relational similarity between (x1, x2) and (x3, x4) is measured by the inner product p1ᵀp2. Word representation learning is then modelled as a binary classification task: learn word vectors such that they accurately predict whether a given pair of patterns is relationally similar. With target label t(p1, p2) ∈ {1, 0}, the prediction loss is the squared loss

L(t(p1, p2), p1, p2) = (1/2) (t(p1, p2) − σ(p1ᵀp2))²,   (5)

where the prediction function σ(·) is the hyperbolic tangent,

σ(θ) = tanh(θ) = (exp(θ) − exp(−θ)) / (exp(θ) + exp(−θ)),   (6)

which worked particularly well among several non-linearities. The update rule for a word vector x follows from the chain rule,

∂L/∂x = (∂L/∂p1)(∂p1/∂x) + (∂L/∂p2)(∂p2/∂x),   (7)

with

∂L/∂p1 = σ'(p1ᵀp2) (σ(p1ᵀp2) − t(p1, p2)) p2,   (8)
∂L/∂p2 = σ'(p1ᵀp2) (σ(p1ᵀp2) − t(p1, p2)) p1.   (9)
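A small sketch of Eqs. 4–6: the pattern vector as a PPMI-weighted sum of word-vector differences, scored with tanh of the inner product. The word vectors and PPMI weights below are illustrative toys:

import numpy as np

rng = np.random.default_rng(0)
d = 10
x = {w: rng.normal(scale=0.3, size=d) for w in ["lion", "cat", "ostrich", "bird"]}

def pattern_vector(pairs):
    """Eq. 4: p = (1/|R(p)|) * sum_{(u,v)} f(p,u,v) * (x(u) - x(v))."""
    norm = sum(f for _, _, f in pairs)
    return sum(f * (x[u] - x[v]) for u, v, f in pairs) / norm

# each pattern co-occurs with a single word pair here (PPMI weights are made up)
p1 = pattern_vector([("lion", "cat", 1.3)])       # "large Ys such as Xs"
p2 = pattern_vector([("ostrich", "bird", 0.9)])   # "X is a huge Y"

t = 1.0                                 # label: the two patterns are relationally similar
pred = np.tanh(p1 @ p2)                 # Eqs. 5-6
loss = 0.5 * (t - pred) ** 2
print(pred, loss)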

Page 33: Deep Semantic Representation Learning (Deep Semantic Representations)

Learning the meaning representations

3 Learning Word Representations

The local context in which two words co-occur provides useful information regarding the semantic relations that exist between those two words. For example, from the sentence "Ostrich is a large bird that primarily lives in Africa", we can infer that the semantic relation IS-A-LARGE exists between ostrich and bird. Prior work on relational similarity measurement has successfully used such lexical patterns as features to represent the semantic relations that exist between two words [Turney, 2006]. According to the relational duality hypothesis [Bollegala et al., 2010], a semantic relation R can be expressed either extensionally, by enumerating word-pairs for which R holds, or intensionally, by stating lexico-syntactic patterns that define the properties of R.

Following this prior work, we extract lexical patterns from the co-occurring contexts of two words to represent the semantic relations between those two words. Specifically, we extract unigrams and bigrams of tokens as patterns from the midfix (i.e. the sequence of tokens that appear in between the given two words in a context). Although we use lexical patterns as features for representing semantic relations in this work, the proposed method is not limited to lexical patterns and can in principle be used with any type of features that represent relations. The strength of association between a word pair (u, v) and a pattern p is measured using the positive pointwise mutual information (PPMI), f(p, u, v), which is defined as follows,

f(p, u, v) = \max\!\left(0,\ \log \frac{g(p, u, v)\, g(*, *, *)}{g(p, *, *)\, g(*, u, v)}\right).   (1)

Here, g(p, u, v) denotes the number of co-occurrences between p and (u, v), and * denotes the summation taken over all words (or patterns) corresponding to the slot variable. We represent a pattern p by the set R(p) of word-pairs (u, v) for which f(p, u, v) > 0. Formally, we define R(p) and its norm |R(p)| as follows,

R(p) = \{(u, v) \mid f(p, u, v) > 0\}   (2)

|R(p)| = \sum_{(u, v) \in R(p)} f(p, u, v)   (3)

We represent a word x by a vector \mathbf{x} \in \mathbb{R}^d. The dimensionality of the representation, d, is a hyperparameter of the proposed method. Prior work on word representation learning has observed that the difference between the vectors that represent two words closely approximates the semantic relations that exist between those two words. For example, the vector v(king) - v(queen) has been shown to be similar to the vector v(man) - v(woman). We use this property to represent a pattern p by a vector \mathbf{p} \in \mathbb{R}^d, computed as the weighted sum of the differences between the two words in all word-pairs (u, v) that co-occur with p:

\mathbf{p} = \frac{1}{|R(p)|} \sum_{(u, v) \in R(p)} f(p, u, v)\,(\mathbf{u} - \mathbf{v}).   (4)

For example, consider Fig. 1, where the two word-pairs (lion, cat) and (ostrich, bird) co-occur respectively with the two lexical patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y".

[Figure 1: Computing the similarity between two patterns. The word-pairs (x1, x2) = (lion, cat) and (x3, x4) = (ostrich, bird) connect to the patterns p1 = "large Ys such as Xs" and p2 = "X is a huge Y" with edge weights ±f(p, ·, ·); the similarity between the two patterns is σ(p1ᵀp2).]

Assuming that there are no other co-occurrences between word-pairs and patterns in the corpus, the representations of the patterns p1 and p2 are given respectively by \mathbf{p}_1 = \mathbf{x}_1 - \mathbf{x}_2 and \mathbf{p}_2 = \mathbf{x}_3 - \mathbf{x}_4. We measure the relational similarity between (x1, x2) and (x3, x4) using the inner product \mathbf{p}_1^\top \mathbf{p}_2.

We model the problem of learning word representations as a binary classification task, where we learn representations for words such that they can be used to accurately predict whether a given pair of patterns is relationally similar. In our previous example, we would learn representations for the four words lion, cat, ostrich, and bird such that the similarity between the two patterns "large Ys such as Xs" and "X is a huge Y" is maximized. Later, in Section 3.1, we propose an unsupervised method for selecting relationally similar (positive) and dissimilar (negative) pairs of patterns as training instances to train a binary classifier.

Let us denote the target label for two patterns p1, p2 by t(p_1, p_2) \in \{1, 0\}, where the value 1 indicates that p1 and p2 are relationally similar, and 0 otherwise. We compute the prediction loss for a pair of patterns (p1, p2) as the squared loss between the target and the predicted labels,

L(t(p_1, p_2), \mathbf{p}_1, \mathbf{p}_2) = \frac{1}{2}\big(t(p_1, p_2) - \sigma(\mathbf{p}_1^\top \mathbf{p}_2)\big)^2.   (5)

Different non-linear functions can be used as the prediction function σ(·), such as the logistic sigmoid, the hyperbolic tangent, or rectified linear units. In our preliminary experiments we found the hyperbolic tangent, tanh, given by

\sigma(\theta) = \tanh(\theta) = \frac{\exp(\theta) - \exp(-\theta)}{\exp(\theta) + \exp(-\theta)},   (6)

to work particularly well among those different non-linearities.
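To make Eqs. (3)-(6) concrete, the following is a minimal NumPy sketch (not the authors' implementation); the function names, the toy word vectors, and the PPMI weights of 1.0 are our own assumptions.

```python
import numpy as np

def pattern_vector(R_p, word_vec):
    """Eq. (4): the pattern representation is the PPMI-weighted average of the
    word-vector differences (u - v) over all word-pairs that co-occur with p.

    R_p      : dict mapping a word-pair (u, v) to its PPMI weight f(p, u, v) > 0
    word_vec : dict mapping a word to its np.ndarray representation of shape (d,)
    """
    total = sum(R_p.values())                                   # |R(p)|, Eq. (3)
    weighted = sum(f * (word_vec[u] - word_vec[v]) for (u, v), f in R_p.items())
    return weighted / total

def pair_loss(p1, p2, t):
    """Eq. (5) with the tanh prediction function of Eq. (6)."""
    return 0.5 * (t - np.tanh(p1 @ p2)) ** 2

# Toy example mirroring Fig. 1 (random 4-dimensional vectors; the weights are hypothetical).
rng = np.random.default_rng(0)
word_vec = {w: rng.normal(size=4) for w in ("lion", "cat", "ostrich", "bird")}
p1 = pattern_vector({("lion", "cat"): 1.0}, word_vec)           # "large Ys such as Xs"
p2 = pattern_vector({("ostrich", "bird"): 1.0}, word_vec)       # "X is a huge Y"
print(pair_loss(p1, p2, t=1))                                   # loss for a positive pair
```

With only a single word-pair per pattern, pattern_vector reduces to the plain difference u - v, exactly as in the Fig. 1 example.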

To derive the update rule for word representations, let us consider the derivative of the loss w.r.t. the word representation \mathbf{x} of a word x,

\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{p}_1}\frac{\partial \mathbf{p}_1}{\partial \mathbf{x}} + \frac{\partial L}{\partial \mathbf{p}_2}\frac{\partial \mathbf{p}_2}{\partial \mathbf{x}},   (7)

where the partial derivatives of the loss w.r.t. the pattern representations are given by Eqs. (8) and (9) below.


A semantic relation is represented by a set of lexical patterns.


The relation between two words u and v is given by the "subtraction" (vector difference) of their meaning representation vectors.


\frac{\partial L}{\partial \mathbf{p}_1} = \sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\,\big(\sigma(\mathbf{p}_1^\top \mathbf{p}_2) - t(p_1, p_2)\big)\,\mathbf{p}_2,   (8)

\frac{\partial L}{\partial \mathbf{p}_2} = \sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\,\big(\sigma(\mathbf{p}_1^\top \mathbf{p}_2) - t(p_1, p_2)\big)\,\mathbf{p}_1.   (9)

Here, \sigma' denotes the first derivative of tanh, which is given by 1 - \sigma(\theta)^2. To simplify the notation we drop the arguments of the loss function.

From Eq. (4) we get,

\frac{\partial \mathbf{p}_1}{\partial \mathbf{x}} = \frac{1}{|R(p_1)|}\big(h(p_1, u = x, v) - h(p_1, u, v = x)\big),   (10)

\frac{\partial \mathbf{p}_2}{\partial \mathbf{x}} = \frac{1}{|R(p_2)|}\big(h(p_2, u = x, v) - h(p_2, u, v = x)\big),   (11)

where,

h(p, u = x, v) = \sum_{(x, v) \in \{(u, v) \mid (u, v) \in R(p),\ u = x\}} f(p, x, v),

and

h(p, u, v = x) = \sum_{(u, x) \in \{(u, v) \mid (u, v) \in R(p),\ v = x\}} f(p, u, x).

Substituting the partial derivatives given by Eqs. (8)-(11) in Eq. (7) we get,

\frac{\partial L}{\partial \mathbf{x}} = \delta(p_1, p_2)\Big[ H(p_1, x)\!\!\sum_{(u, v) \in R(p_2)}\!\! f(p_2, u, v)(\mathbf{u} - \mathbf{v}) + H(p_2, x)\!\!\sum_{(u, v) \in R(p_1)}\!\! f(p_1, u, v)(\mathbf{u} - \mathbf{v}) \Big],   (12)

where \delta(p_1, p_2) is defined as

\delta(p_1, p_2) = \frac{\sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\,\big(t(p_1, p_2) - \sigma(\mathbf{p}_1^\top \mathbf{p}_2)\big)}{|R(p_1)|\,|R(p_2)|},   (13)

and H(p, x) is defined as

H(p, x) = h(p, u = x, v) - h(p, u, v = x).   (14)

We use stochastic gradient descent (SGD) with the learning rate adapted by AdaGrad [Duchi et al., 2011] to update the word representations. The pseudo code for the proposed method is shown in Algorithm 1. Given a set of N relationally similar and dissimilar pattern-pairs, \{(p_1^{(i)}, p_2^{(i)}, t(p_1^{(i)}, p_2^{(i)}))\}_{i=1}^{N}, Algorithm 1 initializes each word x_j in the vocabulary with a vector \mathbf{x}_j \in \mathbb{R}^d. The initialization can be conducted either using vectors randomly sampled from a zero-mean, unit-variance Gaussian distribution, or using pre-trained word representations. In our preliminary experiments, we found that the word vectors learnt by GloVe [Pennington et al., 2014] perform consistently better than random vectors when used as the initial word representations in the proposed method. Because word vectors trained using existing word representation methods already demonstrate a certain degree of relational structure with respect to proportional analogies, we believe that initializing with pre-trained word vectors assists the subsequent optimization process.

Algorithm 1: Learning word representations.

Input: Training pattern-pairs \{(p_1^{(i)}, p_2^{(i)}, t(p_1^{(i)}, p_2^{(i)}))\}_{i=1}^{N}, dimensionality d of the word representations, and the maximum number of iterations T.
Output: Representation \mathbf{x}_j \in \mathbb{R}^d of each word x_j, for j = 1, ..., M, where M is the vocabulary size.

1:  Initialize the word vectors \{\mathbf{x}_j\}_{j=1}^{M}.
2:  for t = 1 to T do
3:    for k = 1 to K do
4:      \mathbf{p}_k = \frac{1}{|R(p_k)|} \sum_{(u, v) \in R(p_k)} f(p_k, u, v)(\mathbf{u} - \mathbf{v})
5:    end for
6:    for i = 1 to N do
7:      for j = 1 to M do
8:        \mathbf{x}_j = \mathbf{x}_j - \alpha_j^{(t)} \frac{\partial L}{\partial \mathbf{x}_j}
9:      end for
10:   end for
11: end for
12: return \{\mathbf{x}_j\}_{j=1}^{M}

During each iteration, Algorithm 1 alternates between two steps. First, in Lines 3-5, it computes pattern representations using Eq. (4) from the current word representations for all the patterns (K in total) in the training dataset. Second, in Lines 6-10, for each training pattern-pair we compute the derivative of the loss according to Eq. (12), and update the word representations. These two steps are repeated for T iterations, after which the final set of word representations is returned.

The computational complexity of Algorithm 1 is O(TKd + TNMd), where d is the dimensionality of the word representations. Naively iterating over N training instances and M vocabulary words can be prohibitively expensive for large training datasets and vocabularies. However, in practice we can compute the updates efficiently using two tricks: delayed updates and indexing. Once we have computed the pattern representations for all K patterns in the first iteration, we can postpone the update of the representation of a pattern until that pattern next appears in a training instance. This reduces the number of patterns updated in each iteration to a maximum of 2 instead of K for the iterations t > 1. Because of the sparseness of co-occurrences, only a handful (ca. 100) of patterns co-occur with any given word-pair. Therefore, by pre-compiling an index from each pattern to the words with which that pattern co-occurs, we can limit the update of word representations in Line 8 to a much smaller number of words than M. Moreover, the vector subtraction can be parallelized across the dimensions. Although the loss function defined by Eq. (5) is non-convex w.r.t. the word representations, in practice Algorithm 1 converges after a few (less than 5) iterations. In practice, it requires less than an hour to train from a 2 billion word corpus with N = 100,000, T = 10, K = 10,000, and M = 210,914.
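The per-pair update in Lines 6-10 can be sketched as follows. This is our own illustration rather than the authors' code: the helper names, the per-coordinate AdaGrad accumulator, and the sign convention (taken as the exact gradient of Eq. (5)) are assumptions. It also shows the indexing trick: only words appearing in R(p1) or R(p2) are touched.

```python
import numpy as np

def H(R_p, x):
    """Eq. (14): total weight of pairs in R(p) where x is the first word,
    minus the total weight of pairs where x is the second word."""
    return (sum(f for (u, v), f in R_p.items() if u == x)
            - sum(f for (u, v), f in R_p.items() if v == x))

def update_pair(R_p1, R_p2, t, word_vec, grad_sq, lr=0.1, eps=1e-8):
    """One per-pair SGD step (Lines 6-10 of Algorithm 1) with AdaGrad scaling."""
    norm1, norm2 = sum(R_p1.values()), sum(R_p2.values())
    sum1 = sum(f * (word_vec[u] - word_vec[v]) for (u, v), f in R_p1.items())
    sum2 = sum(f * (word_vec[u] - word_vec[v]) for (u, v), f in R_p2.items())
    s = np.tanh((sum1 / norm1) @ (sum2 / norm2))               # sigma(p1 . p2)
    # Gradient scale; cf. Eqs. (12)-(13), with the sign fixed so that this is
    # the exact gradient of the squared loss in Eq. (5).
    coef = (1 - s ** 2) * (s - t) / (norm1 * norm2)
    for x in {w for pair in list(R_p1) + list(R_p2) for w in pair}:
        grad = coef * (H(R_p1, x) * sum2 + H(R_p2, x) * sum1)  # Eq. (12)
        grad_sq[x] = grad_sq.get(x, 0.0) + grad ** 2           # AdaGrad accumulator
        word_vec[x] -= lr * grad / np.sqrt(grad_sq[x] + eps)
```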

Page 34: 深層意味表現学習 (Deep Semantic Representations)

Performance on analogy prediction

34

Lexical patterns contain sequences of multiple words. Therefore, exact occurrences of lexical patterns are rare compared to those of individual words, even in large corpora. Directly learning representations for lexical patterns from their co-occurrence statistics leads to data sparseness issues, which become problematic when applying existing methods proposed for learning representations of single words to learn representations for lexical patterns that consist of multiple words. The proposal made in Eq. (4) to compute representations for patterns circumvents this data sparseness issue by indirectly modeling patterns through word representations.

3.1 Selecting Similar/Dissimilar Pattern-Pairs

We use the ukWaC corpus [1] to extract relationally similar (positive) and dissimilar (negative) pairs of patterns (p_i, p_j) to train the proposed method. The ukWaC is a 2 billion word corpus constructed from the Web, limiting the crawl to the .uk domain. We select word-pairs that co-occur in at least 50 sentences within a co-occurrence window of 5 tokens. Moreover, using a stop word list, we ignore word-pairs that consist purely of stop words. We obtain 210,914 word-pairs from this step. Next, we extract lexical patterns for those word-pairs by replacing the first and second word in a word-pair respectively by the slot variables X and Y within a co-occurrence window of 5 tokens. We select the top-occurring 10,000 lexical patterns (i.e. K = 10,000) for further processing.

We represent a pattern p by a vector whose elements correspond to the PPMI values f(p, u, v) between p and all the word-pairs (u, v) that co-occur with p. Next, we compute the cosine similarity between all pairwise combinations of the 10,000 patterns, and rank the pattern pairs in descending order of their cosine similarities. We select the top-ranked 50,000 pattern-pairs as positive training instances. We select 50,000 pattern-pairs with non-zero similarity scores from the bottom of the list as negative training instances. The reason for not selecting pattern-pairs with zero similarity scores is that such patterns do not share any word-pairs in common, and are not informative as training data for updating word representations. Thus, the total number of training instances we select is N = 50,000 + 50,000 = 100,000.
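A sketch of this selection step, assuming the PPMI values are held in plain dictionaries; the brute-force O(K^2) pairwise loop is for illustration only (with K = 10,000 one would use a sparse matrix product instead).

```python
import numpy as np
from itertools import combinations

def select_training_pairs(ppmi, n_pos=50_000, n_neg=50_000):
    """Rank pattern pairs by the cosine similarity of their PPMI vectors; the top
    pairs become positives (t = 1) and the lowest-ranked pairs with non-zero
    similarity become negatives (t = 0).

    ppmi : dict mapping a pattern to {word_pair: PPMI value}
    """
    def cosine(a, b):
        shared = set(a) & set(b)
        if not shared:
            return 0.0
        dot = sum(a[k] * b[k] for k in shared)
        return dot / (np.sqrt(sum(v * v for v in a.values())) *
                      np.sqrt(sum(v * v for v in b.values())))

    scored = [(p, q, cosine(ppmi[p], ppmi[q])) for p, q in combinations(ppmi, 2)]
    scored = [s for s in scored if s[2] > 0]      # zero similarity: no shared word-pairs
    scored.sort(key=lambda s: s[2], reverse=True)
    return ([(p, q, 1) for p, q, _ in scored[:n_pos]] +
            [(p, q, 0) for p, q, _ in scored[-n_neg:]])
```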

4 Evaluating Word Representations using Proportional Analogies

To evaluate the ability of the proposed method to learn word representations that embed information related to semantic relations, we apply it to detect proportional analogies. For example, consider the proportional analogy man:woman :: king:queen. Given the first three words, a word representation learning method is required to find the fourth word from the vocabulary that maximizes the relational similarity between the two word-pairs in the analogy. Three benchmark datasets have been popularly used in prior work for evaluating analogies: the Google dataset [Mikolov et al., 2013c] (10,675 syntactic analogies and 8,869 semantic analogies), the SemEval dataset [Jurgens et al., 2012] (79 questions), and the SAT dataset [Turney, 2006] (374 questions). For the Google dataset, the set of candidates for the fourth word consists of all the words in the vocabulary. For the SemEval and SAT datasets, each question word-pair is assigned a limited number of candidate word-pairs, out of which only one is correct. The accuracy of a word representation is evaluated by the percentage of correctly answered analogy questions out of all the questions in a dataset. We do not skip any questions in our evaluations.

[1] http://wacky.sslmit.unibo.it

Table 1: Word analogy results on benchmark datasets.

Method               sem.   synt.  total  SAT    SemEval
ivLBL CosAdd         63.60  61.80  62.60  20.85  34.63
ivLBL CosMult        65.20  63.00  64.00  19.78  33.42
ivLBL PairDiff       52.60  48.50  50.30  22.45  36.94
skip-gram CosAdd     31.89  67.67  51.43  29.67  40.89
skip-gram CosMult    33.98  69.62  53.45  28.87  38.54
skip-gram PairDiff    7.20  19.73  14.05  35.29  43.99
CBOW CosAdd          39.75  70.11  56.33  29.41  40.31
CBOW CosMult         38.97  70.39  56.13  28.34  38.19
CBOW PairDiff         5.76  13.43   9.95  33.16  42.89
GloVe CosAdd         86.67  82.81  84.56  27.00  40.11
GloVe CosMult        86.84  84.80  85.72  25.66  37.56
GloVe PairDiff       45.93  41.23  43.36  44.65  44.67
Prop CosAdd          86.70  85.35  85.97  29.41  41.86
Prop CosMult         86.91  87.04  86.98  28.87  39.67
Prop PairDiff        41.85  42.86  42.40  45.99  44.88

Given a proportional analogy a : b :: c : d, we use the following measures proposed in prior work for measuring the relational similarity between (a, b) and (c, d).

CosAdd, proposed by Mikolov et al. [2013d], ranks candidates d according to the formula

\mathrm{CosAdd}(a\!:\!b,\ c\!:\!d) = \cos(\mathbf{b} - \mathbf{a} + \mathbf{c},\ \mathbf{d}),   (15)

and selects the top-ranked candidate as the correct answer.

CosMult: The CosAdd measure can be decomposed into the summation of three cosine similarities, where in practice one of the three terms often dominates the sum. To overcome this bias in CosAdd, Levy and Goldberg [2014] proposed the CosMult measure given by,

\mathrm{CosMult}(a\!:\!b,\ c\!:\!d) = \frac{\cos(\mathbf{b}, \mathbf{d})\,\cos(\mathbf{c}, \mathbf{d})}{\cos(\mathbf{a}, \mathbf{d}) + \epsilon}.   (16)

We convert all cosine values x \in [-1, 1] to positive values using the transformation (x + 1)/2. Here, \epsilon is a small constant that prevents the denominator from becoming zero, and is set to 10^{-5} in the experiments.

PairDiff measures the cosine similarity between the two vectors that correspond to the differences of the word representations of the two words in each word-pair. It follows from our hypothesis that the semantic relation between two words can be represented by the vector difference of their word representations. PairDiff has been used by Mikolov et al. [2013d] for detecting semantic analogies and is given by,

\mathrm{PairDiff}(a\!:\!b,\ c\!:\!d) = \cos(\mathbf{b} - \mathbf{a},\ \mathbf{d} - \mathbf{c}).   (17)
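The three measures are straightforward to state in code. Below is a hedged sketch assuming the embeddings are NumPy arrays; the helper names are our own.

```python
import numpy as np

def cos(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def _pos(c):                                  # map a cosine from [-1, 1] to [0, 1]
    return (c + 1) / 2

def cos_add(a, b, c, d):                      # Eq. (15)
    return cos(b - a + c, d)

def cos_mult(a, b, c, d, eps=1e-5):           # Eq. (16)
    return _pos(cos(b, d)) * _pos(cos(c, d)) / (_pos(cos(a, d)) + eps)

def pair_diff(a, b, c, d):                    # Eq. (17)
    return cos(b - a, d - c)

def answer(a, b, c, candidates, score=cos_add):
    """Return the candidate word whose vector maximises the chosen measure;
    a, b, c are vectors and `candidates` maps words to vectors."""
    return max(candidates, key=lambda w: score(a, b, c, candidates[w]))
```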

5 Experiments and Results

In Table 1, we compare the proposed method against previously proposed word representation learning methods:

Page 35: 深層意味表現学習 (Deep Semantic Representations)

Corpus vs. Dictionary

• As long as we have a corpus, we can learn distributed semantic representations of words (and relations).

• However, dictionaries, which humans have spent many years building, already define the meanings of words.

• Can we learn more accurate semantic representations by using both? [Bollegala+ AAAI-15]

• In particular, when the corpus is incomplete, a dictionary (ontology) helps.

• Example sentence: 私は犬と猫が好きだ ("I like dogs and cats").

35

Page 36: 深層意味表現学習 (Deep Semantic Representations)

JointReps

• Predict the words that co-occur within the same sentence in the corpus, and minimize the resulting error (objective function).

• Add the semantic relations defined in a dictionary (WordNet) as constraints.

36

… However, they do not consider the semantic relations between words and only consider words that are listed as related in BabelNet, which encompasses multiple semantic relations. Bollegala et al. [2014] proposed a method for learning word representations from a relational graph, where they represent words and relations respectively by vectors and matrices. Their method can be applied to either a manually created relational graph or one automatically extracted from data. However, during training they use only the relational graph and do not use the corpus.

3 Learning Word Representations

Given a corpus C and a semantic lexicon S, we describe a method for learning word representations \mathbf{w}_i \in \mathbb{R}^d for the words w_i in the corpus. We use the boldface \mathbf{w}_i to denote the word (vector) representation of the i-th word w_i, and the vocabulary (i.e., the set of all words in the corpus) is denoted by V. The dimensionality d of the vector representation is a hyperparameter of the proposed method that must be specified by the user in advance. Any semantic lexicon that specifies the semantic relations that exist between words could be used as S, such as the WordNet [Miller, 1995], FrameNet [Baker et al., 1998], or the Paraphrase Database [Ganitkevitch et al., 2013]. In particular, we do not assume any structural properties unique to a particular semantic lexicon. In the experiments described in this paper we use the WordNet as the semantic lexicon.

Following Pennington et al. [2014], we first create a co-occurrence matrix X in which the words that we would like to learn representations for (target words) are arranged in the rows of X, whereas the words that co-occur with the target words in some context (context words) are arranged in the columns of X. The (i, j)-th element X_{ij} of X is set to the total number of co-occurrences of i and j in the corpus. Following the recommendations of prior work on word representation learning [Levy et al., 2015], we set the context window to the 10 tokens preceding and succeeding a word in a sentence. We then extract unigrams from the co-occurrence windows as the corresponding context words. We down-weight distant (and potentially noisy) co-occurrences using the reciprocal 1/l of the distance in tokens l between the two words that co-occur.
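A minimal sketch of this co-occurrence counting with the 1/l distance weighting, assuming pre-tokenised sentences (vocabulary cut-offs and other preprocessing are omitted):

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=10):
    """Accumulate weighted co-occurrence counts X[target][context], down-weighting
    a co-occurrence at distance l tokens by 1/l."""
    X = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:                       # each sentence is a list of tokens
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[target][tokens[j]] += 1.0 / abs(i - j)
    return X

X = cooccurrence_matrix([["I", "like", "dogs", "and", "cats"]])
print(X["dogs"]["cats"])   # 0.5, since the two words are 2 tokens apart
```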

A word w_i is assigned two vectors \mathbf{w}_i and \tilde{\mathbf{w}}_i, denoting whether w_i is respectively the target of the prediction (corresponding to the rows of X) or in the context of another word (corresponding to the columns of X). The GloVe objective can then be written as:

J_C = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2   (1)

Here, b_i and \tilde{b}_j are real-valued scalar bias terms that adjust for the difference between the inner product and the logarithm of the co-occurrence counts. The function f discounts the co-occurrences between frequent words and is given by:

f(t) = \begin{cases} (t / t_{\max})^{\alpha} & \text{if } t < t_{\max} \\ 1 & \text{otherwise} \end{cases}   (2)

Following Pennington et al. [2014], we set \alpha = 0.75 and t_{\max} = 100 in our experiments. The objective function defined by (1) encourages the learning of word representations with the desirable property that the vector difference between the embeddings of two words represents the semantic relations that exist between those two words. For example, Mikolov et al. [2013c] observed that the difference between the word embeddings of the words king and man, when added to the word embedding of the word woman, yields a vector similar to that of queen.

Unfortunately, the objective function given by (1) does not capture the semantic relations that exist between w_i and w_j as specified in the lexicon S. Consequently, it considers all co-occurrences equally and is likely to encounter problems when the co-occurrences are rare. To overcome this problem we propose a regularizer, J_S, by considering the three-way co-occurrence among the words w_i, w_j, and a semantic relation R that exists between the target word w_i and one of its context words w_j in the lexicon, as follows:

J_S = \frac{1}{2} \sum_{i \in V} \sum_{j \in V} R(i, j)\, \lVert \mathbf{w}_i - \tilde{\mathbf{w}}_j \rVert^2   (3)

Here, R(i, j) is a binary function that returns 1 if the semantic relation R exists between the words w_i and w_j in the lexicon, and 0 otherwise. In general, semantic relations are asymmetric; thus, R(i, j) ≠ R(j, i). Experimentally, we consider both symmetric relation types, such as synonymy and antonymy, as well as asymmetric relation types, such as hypernymy and meronymy. The regularizer given by (3) enforces the constraint that words that are connected by a semantic relation R in the lexicon must have similar word representations.

We would like to learn target and context word representations \mathbf{w}_i, \tilde{\mathbf{w}}_j that simultaneously minimize both (1) and (3). Therefore, we formulate the joint objective as a minimization problem as follows:

J = J_C + \lambda J_S   (4)

Here, \lambda \in \mathbb{R}^{+} is a non-negative real-valued regularization coefficient that determines the influence imparted by the semantic lexicon on the word representations learnt from the corpus. We use development data to estimate the optimal value of \lambda, as described later in Section 4.

The overall objective function given by (4) is non-convex w.r.t. the four variables \mathbf{w}_i, \tilde{\mathbf{w}}_j, b_i, and \tilde{b}_j. However, if we fix three of those variables, then (4) becomes convex in the remaining variable. We use an alternating optimization approach, where we first randomly initialize all the parameters and then cycle through the set of variables in a pre-determined order, updating one variable at a time while keeping the others fixed.

The derivatives of the objective function w.r.t. the variables …
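As an illustration of Eqs. (1)-(4), here is a dense, loop-based sketch of the joint objective. It is our own toy version under stated assumptions (real implementations iterate only over the non-zero X_ij entries and use the analytical derivatives), not the authors' code.

```python
import numpy as np

def glove_weight(t, t_max=100, alpha=0.75):          # Eq. (2)
    return (t / t_max) ** alpha if t < t_max else 1.0

def joint_objective(X, R, W, W_tilde, b, b_tilde, lam):
    """Eq. (4): J = J_C + lambda * J_S, for small dense toy inputs.

    X        : (V, V) co-occurrence counts
    R        : (V, V) binary matrix; R[i, j] = 1 if the lexicon links words i and j
    W, W_tilde : (V, d) target / context embeddings; b, b_tilde : (V,) biases
    """
    J_C, J_S = 0.0, 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:                           # corpus term, Eq. (1)
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                J_C += 0.5 * glove_weight(X[i, j]) * err ** 2
            if R[i, j]:                               # lexicon regularizer, Eq. (3)
                J_S += 0.5 * np.sum((W[i] - W_tilde[j]) ** 2)
    return J_C + lam * J_S
```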


Page 37: 深層意味表現学習 (Deep Semantic Representations)

Measuring semantic similarity between words

37

Table 1: Performance of the proposed method with different semantic relation types.

Method            RG      MC      RW      SCWS    MEN     sem    syn    total  SemEval
corpus only       0.7523  0.6398  0.2708  0.460   0.6933  61.49  66.00  63.95  37.98
Synonyms          0.7866  0.7019  0.2731  0.4705  0.7090  61.46  69.33  65.76  38.65
Antonyms          0.7694  0.6417  0.2730  0.4644  0.6973  61.64  66.66  64.38  38.01
Hypernyms         0.7759  0.6713  0.2638  0.4554  0.6987  61.22  68.89  65.41  38.21
Hyponyms          0.7660  0.6324  0.2655  0.4570  0.6972  61.38  68.28  65.15  38.30
Member-holonyms   0.7681  0.6321  0.2743  0.4604  0.6952  61.69  66.36  64.24  37.95
Member-meronyms   0.7701  0.6223  0.2739  0.4611  0.6963  61.61  66.31  64.17  37.98
Part-holonyms     0.7852  0.6841  0.2732  0.4650  0.7007  61.44  67.34  64.66  38.07
Part-meronyms     0.7786  0.6691  0.2761  0.4679  0.7005  61.66  67.11  64.63  38.29

… syntactic (syn) analogies and 8,869 semantic analogies (sem). The SemEval dataset contains manually ranked word-pairs for 79 relations describing various semantic relation types, such as defective and agent-goal. In total there are 3,218 word-pairs in the SemEval dataset. Given a proportional analogy a : b :: c : d, we compute the cosine similarity between b - a + c and d, where the boldface symbols represent the embeddings of the corresponding words. For the Google dataset, we measure the accuracy of predicting the fourth word d in each proportional analogy from the entire vocabulary. We use the binomial exact test with the Clopper-Pearson confidence interval to test for the statistical significance of the reported accuracy values. For SemEval we use the official evaluation tool [3] to compute MaxDiff scores.

[3] https://sites.google.com/site/semeval2012task2/

In Table 1, we compare the word embeddings learnt by the proposed method for the different semantic relation types in the WordNet. All word embeddings compared in Table 1 are 300 dimensional. We use the WordSim-353 (WS) dataset [Finkelstein et al., 2002] as validation data to find the optimal value of λ for each relation type. Specifically, we minimize (4) for different λ values, and use the learnt word representations to measure the cosine similarity of the word-pairs in the WS dataset. We then select the value of λ that gives the highest Spearman correlation with the human ratings on the WS dataset. This procedure is repeated separately for each semantic relation type R. We found that λ values greater than 10,000 perform consistently well on all relation types. The level of performance obtained if we had used only the corpus for learning word representations (without using a semantic lexicon) is shown in Table 1 as the corpus only baseline. This baseline corresponds to setting λ = 0 in (4).

From Table 1, we see that by incorporating most of the semantic relations found in the WordNet we can improve over the corpus only baseline. In particular, the improvements reported by synonymy over the corpus only baseline are statistically significant on RG, MC, SCWS, MEN, syn, and SemEval. Among the individual semantic relations, synonymy consistently performs well on all benchmarks. Among the other relations, part-holonyms and member-holonyms perform best respectively for predicting semantic similarity between rare words (RW) and for predicting semantic analogies (sem) in the Google dataset. Meronyms and holonyms are particularly effective for predicting semantic similarity between rare words. This result is important because it shows that a semantic lexicon can assist the representation learning of rare words, whose co-occurrences are small even in large corpora [Luong et al., 2013]. The fact that the proposed method could significantly improve performance on this task empirically justifies our proposal for using a semantic lexicon in the word representation learning process. Table 1 also shows that not all relation types are equally useful for learning word representations for a particular task. For example, hypernyms and hyponyms report lower scores compared to the corpus only baseline on predicting semantic similarity for rare (RW) and ambiguous (SCWS) word-pairs.

Table 2: Comparison against prior work.

Method                RG     MEN    sem    syn
RCM                   0.471  0.501  -      29.9
R-NET                 -      -      32.64  43.46
C-NET                 -      -      37.07  40.06
RC-NET                -      -      34.36  44.42
Retro (CBOW)          0.577  0.605  36.65  52.5
Retro (SG)            0.745  0.657  45.29  65.65
Retro (corpus only)   0.786  0.673  61.11  68.14
Proposed (synonyms)   0.787  0.709  61.46  69.33

In Table 2, we compare the proposed method against previously proposed word representation learning methods that use a semantic lexicon: RCM is the relational constrained model proposed by Yu and Dredze [2014]; R-NET, C-NET, and RC-NET are proposed by Xu et al. [2014] and respectively use relational information, categorical information, and their union from the WordNet for learning word representations; and Retro is the retrofitting method proposed by Faruqui et al. [2015]. Details of those methods are described in Section 2. For Retro, we use the publicly available implementation [4] by the original authors, and use word representations pre-trained on the same ukWaC corpus as used by the proposed method. Specifically, we retrofit word vectors produced by CBOW (Retro (CBOW)) and skip-gram (Retro (SG)). Moreover, we retrofit the word vectors learnt by the corpus only baseline (Retro (corpus only)) to compare the proposed joint learning approach against the post-processing approach of retrofitting. Unfortunately, neither the implementations nor the trained word vectors of RCM, R-NET, C-NET, and RC-NET were publicly available. Consequently, we report the published results for those methods. In cases where the result on a particular benchmark dataset …

[4] https://github.com/mfaruqui/retrofitting

Evaluated using the Spearman correlation between human-assigned similarity scores and the similarity scores produced by the algorithm.

Various semantic relations can be used as constraints; the synonymy relation is the most effective.
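The word-similarity benchmarks (RG, MC, RW, SCWS, MEN) are scored by this Spearman protocol. A small sketch, assuming SciPy is available and the dataset is already loaded as word-pairs with human ratings:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, word_vec):
    """Spearman correlation between human similarity ratings and the cosine
    similarities produced by the learnt embeddings."""
    cos = lambda x, y: (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    predicted = [cos(word_vec[a], word_vec[b]) for a, b in pairs]
    rho, _pvalue = spearmanr(predicted, human_scores)
    return rho
```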

Page 38: 深層意味表現学習 (Deep Semantic Representations)

Remaining challenges

• Is predicting word co-occurrences really the optimal task for learning semantic representations?

• We know nothing about the space formed by word meaning-representation vectors.

• We do not even know whether vectors are sufficient in the first place.

• How do we represent the meaning of sentences and documents? (compositional semantics)

• How do we handle multiple languages and ambiguity?

38

Page 39: 深層意味表現学習 (Deep Semantic Representations)

39

御免 ("sorry") - sorry + thanks = 有難う ("thank you")

Danushka Bollegala
www.csc.liv.ac.uk/
[email protected]
@Bollegala