Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, Ido Dagan
arXivTimes, 2017/1/24
Presenter: Hironsan / Twitter: Hironsan13 / GitHub: Hironsan
Transactions of the Association for Computational Linguistics, 2015
Abstract
• This paper tests the idea that much of word embeddings' performance gain comes from hyperparameter tuning rather than from the algorithms themselves
• Four methods are evaluated across many combinations of hyperparameters
• On word similarity tasks, count-based methods perform on par with prediction-based methods
• On analogy tasks, prediction-based methods come out ahead
Word Similarity & Relatedness
• How similar is pizza to pasta?
• How related is pizza to Italy?
• Representing words as vectors lets us compute similarity between them
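To make that concrete, here is a minimal Python sketch (mine, not the slides'): cosine similarity between word vectors, using toy made-up vectors in place of trained ones.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors; real ones would come from a trained model.
pizza = np.array([0.9, 0.1, 0.3])
pasta = np.array([0.8, 0.2, 0.25])
italy = np.array([0.4, 0.7, 0.3])
print(cosine(pizza, pasta))  # ~0.99: very similar
print(cosine(pizza, italy))  # ~0.63: less similar, but still related
```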
Approaches for Representing Words
Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
Underlying Theory: The Distributional Hypothesis (Harris, ’54)
“Similar words appear in similar contexts.”
Distributional Semantics (Count)
• Used since the ’90s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD
Approaches for Representing Words
Both approaches:
• rely on the distributional hypothesis
• use the same data
• are mathematically related
• “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014)

So which approach is better?
• “Don’t Count, Predict!” (Baroni et al., ACL 2014)
• As the title says, it argues for the predict-based approach. But is that really true?
The Contributions of Word Embeddings
Novel Algorithms (objective + training method)
• Skip-Gram + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …
New Hyperparameters (preprocessing, smoothing, etc.)
• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …

Which of the two is more important for improving performance?
Contributions
1) Identifying the existence of new hyperparameters
• Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
• Must understand the mathematical relation between algorithms
3) Comparing algorithms across all hyperparameter settings
• Over 5,000 experiments
Comparing Methods
The following four methods are compared while varying their hyperparameters:

Count-based:
• PPMI (Positive Pointwise Mutual Information)
• SVD (Singular Value Decomposition)

Prediction-based:
• SGNS (Skip-Gram with Negative Sampling)
• GloVe (Global Vectors)
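To ground the count-based side, here is a minimal Python sketch (my assumption of the standard construction, not the authors' code) of computing PPMI from a word-context co-occurrence count matrix:

```python
import numpy as np

def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive PMI from a word-context co-occurrence count matrix."""
    total = counts.sum()
    p_wc = counts / total                            # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total  # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total  # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.where(p_wc > 0, np.log(p_wc / (p_w * p_c)), 0.0)
    return np.maximum(pmi, 0.0)                      # keep only positive PMI
```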
Hyperparameters

Preprocessing Hyperparameters
• Window Size (win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)

Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)

Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)
Dynamic Context Windows (dyn)
• Weight context words by their distance from the target word
• word2vec samples the actual window size uniformly from 1..L, which weights a context at distance d by (L − d + 1)/L
• GloVe instead weights a context at distance d harmonically, by 1/d
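A minimal Python sketch of the two weighting schemes described above (my reconstruction of the idea, not code from the paper):

```python
def word2vec_weight(d: int, L: int) -> float:
    """word2vec's dynamic window: sampling the actual window size
    uniformly from 1..L weights a context at distance d by (L - d + 1) / L."""
    return (L - d + 1) / L

def glove_weight(d: int) -> float:
    """GloVe's harmonic weighting: a context at distance d counts as 1/d."""
    return 1 / d

# For L = 5, distances 1..5 get word2vec weights 1.0, 0.8, 0.6, 0.4, 0.2
# and GloVe weights 1.0, 0.5, 0.33, 0.25, 0.2.
```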
Subsampling (sub)
• Subsampling removes very frequent words
• A word whose frequency f exceeds a threshold t is removed with probability p = 1 − √(t / f)
• f is the word's relative frequency in the corpus
• Typical values for t are 10⁻³ to 10⁻⁵
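A minimal Python sketch of this procedure, assuming `tokens` is a flat list of corpus tokens (the function name and setup are mine):

```python
import random
from collections import Counter

def subsample(tokens, t=1e-4):
    """Remove each occurrence of word w with probability
    1 - sqrt(t / f(w)), where f(w) is w's relative frequency.
    Words with f(w) <= t are never removed."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_remove = max(0.0, 1.0 - (t / f) ** 0.5)
        if random.random() >= p_remove:
            kept.append(w)
    return kept
```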
Deleting Rare Words (del)
• Remove low-frequency words
• They are deleted before the context windows are created, so the remaining words end up closer together
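A companion sketch for rare-word deletion (again with hypothetical names), applied before window extraction:

```python
from collections import Counter

def delete_rare(tokens, min_count=5):
    """Drop words occurring fewer than min_count times, *before*
    context windows are built; surviving words move closer together."""
    counts = Counter(tokens)
    return [w for w in tokens if counts[w] >= min_count]
```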
Shifted PMI (neg)
• Shift PMI by the negative-sampling parameter k of SGNS (Levy and Goldberg, 2014):
SPPMI(w, c) = max(PMI(w, c) − log k, 0)
• This transfers SGNS's negative-sampling parameter k to PMI
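As a one-line Python sketch, assuming `pmi` is a raw (unclipped) PMI matrix such as the one computed inside the earlier `ppmi` sketch:

```python
import numpy as np

def shifted_ppmi(pmi: np.ndarray, k: int = 5) -> np.ndarray:
    """SPPMI(w, c) = max(PMI(w, c) - log k, 0); k plays the role of
    the number of negative samples in SGNS."""
    return np.maximum(pmi - np.log(k), 0.0)
```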
Context Distribution Smoothing (cds)
• Negative sampling draws its noise words from a smoothed version of the unigram distribution P, raised to the power 0.75 and renormalized
• Apply the same smoothing to PMI: compute the context marginal from counts raised to α = 0.75
• This dampens the influence of rare words and is quite effective
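A Python sketch of PPMI with this smoothing folded in (setup and names are my assumptions; `counts` is a word-context co-occurrence count matrix):

```python
import numpy as np

def ppmi_cds(counts: np.ndarray, alpha: float = 0.75) -> np.ndarray:
    """PPMI whose context marginal uses #(c)^alpha instead of #(c),
    damping the PMI of rare contexts."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    c_alpha = counts.sum(axis=0) ** alpha
    p_c = (c_alpha / c_alpha.sum())[None, :]  # smoothed context marginal
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.where(p_wc > 0, np.log(p_wc / (p_w * p_c)), 0.0)
    return np.maximum(pmi, 0.0)
```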
Adding Context Vectors (w+c)
• SGNS produces word vectors w
• SGNS also produces context vectors c
• So do GloVe and SVD
• Represent each word as w + c instead of w alone (Pennington et al., 2014)
• Until now this had been applied only to GloVe
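The operation itself is a single line; a sketch assuming hypothetical matrices W and C holding word and context vectors with matching row order:

```python
import numpy as np

def add_context_vectors(W: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Represent each word by w + c; W and C are |V| x d matrices
    whose i-th rows belong to the same vocabulary item."""
    return W + C
```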
Eigenvalue Weighting (eig)
• Change how the singular values obtained from the SVD are weighted
• The SVD yields W and C, but the traditional factorization (W = UΣ, C = V) is not necessarily optimal for similarity tasks
• A symmetric factorization (W = UΣ^0.5, C = VΣ^0.5) suits semantic tasks better
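A Python sketch of truncated SVD with a tunable singular-value exponent p (names and the exponent split between W and C are my assumptions): p = 1 recovers the classic W = U_d Σ_d, while p = 0.5 gives the symmetric variant:

```python
import numpy as np

def svd_embeddings(M: np.ndarray, d: int = 300, p: float = 0.5):
    """Truncated SVD of a (S)PPMI matrix M with singular values
    weighted by an exponent p; p = 0.5 is the symmetric variant."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_d, s_d, V_d = U[:, :d], s[:d], Vt[:d].T
    W = U_d * s_d ** p        # word vectors
    C = V_d * s_d ** (1 - p)  # context vectors
    return W, C
```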
Vector Normalization (nrm)
• Normalize the word vectors that come out of each method
• Scale each to a unit vector (L2 normalization)
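A sketch of row-wise L2 normalization; once rows are unit length, a dot product equals cosine similarity:

```python
import numpy as np

def normalize_rows(W: np.ndarray) -> np.ndarray:
    """Scale every word vector (row of W) to unit L2 norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.where(norms == 0, 1.0, norms)  # guard all-zero rows
```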
Transferable Hyperparameters
• By mapping each hyperparameter onto every algorithm, the experiments compare the methods under conditions that are as equal as possible
Comparing Algorithms
Systematic Experiments
• 9 Hyperparameters
• 4 Word Representation Algorithms
  • PPMI (Sparse & Explicit)
  • SVD(PPMI)
  • SGNS
  • GloVe
• 8 Benchmarks
  • 6 Word Similarity Tasks
  • 2 Analogy Tasks
Hyperparameter Settings

Classic Vanilla Setting (commonly used for baselines)
• Preprocessing: <None>
• Postprocessing: <None>
• Association Metric: Vanilla PMI/PPMI

Recommended word2vec Setting (tuned for SGNS)
• Preprocessing: Dynamic Context Window, Subsampling
• Postprocessing: <None>
• Association Metric: Shifted PMI/PPMI, Context Distribution Smoothing
Experiments: Hyperparameter Tuning

[Figure: bar chart of Spearman's correlation on the WordSim-353 Relatedness benchmark for PPMI (sparse vectors) and SGNS (embeddings) under the vanilla, recommended word2vec, and optimal hyperparameter settings; scores range from 0.54 to 0.697, and both methods improve markedly once hyperparameters are tuned.]
Overall Results
• Hyperparameter tuning is sometimes more important than the choice of algorithm
• On word similarity tasks, count-based methods performed on par with prediction-based methods
• On analogy tasks, prediction-based methods came out ahead
Conclusions
Conclusions: Distributional Similarity
Two things influence word embedding performance:
• Novel Algorithms
• New Hyperparameters

What really matters for performance?
• Hyperparameters (mostly)
• Algorithms
References
• Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. NIPS.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv.
• Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP.
If you are interested in Machine Learning, Natural Language Processing, or Computer Vision, follow arXivTimes @ https://twitter.com/arxivtimes