Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure (final)
TRANSCRIPT
Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure
Ruslan Salakhutdinov and Geoffrey E. Hinton. AISTATS '07, San Juan, Puerto Rico.
Presenter:
WooSung Choi ([email protected])
DataKnow. Lab, Korea Univ.
Background
(k-) Nearest Neighbor Query
kNN (k-Nearest Neighbor) Query
kNN (k-Nearest Neighbor) Classification
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics. 2007.
NN    Class
1-NN  6
2-NN  6
3-NN  6
4-NN  6
5-NN  0
<Result of 5-NN>
Result of 5-NN Classification: 6 (80%)
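The 5-NN vote above (four neighbours of class 6, one of class 0, giving 6 at 80% confidence) can be sketched as a plain majority-vote classifier. The 2-D points below are hypothetical stand-ins, not data from the slides.

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k):
    """Classify `query` by majority vote among its k nearest neighbours
    (Euclidean distance), as in the 5-NN example above."""
    dists = sorted((math.dist(query, p), lbl) for p, lbl in zip(points, labels))
    votes = Counter(lbl for _, lbl in dists[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical 2-D points: four neighbours of class 6, one far point of class 0.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
labels = [6, 6, 6, 6, 0]
print(knn_classify((1.5, 1.5), points, labels, k=5))  # (6, 0.8)
```

With k=5 all five points vote, reproducing the slide's result: class 6 with 4/5 = 80% of the votes.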
Motivating Example
• MNIST
  Dimensionality: 28 x 28 = 784
  50,000 training images, 10,000 test images
• Error: 2.77%
• Query response: 108 ms
Reality Check
• Curse of dimensionality: poor performance when the number of dimensions is high
  [Qin Lv et al., Image Similarity Search with Compact Data Structures @ CIKM '04]
  [Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces @ VLDB '98]
Locality Sensitive Hashing, Data Sensitive Hashing
Method                       Curse of dimensionality   Recall   Considers data distribution   Based on
Scan                         X (none)                  1        △                             N/A
RTree-based solution         O (strong)                1        O                             index: tree
Locality Sensitive Hashing   △ (less)                  -        X                             hashing + mathematics
Data Sensitive Hashing       △ (less)                  -        O                             hashing + machine learning
Abstract
• How to pre-train and fine-tune a multilayer neural network (MNN)
  to learn a nonlinear transformation from the input space to a low-dimensional feature space
  where kNN classification performs well
• Can be improved using unlabeled data
Introduction
Notation
• Transformation to low-dimensional feature space
  Input vectors: x_a ∈ R^D
  Transformation function: f(·|W), parameterized by weights W
  Output vectors: f(x_a|W)
• Similarity measure
  Input vectors: x_a, x_b → Output: d_ab = ||f(x_a|W) − f(x_b|W)||²
Objective (informal)
• Goal: learn W so that kNN classification in the low-dimensional feature space performs well

Objective (formal)
• Goal: maximizing the expected number of correctly classified points on the training data,
  O_NCA = Σ_a Σ_{b: c_b = c_a} p_ab
Related Work: Linear Transformation
• Linear Transformation [8,9,18]
  Weakness 1: limited number of parameters
    e.g. mapping 784-dim input to 30 dims, the matrix must be 30 by 784 (23,520 parameters);
    in this paper: 785·500 + 501·500 + 501·2000 + 2001·30 parameters
  Weakness 2: cannot model higher-order correlations
• Deep Autoencoder [14], DBN [12]
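The parameter counts can be checked quickly. The arithmetic below assumes the "+1" in each product on the slide is a bias unit, and that the deep net is the 784-500-500-2000-30 stack described in the overview.

```python
# Linear map R^784 -> R^30: a 30 x 784 weight matrix.
linear = 30 * 784
print(linear)  # 23520

# Deep net 784-500-500-2000-30; each layer has (fan_in + 1) * fan_out
# parameters, the +1 being the bias unit (assumption).
layers = [(784, 500), (500, 500), (500, 2000), (2000, 30)]
deep = sum((fan_in + 1) * fan_out for fan_in, fan_out in layers)
print(deep)  # 1705030
```

So the nonlinear network has roughly 70 times more parameters than the linear map, which is the point of the "limited number of parameters" weakness.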
In this paper
• Non-linear transformation
• Overview
  Pre-training: similar to [12,14], a stack of RBMs
    RBM1: 784-500, RBM2: 500-500, RBM3: 500-2000, RBM4: 2000-30
  Fine-tuning: backpropagation to maximize the objective function,
    i.e. the expected number of correctly classified points on the training data
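The greedy pre-training step can be sketched as follows: train one Bernoulli RBM with one-step contrastive divergence (CD-1), push the data through it, and train the next RBM on the resulting hidden activations. This is a minimal sketch, not the paper's implementation: learning rate, epoch count, and the random stand-in data are all assumptions, and mean-field probabilities are used for the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=1, batch=100):
    """One-step contrastive divergence (CD-1) for a Bernoulli RBM."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            h0 = sigmoid(v0 @ W + b_h)                 # positive phase
            h_sample = (rng.random(h0.shape) < h0) * 1.0
            v1 = sigmoid(h_sample @ W.T + b_v)         # mean-field reconstruction
            h1 = sigmoid(v1 @ W + b_h)                 # negative phase
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_v += lr * (v0 - v1).mean(axis=0)
            b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

# Greedy stack 784-500-500-2000-30 as in the overview
# (random binary data and one epoch here, just so the sketch runs quickly).
sizes = [784, 500, 500, 2000, 30]
x = (rng.random((200, 784)) < 0.5) * 1.0   # stand-in for MNIST
weights = []
for n_hid in sizes[1:]:
    W, b_h = train_rbm(x, n_hid)
    weights.append((W, b_h))
    x = sigmoid(x @ W + b_h)               # hidden activations feed the next RBM
```

After this loop, `weights` holds the four pre-trained layers used to initialize the encoder before NCA fine-tuning.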
Objective (formal)
• Goal: maximizing O_NCA = Σ_a Σ_{b: c_b = c_a} p_ab
2. Learning Nonlinear NCA
Neighbourhood Component Analysis
Notation
Symbol            Definition
a, b              index
x_a               training vector (d-dimensional data)
c_a ∈ {1,2,…,C}   label of training vector x_a
N                 number of labeled training cases
f(x_a|W)          output of the multilayer neural network parameterized by W
d_ab              Euclidean distance metric, d_ab = ||f(x_a|W) − f(x_b|W)||²
p_ab              the probability that point a selects one of its neighbours b in the transformed feature space
Example: five points with labels (0, 1, 5, 7, 7), where a is the first point.
e^{−d_ab}: (1, 0.3678, 0.0497, 0.0002, 0.0002) — the self term e^{−d_aa} = 1 is excluded
p_ab:      (0, 0.88, 0.11, 0, 0)
p_ab = 0.3678 / (0.3678 + 0.0497 + 0.0002 + 0.0002) ≈ 0.88
Notation
Symbol       Definition
p_ab         the probability that point a selects one of its neighbours b in the transformed feature space
p(c_a = k)   the probability that point a belongs to class k: p(c_a = k) = Σ_{b: c_b = k} p_ab
O_NCA        the expected number of correctly classified points on the training data

Example: labels (N/A, 3, 3, 2, 1) with p_ab = (0, 0.88, 0.11, 0, 0) give
p(c_a = 3) = 0.88 + 0.11 = 0.99, p(c_a = 2) ≈ 0, p(c_a = 1) ≈ 0
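The two worked examples above can be reproduced in a few lines: p_ab is a softmax over negative squared distances, and p(c_a = k) sums p_ab over neighbours of class k. The four e^{−d} values are the ones on the slide; everything else follows from them.

```python
import numpy as np

# exp(-d_ab) for point a's four neighbours, labelled 3, 3, 2, 1
# (values taken from the slide's worked example).
exp_neg_d = np.array([0.3678, 0.0497, 0.0002, 0.0002])
labels = np.array([3, 3, 2, 1])

p_ab = exp_neg_d / exp_neg_d.sum()   # softmax over negative squared distances
# p_ab[0] = 0.3678 / 0.4179 ≈ 0.88, matching the slide

# p(c_a = k) = sum of p_ab over neighbours b with label k
p_class = {k: p_ab[labels == k].sum() for k in (1, 2, 3)}
# p(c_a = 3) ≈ 0.999 (the slide truncates to 0.99); classes 1 and 2 get almost nothing
```

Maximizing O_NCA means pushing each point's p(c_a = correct class) toward 1, exactly as in this example.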
Learning Rule
• Backpropagation to maximize O_NCA
• Derivation: writing y_a = f(x_a|W) and O_a = Σ_{b: c_b = c_a} p_ab,
  ∂O_NCA/∂y_a = 2 Σ_{z≠a} [ p_az (O_a − [c_z = c_a]) + p_za (O_z − [c_a = c_z]) ] (y_a − y_z)
  where [·] is 1 if the condition holds and 0 otherwise; then
  ∂O_NCA/∂W = Σ_a (∂O_NCA/∂y_a) · (∂y_a/∂W)
• Standard backpropagation
  Output layer: receives ∂O_NCA/∂y_a as its error signal
  Inner layers: chain rule through the network, as usual
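The gradient with respect to the embedded points can be sanity-checked numerically. The sketch below (an assumption-level reconstruction, using squared Euclidean distances in the exponent as in the notation) computes O_NCA and the analytic gradient, then compares against central finite differences on random data.

```python
import numpy as np

def nca_objective_and_grad(y, labels):
    """O_NCA = sum_a sum_{b: c_b = c_a} p_ab and its gradient w.r.t. the
    embedded points y (n x d), with p_ab = softmax over -||y_a - y_b||^2."""
    n = len(y)
    d2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared distances
    e = np.exp(-d2)
    np.fill_diagonal(e, 0.0)                              # b != a
    p = e / e.sum(axis=1, keepdims=True)
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    O_a = (p * same).sum(axis=1)                          # prob. of correct class
    grad = np.zeros_like(y)
    for m in range(n):
        # coefficient of (y_m - y_z): p_mz (O_m - [same]) + p_zm (O_z - [same])
        coef = p[m] * (O_a[m] - same[m]) + p[:, m] * (O_a - same[:, m])
        grad[m] = 2.0 * (coef[:, None] * (y[m] - y)).sum(axis=0)
    return O_a.sum(), grad

# Finite-difference check on random data (hypothetical, not from the slides).
rng = np.random.default_rng(0)
y = rng.standard_normal((6, 2))
labels = np.array([0, 0, 0, 1, 1, 1])
O, g = nca_objective_and_grad(y, labels)
num = np.zeros_like(y)
eps = 1e-6
for i in range(y.shape[0]):
    for j in range(y.shape[1]):
        yp = y.copy(); yp[i, j] += eps
        ym = y.copy(); ym[i, j] -= eps
        num[i, j] = (nca_objective_and_grad(yp, labels)[0]
                     - nca_objective_and_grad(ym, labels)[0]) / (2 * eps)
print(np.abs(g - num).max())   # small: analytic gradient matches finite differences
```

The signs behave as the objective demands: gradient ascent pulls same-class points together (the indicator makes their coefficient negative) and pushes different-class points apart.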
Details
• Pre-training
  Mini-batches, each containing 100 cases; epochs: 50
• Fine-tuning
  Method: conjugate gradients on larger mini-batches of 5,000, with three line searches performed for each mini-batch
  Epochs: 50
• Dataset
  60,000 training images, 10,000 for validation
Experiment
Result
Appendix
Regularized Nonlinear NCA
Application
• Learn compact binary codes that allow efficient retrieval
  Gist descriptor + Locality Sensitive Hashing scheme + non-linear NCA
  Dataset: LabelMe, 22,000 images; labels: {human, woman, man, etc.}
Torralba, Antonio, Rob Fergus, and Yair Weiss. "Small codes and large image databases for recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2008.
http://labelme2.csail.mit.edu/Release3.0/browserTools/php/publications.php
Neural Network
Toy Example: AND gate, XOR gate
AND gate
[Figure: a single neuron with inputs x, y and bias input 1, weights w0, w1, w2, a sigmoid output sigm(z), and a truth table with columns x, y, t]
z = x·w0 + y·w1 + 1·w2
sigm(x) = 1 / (1 + e^{−x})
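With suitable weights, the single neuron above computes AND. The weights w0 = w1 = 20, w2 = −30 are hand-picked for illustration (they do not come from the slides); any weights that keep z well below 0 except when both inputs are 1 would do.

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def and_gate(x, y, w0=20.0, w1=20.0, w2=-30.0):
    # z = x*w0 + y*w1 + 1*w2, as on the slide; weights are hand-picked
    return sigm(x * w0 + y * w1 + w2)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, round(and_gate(x, y)))   # truth table: only (1, 1) -> 1
```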
XOR gate
[Figure: a two-layer network with a truth table (columns x, y, t).
 Hidden unit 1: z1 = x·w00 + y·w01 + 1·w02, output sigm(z1)
 Hidden unit 2: z2 = x·w10 + y·w11 + 1·w12, output sigm(z2)
 Output unit:   z3 = sigm(z1)·w20 + sigm(z2)·w21 + 1·w22, output sigm(z3)]
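XOR is not linearly separable, which is why the slide's network needs a hidden layer. One classic hand-weighted construction (the weights are illustrative assumptions, not from the slides) makes the hidden units compute OR and NAND, and the output unit AND them together:

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_gate(x, y):
    # hidden unit z1 ~ OR, hidden unit z2 ~ NAND; output z3 ~ AND(z1, z2)
    h1 = sigm(20 * x + 20 * y - 10)      # OR
    h2 = sigm(-20 * x - 20 * y + 30)     # NAND
    return sigm(20 * h1 + 20 * h2 - 30)  # AND

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, round(xor_gate(x, y)))   # 0, 1, 1, 0
```

Training would find some such weights by backpropagation; this fixed-weight version just shows that the two-layer architecture on the slide can represent XOR at all.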
Implementations
• Toy example: training algorithm for logic gates
• NLNCA for MNIST