
Modeling Latent Variable Uncertainty for Loss-based Learning

Daphne Koller, Stanford University

Ben Packer, Stanford University

M. Pawan Kumar, École Centrale Paris / École des Ponts ParisTech / INRIA Saclay, Île-de-France

Aim: Accurate learning with weakly supervised data

Training data: input x_i, output y_i

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Object Detection

Input x

Output y = “Deer”

Latent Variable h

(y(f), h(f)) = argmax_{y,h} f(Ψ(x,y,h))
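As a concrete reading of the prediction rule above, here is a minimal Python sketch, assuming the output and latent spaces are small enough to enumerate; f, feature (for Ψ), outputs, and latents are illustrative placeholders, not the paper's implementation.

    def predict(x, f, feature, outputs, latents):
        # (y(f), h(f)) = argmax_{y,h} f(Ψ(x, y, h)), by exhaustive search
        return max(((y, h) for y in outputs for h in latents),
                   key=lambda yh: f(feature(x, yh[0], yh[1])))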

Aim: Accurate learning with weakly supervised data

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Prediction

Function f : Ψ(x,y,h) → (-∞, +∞)

Latent Variable h

f* = argmin_f Objective(f)

Aim: Accurate learning with weakly supervised data

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Function f : Ψ(x,y,h) → (-∞, +∞)

Learning

Latent Variable h

Aim: Find a suitable objective function to learn f*

Feature Ψ(x,y,h) (e.g. HOG)

Input x

Output y = “Deer”

Function f : Ψ(x,y,h) → (-∞, +∞)

Learning

Encourages accurate prediction

User-specified criterion for accuracy

f* = argmin_f Objective(f)

Latent Variable h

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Latent SVM

Linear function parameterized by w

Prediction (y(w), h(w)) = argmax_{y,h} wᵀΨ(x,y,h)

Learning min_w Σ_i Δ(y_i, y_i(w), h_i(w))

✔ Loss-based learning

✖ Loss independent of true (unknown) latent variable

✖ Doesn’t model uncertainty in latent variables

User-defined loss
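A minimal sketch of what this learning criterion evaluates, assuming a linear score wᵀΨ with numpy features and enumerable (y, h); feature, loss (for Δ), outputs, and latents are hypothetical names. In practice, Latent SVM minimizes a convex-concave upper bound on this risk rather than the risk itself.

    import numpy as np

    def empirical_risk(w, data, feature, loss, outputs, latents):
        # Σ_i Δ(y_i, y_i(w), h_i(w)), with (y_i(w), h_i(w)) = argmax_{y,h} wᵀΨ(x_i,y,h)
        total = 0.0
        for x_i, y_i in data:
            y_hat, h_hat = max(((y, h) for y in outputs for h in latents),
                               key=lambda yh: w @ feature(x_i, yh[0], yh[1]))
            total += loss(y_i, y_hat, h_hat)  # Δ never sees the true (unknown) h_i
        return total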

Expectation Maximization

Joint probability P_θ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction (y(θ), h(θ)) = argmax_{y,h} P_θ(y,h|x)

Expectation Maximization

Joint probability P_θ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)

Learning max_θ Σ_i log P_θ(y_i|x_i)

Expectation Maximization

Joint probability P_θ(y,h|x) = exp(θᵀΨ(x,y,h)) / Z

Prediction (y(θ), h(θ)) = argmax_{y,h} θᵀΨ(x,y,h)

Learning max_θ Σ_i log Σ_{h_i} P_θ(y_i,h_i|x_i)

✔ Models uncertainty in latent variables

✖ Doesn’t model accuracy of latent variable prediction

✖ No user-defined loss function
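A minimal sketch of the EM learning criterion for this log-linear model, assuming discrete spaces small enough to enumerate the partition function Z; feature, outputs (a list of candidate labels), and latents are illustrative placeholders.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal_likelihood(theta, data, feature, outputs, latents):
        total = 0.0
        for x_i, y_i in data:
            # scores θᵀΨ(x_i, y, h) over the full (y, h) grid
            s = np.array([[theta @ feature(x_i, y, h) for h in latents]
                          for y in outputs])
            log_Z = logsumexp(s)  # log Σ_{y,h} exp(θᵀΨ)
            total += logsumexp(s[outputs.index(y_i)]) - log_Z  # log P_θ(y_i|x_i)
        return total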

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

[Figure: P_θ(h_i|y_i,x_i), a distribution over the latent variables h_i, and P_w(y_i,h_i|x_i), a delta distribution at the prediction (y_i(w),h_i(w))]

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: P_θ(h_i|y_i,x_i) puts all of its mass on a single latent value h_i(w), and P_w(y_i,h_i|x_i) puts its mass on (y_i,h_i(w)): the prediction matches the ground-truth output]

In Practice: Restrictions in the representation power of models

[Figure: P_θ(h_i|y_i,x_i) spreads its mass over several latent values, and the prediction (y_i(w),h_i(w)) under P_w(y_i,h_i|x_i) can differ from the ground truth (y_i,h_i)]

Our Framework: Minimize the dissimilarity between the two distributions

[Figure: P_θ(h_i|y_i,x_i) over h_i versus the delta distribution P_w(y_i,h_i|x_i) at (y_i(w),h_i(w))]

User-defined dissimilarity measure Δ

Our Framework: Minimize Rao’s Dissimilarity Coefficient

H_i(w,θ) = Σ_h Δ(y_i,h, y_i(w),h_i(w)) P_θ(h|y_i,x_i)

H_i(θ,θ) = Σ_{h,h'} Δ(y_i,h, y_i,h') P_θ(h|y_i,x_i) P_θ(h'|y_i,x_i)

The coefficient’s remaining term, (1-β) Δ(y_i(w),h_i(w), y_i(w),h_i(w)), is zero, since the loss of the prediction against itself vanishes.

min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)
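A minimal numerical sketch of the per-example objective, assuming the latent space is small enough to tabulate P_θ(h|y_i,x_i) as a vector p; delta_pred[h] = Δ(y_i,h, y_i(w),h_i(w)) and delta_pair[h,h'] = Δ(y_i,h, y_i,h') are precomputed loss tables (hypothetical names).

    import numpy as np

    def dissimilarity(p, delta_pred, delta_pair, beta):
        # H_i(w,θ) - β H_i(θ,θ), with p[h] = P_θ(h|y_i,x_i)
        H_w_theta = delta_pred @ p          # Σ_h Δ(y_i,h, y_i(w),h_i(w)) p[h]
        H_theta_theta = p @ delta_pair @ p  # Σ_{h,h'} Δ(y_i,h, y_i,h') p[h] p[h']
        return H_w_theta - beta * H_theta_theta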

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Optimization

min_{w,θ} Σ_i H_i(w,θ) - β H_i(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:

Fix w and optimize θ

Fix θ and optimize w

End
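A skeleton of this alternating scheme; optimize_theta and optimize_w stand in for the two subproblems described on the following slides, and the stopping rule is an assumption.

    def learn(objective, optimize_theta, optimize_w, w0, theta0,
              tol=1e-6, max_iters=100):
        w, theta = w0, theta0
        prev = float("inf")
        for _ in range(max_iters):
            theta = optimize_theta(w, theta)  # fix w and optimize θ
            w = optimize_w(w, theta)          # fix θ and optimize w
            cur = objective(w, theta)
            if prev - cur < tol:              # block updates only decrease the objective
                break
            prev = cur
        return w, theta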

Optimization of θ

min_θ Σ_i Σ_h Δ(y_i,h, y_i(w),h_i(w)) P_θ(h|y_i,x_i) - β H_i(θ,θ)

Case I: y_i(w) = y_i

[Figure: P_θ(h_i|y_i,x_i) over h_i, with the predicted latent value h_i(w) marked]

Case II: y_i(w) ≠ y_i

[Figure: P_θ(h_i|y_i,x_i) over h_i]

Stochastic subgradient descent
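As a sketch of the gradient such a step would use, here is the exact per-example gradient under a log-linear model p_θ(h) ∝ exp(θᵀψ(h)) with an enumerable latent space; Psi stacks the features ψ(h) row-wise, and delta_pred, delta_pair are the loss tables from the earlier sketch (all names hypothetical, delta_pair assumed symmetric).

    import numpy as np

    def p_theta(theta, Psi):
        s = Psi @ theta
        p = np.exp(s - s.max())  # softmax over latent values
        return p / p.sum()

    def theta_gradient(theta, Psi, delta_pred, delta_pair, beta):
        # ∇_θ [ Σ_h Δ_pred(h) p(h) - β Σ_{h,h'} Δ_pair(h,h') p(h) p(h') ]
        p = p_theta(theta, Psi)
        centered = Psi - p @ Psi  # ψ(h) - E_p[ψ], since ∇_θ p(h) = p(h)(ψ(h) - E_p[ψ])
        coef = delta_pred - 2.0 * beta * (delta_pair @ p)
        return (p * coef) @ centered

A descent step is then theta -= step_size * theta_gradient(...), applied per randomly drawn example.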

Optimization of w

min_w Σ_i Σ_h Δ(y_i,h, y_i(w),h_i(w)) P_θ(h|y_i,x_i)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

Observation: When Δ is independent of the true h, our framework is equivalent to Latent SVM

Concave-Convex Procedure (CCCP)
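CCCP minimizes a difference of convex functions u(w) - v(w) by repeatedly linearizing the concave part -v; a generic sketch under that reading (the paper's actual convex subproblem, a structural-SVM-style program, is not reproduced here):

    import numpy as np
    from scipy.optimize import minimize

    def cccp(u, grad_v, w0, iters=20):
        # minimize u(w) - v(w), with u and v both convex
        w = np.asarray(w0, dtype=float)
        for _ in range(iters):
            g = grad_v(w)  # linearize v at the current iterate
            w = minimize(lambda x: u(x) - g @ x, w).x  # convex upper-bound subproblem
        return w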

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Training data: input x_i, output y_i

Results – 0/1 Loss

[Bar chart: average test loss per fold (folds 1-5), LSVM vs. our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (folds 1-5), LSVM vs. our method]

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Training data: input x_i, output y_i

Results – 0/1 Loss

[Bar chart: average test loss per fold (folds 1-5), LSVM vs. our method]

Statistically Significant

Results – Overlap Loss

[Bar chart: average test loss per fold (folds 1-5), LSVM vs. our method]

Statistically Significant

Outline

• Previous Methods

• Our Framework

• Optimization

• Results

• Ongoing and Future Work
