Contrastive Divergence Learning Geoffrey E. Hinton A discussion led by Oliver Woodford


Page 1:

Contrastive Divergence Learning

Geoffrey E. Hinton

A discussion led by Oliver Woodford

Page 2: Contents

• Maximum Likelihood learning

• Gradient descent-based approach

• Markov Chain Monte Carlo sampling

• Contrastive Divergence

• Further topics for discussion:
– Result biasing of Contrastive Divergence
– Product of Experts
– High-dimensional data considerations

Page 3: Maximum Likelihood learning

• Given:
– A probability model: p(x;\Theta) = \frac{1}{Z(\Theta)} f(x;\Theta)
• \Theta – the model parameters
• Z(\Theta) – the partition function, defined as Z(\Theta) = \int f(x;\Theta)\,dx
– Training data: X = \{x_k\}_{k=1}^{K}

• Aim:
– Find the \Theta that maximizes the likelihood of the training data:
p(X;\Theta) = \prod_{k=1}^{K} \frac{1}{Z(\Theta)} f(x_k;\Theta)
– Or, equivalently, that minimizes the negative log of the likelihood:
E(X;\Theta) = K \log Z(\Theta) - \sum_{k=1}^{K} \log f(x_k;\Theta)

• Toy example with a known result (sketched in code below):
f(x;\Theta) = \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \Theta = \{\mu,\sigma\}, \qquad Z(\Theta) = \sigma\sqrt{2\pi}
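To make the toy example concrete, here is a minimal Python sketch (assuming NumPy; the function names are mine, not from the slides) of the unnormalized model f, the partition function Z, and the negative log-likelihood E defined above:

import numpy as np

def f(x, mu, sigma):
    # Unnormalized Gaussian: f(x; Theta) = exp(-(x - mu)^2 / (2 sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def Z(sigma):
    # Partition function: Z(Theta) = sigma * sqrt(2 * pi)
    return sigma * np.sqrt(2 * np.pi)

def E(X, mu, sigma):
    # Negative log-likelihood: E(X; Theta) = K log Z(Theta) - sum_k log f(x_k; Theta)
    return len(X) * np.log(Z(sigma)) - np.log(f(X, mu, sigma)).sum()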

Page 4: Maximum Likelihood learning

• Method:
– \frac{\partial E(X;\Theta)}{\partial \Theta} = 0 at the minimum
– Let's assume that there is no linear solution…

The gradient (scaled by 1/K, which does not move the minimum) is

\frac{\partial E(X;\Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \frac{1}{K}\sum_{i=1}^{K} \frac{\partial \log f(x_i;\Theta)}{\partial \Theta} = \frac{\partial \log Z(\Theta)}{\partial \Theta} - \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_X

where \langle \cdot \rangle_X is the expectation of \cdot given the data distribution X.

For the toy example:

\frac{\partial E(X;\Theta)}{\partial \Theta} = \frac{\partial \log(\sigma\sqrt{2\pi})}{\partial \Theta} + \left\langle \frac{\partial}{\partial \Theta}\,\frac{(x-\mu)^2}{2\sigma^2} \right\rangle_X

\frac{\partial E(X;\Theta)}{\partial \mu} = -\left\langle \frac{x-\mu}{\sigma^2} \right\rangle_X = 0 \;\Rightarrow\; \mu = \langle x \rangle_X

\frac{\partial E(X;\Theta)}{\partial \sigma} = \frac{1}{\sigma} - \left\langle \frac{(x-\mu)^2}{\sigma^3} \right\rangle_X = 0 \;\Rightarrow\; \sigma = \sqrt{\langle (x-\mu)^2 \rangle_X}

Both closed-form results are checked numerically below.
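A quick numerical check of these known results (a sketch assuming NumPy; the data and all variable names are mine) confirms that \mu = \langle x \rangle_X and \sigma = \sqrt{\langle (x-\mu)^2 \rangle_X} zero both gradients:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic training data

mu_ml = X.mean()                                  # mu = <x>_X
sigma_ml = np.sqrt(((X - mu_ml) ** 2).mean())     # sigma = sqrt(<(x - mu)^2>_X)

# Analytic gradients from this slide; both are ~0 at the ML solution.
dE_dmu = -((X - mu_ml) / sigma_ml ** 2).mean()
dE_dsigma = 1 / sigma_ml - (((X - mu_ml) ** 2) / sigma_ml ** 3).mean()
print(dE_dmu, dE_dsigma)   # both approximately 0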

Page 5: Gradient descent-based approach

– Move a fixed step size, \eta, in the direction of steepest gradient. (Not a line search; see why later.)
– This gives the following parameter update equation (run on the toy example in the sketch below):

\Theta_{t+1} = \Theta_t - \eta\,\frac{\partial E(X;\Theta_t)}{\partial \Theta_t} = \Theta_t - \eta \left( \frac{\partial \log Z(\Theta_t)}{\partial \Theta_t} - \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_X \right)
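For the toy Gaussian the gradient is available in closed form, so the update can be run directly. A minimal sketch (the step size and iteration count are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.5, size=1000)   # synthetic training data

mu, sigma, eta = 0.0, 1.0, 0.1
for t in range(500):
    # Per-datum gradients of E(X; Theta) from the previous slide
    dE_dmu = -((X - mu) / sigma ** 2).mean()
    dE_dsigma = 1 / sigma - (((X - mu) ** 2) / sigma ** 3).mean()
    # Fixed-step steepest descent
    mu -= eta * dE_dmu
    sigma -= eta * dE_dsigma
print(mu, sigma)   # approaches the ML estimates <x>_X and sqrt(<(x - mu)^2>_X)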

Page 6: Gradient descent-based approach

– Recall Z(\Theta) = \int f(x;\Theta)\,dx. Sometimes this integral is algebraically intractable.
– This means we can calculate neither E(X;\Theta) nor \frac{\partial \log Z(\Theta)}{\partial \Theta} (hence no line search).
– However, with some clever substitution:

\frac{\partial \log Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)}\frac{\partial Z(\Theta)}{\partial \Theta} = \frac{1}{Z(\Theta)}\frac{\partial}{\partial \Theta}\int f(x;\Theta)\,dx = \frac{1}{Z(\Theta)}\int \frac{\partial f(x;\Theta)}{\partial \Theta}\,dx
= \frac{1}{Z(\Theta)}\int f(x;\Theta)\,\frac{\partial \log f(x;\Theta)}{\partial \Theta}\,dx = \int p(x;\Theta)\,\frac{\partial \log f(x;\Theta)}{\partial \Theta}\,dx = \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{p(x;\Theta)}

– so

\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_{p(x;\Theta_t)} - \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_X \right)

where \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{p(x;\Theta)} can be estimated numerically (illustrated in the sketch below).
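The practical value of this identity is that the intractable \partial \log Z / \partial \Theta becomes an expectation under p(x;\Theta), which can be estimated from samples. A sketch of mine using the toy Gaussian, where the estimate can be cross-checked analytically; samples are drawn directly here only because the toy model is Gaussian, whereas in general MCMC is needed, as the next slide explains:

import numpy as np

mu, sigma = 2.0, 1.5
rng = np.random.default_rng(1)

# Samples from p(x; Theta); drawn directly only because the toy model is Gaussian
xs = rng.normal(mu, sigma, size=100_000)

# Monte Carlo estimate of d(log Z)/d(sigma) = < d(log f)/d(sigma) >_{p(x;Theta)}
dlogf_dsigma = ((xs - mu) ** 2) / sigma ** 3
print(dlogf_dsigma.mean())   # estimate
print(1 / sigma)             # analytic value: d(log(sigma sqrt(2 pi)))/d(sigma)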

Page 7: Markov Chain Monte Carlo sampling

– To estimate \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{p(x;\Theta)} we must draw samples from p(x;\Theta).
– Since Z(\Theta) is unknown, we cannot draw samples directly, e.g. by inverting the cumulative distribution.
– Markov Chain Monte Carlo (MCMC) methods turn random samples into samples from the proposed distribution, without knowing Z(\Theta).
– Metropolis algorithm (a Python sketch follows below):
• Perturb the samples, e.g. x'_k = x_k + randn(size(x_k))
• Reject x'_k if \frac{p(x'_k;\Theta)}{p(x_k;\Theta)} < rand(1). Note that this ratio equals \frac{f(x'_k;\Theta)}{f(x_k;\Theta)}, so Z(\Theta) cancels.
• Repeat the cycle for all samples until the distribution stabilizes.
– Stabilization takes many cycles, and there is no accurate criterion for determining when it has occurred.
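The slide's pseudocode is MATLAB-like (randn, rand). A Python rendering of the same cycle might look as follows; only the ratio of unnormalized densities is needed, so Z(\Theta) never has to be computed:

import numpy as np

def metropolis_cycle(xs, f, rng):
    # One cycle: perturb every sample, then accept or reject in place.
    proposals = xs + rng.standard_normal(xs.shape)   # x'_k = x_k + randn(size(x_k))
    ratio = f(proposals) / f(xs)                     # Z(Theta) cancels in this ratio
    accept = rng.random(xs.shape) < ratio            # reject x'_k if ratio < rand(1)
    return np.where(accept, proposals, xs)

# Example with the toy model's unnormalized density
mu, sigma = 2.0, 1.5
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
xs = rng.uniform(-5, 5, size=5000)   # arbitrary starting samples
for _ in range(1000):                # many cycles; no sharp stopping criterion exists
    xs = metropolis_cycle(xs, f, rng)
print(xs.mean(), xs.std())           # should approach mu and sigma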

Page 8: Markov Chain Monte Carlo sampling

– Let us use the training data, X, as the starting point for our MCMC sampling.
– Our parameter update equation becomes:

\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_{X^{\infty}_{\Theta_t}} - \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_{X^{0}_{\Theta_t}} \right)

Notation: X^0_\Theta is the training data, X^n_\Theta is the training data after n cycles of MCMC, and X^\infty_\Theta denotes samples from the proposed distribution with parameters \Theta.

Page 9: Contrastive divergence

– Let us make the number of MCMC cycles per iteration small, say even 1.
– Our parameter update equation is now (implemented in the sketch below):

\Theta_{t+1} = \Theta_t - \eta \left( \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_{X^{1}_{\Theta_t}} - \left\langle \frac{\partial \log f(x;\Theta_t)}{\partial \Theta_t} \right\rangle_{X^{0}_{\Theta_t}} \right)

– Intuition: one MCMC cycle is enough to move the data from the target distribution towards the proposed distribution, and so to suggest in which direction the proposed distribution should move to better model the training data.
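Putting the pieces together for the toy Gaussian, a sketch of CD-1 learning (assumptions mine: NumPy, the step size, and a small positive floor on sigma; running n > 1 Metropolis cycles per iteration instead recovers the update of the previous slide as n grows large):

import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal(2.0, 1.5, size=2000)   # training data, X^0

mu, sigma, eta = 0.0, 1.0, 0.05
for t in range(2000):
    f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    # One Metropolis cycle starting from the data: X^0 -> X^1
    prop = X0 + rng.standard_normal(X0.shape)
    X1 = np.where(rng.random(X0.shape) < f(prop) / f(X0), prop, X0)

    # <d(log f)/d(Theta)>_{X^1} - <d(log f)/d(Theta)>_{X^0}
    dmu = ((X1 - mu) / sigma ** 2).mean() - ((X0 - mu) / sigma ** 2).mean()
    dsigma = (((X1 - mu) ** 2) / sigma ** 3).mean() - (((X0 - mu) ** 2) / sigma ** 3).mean()
    mu -= eta * dmu
    sigma = max(sigma - eta * dsigma, 1e-2)   # crude floor keeps sigma positive
print(mu, sigma)   # should approach the data mean and standard deviation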

Page 10: Contrastive divergence bias

– We assume:

\frac{\partial E(X;\Theta)}{\partial \Theta} \approx \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{X^{1}_{\Theta}} - \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{X^{0}_{\Theta}}

– ML learning is equivalent to minimizing X^0_\Theta \| X^\infty_\Theta, where P \| Q = \int p(x) \log \frac{p(x)}{q(x)}\,dx (the Kullback-Leibler divergence).
– CD attempts to minimize X^0_\Theta \| X^\infty_\Theta - X^1_\Theta \| X^\infty_\Theta, whose gradient is

\frac{\partial}{\partial \Theta}\left( X^0_\Theta \| X^\infty_\Theta - X^1_\Theta \| X^\infty_\Theta \right) = \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{X^{1}_{\Theta}} - \left\langle \frac{\partial \log f(x;\Theta)}{\partial \Theta} \right\rangle_{X^{0}_{\Theta}} - \frac{\partial X^1_\Theta}{\partial \Theta}\,\frac{\partial \left( X^1_\Theta \| X^\infty_\Theta \right)}{\partial X^1_\Theta}

– Usually \frac{\partial X^1_\Theta}{\partial \Theta}\,\frac{\partial ( X^1_\Theta \| X^\infty_\Theta )}{\partial X^1_\Theta} \approx 0, but this term can sometimes bias the results (an empirical check follows below).
– See "On Contrastive Divergence Learning", Carreira-Perpiñán & Hinton, AISTATS 2005, for more details.
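One way to see the approximation empirically is to fix \Theta, average the CD-1 update direction over many independent Metropolis cycles, and compare it with the exact ML gradient, which is available for the toy Gaussian. A sketch (all names and settings are mine):

import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal(2.0, 1.5, size=5000)   # training data
mu, sigma = 1.0, 1.0                   # fixed, deliberately wrong parameters

# Exact per-datum ML gradient of E at (mu, sigma)
exact = np.array([-((X0 - mu) / sigma ** 2).mean(),
                  1 / sigma - (((X0 - mu) ** 2) / sigma ** 3).mean()])

# CD-1 gradient estimate, averaged over many independent Metropolis cycles
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
est = np.zeros(2)
trials = 200
for _ in range(trials):
    prop = X0 + rng.standard_normal(X0.shape)
    X1 = np.where(rng.random(X0.shape) < f(prop) / f(X0), prop, X0)
    est += np.array([((X1 - mu) / sigma ** 2).mean() - ((X0 - mu) / sigma ** 2).mean(),
                     (((X1 - mu) ** 2) / sigma ** 3).mean() - (((X0 - mu) ** 2) / sigma ** 3).mean()])
est /= trials
print(exact)   # true gradient of E
print(est)     # CD-1 estimate: a similar direction, but not identical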

Page 11: Product of Experts

Page 12: Dimensionality issues