CSE/NB 528 Lecture 13: From Unsupervised Learning to Supervised Learning (Chapters 8 & 10)


TRANSCRIPT

Page 1:

CSE/NB 528

Lecture 13: From Unsupervised Learning to

Supervised Learning (Chapters 8 & 10)

Page 2:

What’s on the menu today?

Unsupervised Learning: Sparse coding and Predictive coding

Supervised Learning: Classification versus Function Approximation/Regression

Perceptrons & Learning Rule

Linear Separability: Minsky-Papert deliver the bad news

Multilayer networks to the rescue

Radial Basis Function Networks

(Copyright, Gary Larson)

Page 3:

Recall: Generative Models for Unsupervised Learning

[Diagram: a generative model mapping causes v to data u]

Suppose input u was generated by a linear superposition of causes v1, v2, …, vk with basis vectors (or "features") gi (e.g., an image composed of several features, or audio containing several voices):

$\mathbf{u} = \sum_i v_i \mathbf{g}_i + \text{noise}$

Page 4:

Example: “Eigenfaces”

Suppose your basis vectors or "features" gi are the eigenvectors of the input covariance matrix of face images

[Images: face image u and eigenfaces g1, g2, …]

Page 5:

Linear combination of eigenfaces

Image $\mathbf{u} \approx v_1\mathbf{g}_1 + v_2\mathbf{g}_2 + \cdots + v_8\mathbf{g}_8 = \sum_i v_i\mathbf{g}_i$
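As a rough illustration (my own sketch, not from the lecture), here is how the eigenfaces and the causes v could be computed with NumPy; the `faces` array is a random placeholder standing in for real face images.

```python
import numpy as np

# Hypothetical data: each row of `faces` is a flattened 16x16 face image.
rng = np.random.default_rng(0)
faces = rng.random((100, 16 * 16))

mean_face = faces.mean(axis=0)
X = faces - mean_face                        # center the data

# Eigenvectors of the input covariance matrix are the "eigenfaces" g_i.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
G = eigvecs[:, ::-1][:, :8]                  # top 8 eigenfaces as columns g_1..g_8

# Causes v for one image u, and its reconstruction u ~ sum_i v_i g_i
u = X[0]
v = G.T @ u                                  # project u onto the eigenfaces
u_hat = G @ v                                # linear combination of eigenfaces
print("reconstruction error:", np.linalg.norm(u - u_hat))
```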

Page 6:

Linear Generative Model

Suppose input u was generated by linear superposition of causes v1, v2, …, vk and basis vectors or "features" gi:

Problem: For a set of inputs u, estimate the causes vi for each u and learn the feature vectors gi

Suppose the number of causes is much smaller than the size of the input

Idea: Find v and G that minimize reconstruction errors:

$\mathbf{u} = \sum_i v_i \mathbf{g}_i + \text{noise} = G\mathbf{v} + \text{noise}$

$E = \frac{1}{2}\sum_i \Big(u_i - \sum_j G_{ij} v_j\Big)^2 = \frac{1}{2}(\mathbf{u} - G\mathbf{v})^T(\mathbf{u} - G\mathbf{v})$
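A minimal NumPy sketch (my own, with made-up dimensions) of this generative model and the reconstruction error E defined above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_input, n_causes = 20, 5                    # number of causes << size of input

G = rng.normal(size=(n_input, n_causes))     # columns are the features g_i
v = rng.normal(size=n_causes)                # causes
u = G @ v + 0.1 * rng.normal(size=n_input)   # u = G v + noise

def reconstruction_error(u, G, v):
    """E = 1/2 (u - Gv)^T (u - Gv)."""
    r = u - G @ v
    return 0.5 * r @ r

print(reconstruction_error(u, G, v))
```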

Page 7:

Probabilistic Interpretation

E is the same as the negative log likelihood of data:

Likelihood = Gaussian with mean Gv and identity covariance matrix I

$p[\mathbf{u}\,|\,\mathbf{v}; G] = N(\mathbf{u};\, G\mathbf{v},\, I)$

$E = -\log p[\mathbf{u}\,|\,\mathbf{v}; G] = \frac{1}{2}(\mathbf{u} - G\mathbf{v})^T(\mathbf{u} - G\mathbf{v}) + C$

Minimizing error function E is the same as

maximizing log likelihood of the data
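Spelling out the step (standard Gaussian algebra, not shown on the slide): taking the negative log of the likelihood above recovers E up to a constant,

$$-\log p[\mathbf{u}\,|\,\mathbf{v};G] = -\log\left[(2\pi)^{-N/2}\exp\!\Big(-\tfrac{1}{2}(\mathbf{u}-G\mathbf{v})^T(\mathbf{u}-G\mathbf{v})\Big)\right] = \tfrac{1}{2}(\mathbf{u}-G\mathbf{v})^T(\mathbf{u}-G\mathbf{v}) + \tfrac{N}{2}\log 2\pi$$

where N is the dimensionality of u, so the constant C above is (N/2) log 2π.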

Page 8:

Bayesian approach

Would like to maximize posterior:

$p[\mathbf{v}\,|\,\mathbf{u}; G] \propto p[\mathbf{u}\,|\,\mathbf{v}; G]\; p[\mathbf{v}; G]$

Equivalently, find v and G that maximize:

$F(\mathbf{v}, G) = \log p[\mathbf{u}\,|\,\mathbf{v}; G] + \log p[\mathbf{v}; G]$

Prior for causes (what should this be?)

Page 9:

What do we know about the causes v?

We would like the causes to be independent: if cause A and cause B always occur together, then perhaps they should be treated as a single cause AB?

Examples:
Image: Composed of several independent edges
Sound: Composed of independent spectral components
Objects: Composed of several independent parts

Page 10:

What do we know about the causes v?

We would like the causes to be independent

Idea 1: We would like: $p[\mathbf{v}; G] = \prod_a p[v_a; G]$

Idea 2: If causes are independent, only a few of them will be active for any input

$v_a$ will be 0 most of the time but high for a few inputs

Suggests a sparse distribution for the prior $p[v_a; G]$

Page 11:

Prior Distributions for Causes

$p[\mathbf{v}; G] = \prod_a \exp(g(v_a))$

$g(v) = -|v|$ or $g(v) = -\log(1 + v^2)$

[Figure: histogram of spike counts in area IT in a monkey viewing TV, alongside possible sparse prior distributions and their log priors]
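For reference (not stated on the slide), these two log priors correspond to familiar heavy-tailed densities:

$$\exp(g(v)) = e^{-|v|} \;\propto\; \text{Laplacian (double-exponential) density}, \qquad \exp(g(v)) = e^{-\log(1+v^2)} = \frac{1}{1+v^2} \;\propto\; \text{Cauchy density}$$

Both are sharply peaked at zero with heavy tails, which is what makes the prior sparse.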

Page 12:

Finding the optimal v and G

Want to maximize:

$F(\mathbf{v}, G) = \log p[\mathbf{u}\,|\,\mathbf{v}; G] + \log p[\mathbf{v}; G] = -\frac{1}{2}(\mathbf{u} - G\mathbf{v})^T(\mathbf{u} - G\mathbf{v}) + \sum_a g(v_a) + K$

Approximate EM algorithm:

E step: Maximize F with respect to v keeping G fixed
Set dv/dt ∝ dF/dv ("gradient ascent/hill-climbing")

M step: Maximize F with respect to G, given the v above
Set dG/dt ∝ dF/dG ("gradient ascent/hill-climbing")

(During implementation, let v converge for each input before changing synaptic weights G)
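A compact NumPy sketch of this approximate EM procedure, assuming the Cauchy-style prior g(v) = -log(1 + v^2) from the earlier slide; the synthetic inputs, step sizes, iteration counts, and the column-norm cap are my own choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_causes = 16, 8

# Synthetic inputs generated from sparse causes and a hidden "true" dictionary
# (a stand-in for image patches; purely illustrative).
G_true = rng.normal(size=(n_input, n_causes))
def sample_input():
    v = rng.normal(size=n_causes) * (rng.random(n_causes) < 0.2)   # sparse causes
    return G_true @ v + 0.01 * rng.normal(size=n_input)

def g_prime(v):
    """Derivative of the sparse log prior g(v) = -log(1 + v^2)."""
    return -2.0 * v / (1.0 + v ** 2)

G = rng.normal(scale=0.1, size=(n_input, n_causes))   # learned features (columns g_i)
eta_v, eta_G = 0.05, 0.02                             # assumed step sizes

for _ in range(500):
    u = sample_input()
    # E step: let v converge for this input (gradient ascent on F with G fixed).
    v = np.zeros(n_causes)
    for _ in range(50):
        v += eta_v * (G.T @ (u - G @ v) + g_prime(v))
    # M step: one Hebbian-like update of G (gradient ascent on F with v fixed).
    G += eta_G * np.outer(u - G @ v, v)
    # Keep feature norms bounded (a common practical trick, not part of the derivation).
    G /= np.maximum(np.linalg.norm(G, axis=0, keepdims=True), 1.0)
```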

Page 13:

E Step: Estimating v

Gradient ascent: $\frac{d\mathbf{v}}{dt} \propto \frac{dF}{d\mathbf{v}} = G^T(\mathbf{u} - G\mathbf{v}) + g'(\mathbf{v})$

Firing rate dynamics (recurrent network): $\frac{d\mathbf{v}}{dt} = G^T(\mathbf{u} - G\mathbf{v}) + g'(\mathbf{v})$

Here $G\mathbf{v}$ is the reconstruction (prediction) of u, $(\mathbf{u} - G\mathbf{v})$ is the error, and $g'(\mathbf{v})$ is the sparseness constraint.

Page 14:

Recurrent network for estimating v

[Network diagram: the prediction $G\mathbf{v}$ is subtracted from the input $\mathbf{u}$ to form the error $(\mathbf{u} - G\mathbf{v})$, which is fed back to correct $\mathbf{v}$]

$\frac{d\mathbf{v}}{dt} = G^T(\mathbf{u} - G\mathbf{v}) + g'(\mathbf{v})$

[Suggests a role for feedback pathways in the cortex (Rao & Ballard, 1999)]

Page 15:

M step: Learning the Synaptic Weights G

[Network diagram: prediction $G\mathbf{v}$ and error $(\mathbf{u} - G\mathbf{v})$, as on the previous slide]

Gradient ascent: $\frac{dG}{dt} \propto \frac{dF}{dG} = (\mathbf{u} - G\mathbf{v})\,\mathbf{v}^T$

Learning rule: $\frac{dG}{dt} = (\mathbf{u} - G\mathbf{v})\,\mathbf{v}^T$
Hebbian! (similar to Oja's rule)

Page 16:

Result: Learning G for Natural Images

Each square is a column g_i of G (obtained by collapsing rows of the square into a vector)

Any image patch u can be expressed as: $\mathbf{u} = \sum_i v_i \mathbf{g}_i = G\mathbf{v}$

Almost all the g_i represent local edge features

(Olshausen & Field, 1996)

Page 17:

Sparse Coding Network is a special case of

Predictive Coding Networks

(Rao, Vision Research, 1999)

(See also Chapter 12 in the Anastasio textbook)

Page 18:

Predictive Coding Model of Visual Cortex

(Rao & Ballard, Nature Neurosci., 1999)

Page 19:

Predictive coding model explains contextual effects

[Figure: contextual-effect responses in monkey primary visual cortex (Zipser et al., J. Neurosci., 1996) compared with the model (Rao & Ballard, Nature Neurosci., 1999)]

Page 20:

Contextual effects arise from Natural Image properties

(Rao & Ballard, Nature Neurosci., 1999)

Page 21:

What if your data comes with not just inputs but

also outputs?

Enter…Supervised Learning

Page 22:

Example: Supervised Learning for Face Detection

Can we learn a network to distinguish

faces from other objects?

Page 23:

The Classification Problem

[Scatter plot: faces (output +1) and other objects (output -1) plotted as points in feature space]

Idea: Find a separating hyperplane (line in this case)

Page 24:

Supervised Learning

Two Primary Tasks

1. Classification

Inputs u1, u2, … and discrete classes C1, C2, …, Ck

Training examples: (u1, C2), (u2, C7), etc.

Learn the mapping from an arbitrary input to its class

Example: Inputs = images, output classes = face, not a face

2. Function Approximation (Regression)

Inputs u1, u2, … and continuous outputs v1, v2, …

Training examples: (input, desired output) pairs

Learn to map an arbitrary input to its corresponding output

Example: Highway driving

Input = road image, output = steering angle

Page 25:

Classification using “Perceptrons”

Fancy name for a type of layered feedforward network

Uses artificial neurons (“units”) with binary inputs and

outputs

Multilayer

Single-layer

Page 26:

Perceptrons use “Threshold Units”

Artificial neuron:
m binary inputs (-1 or 1) and 1 output (-1 or 1)
Synaptic weights $w_{ij}$
Threshold $\theta_i$

Inputs $u_j$ (-1 or +1) → Weighted Sum → Threshold → Output $v_i$ (-1 or +1)

$v_i = \Theta\Big(\sum_j w_{ij} u_j - \theta_i\Big)$ where $\Theta(x) = +1$ if $x \geq 0$ and $-1$ if $x < 0$

Page 27:

What does a Perceptron compute?

Consider a single-layer perceptron

Weighted sum forms a linear hyperplane (line, plane, …)

Everything on one side of hyperplane is in class 1 (output =

+1) and everything on other side is class 2 (output = -1)

Any function that is linearly separable can be computed by

a perceptron

$\sum_j w_{ij} u_j - \theta_i = 0$ (the separating hyperplane)

Page 28:

Linear Separability

Example: AND function is linearly separable

a AND b = 1 if and only if a = 1 and b = 1

[Left: perceptron for AND: inputs u1 and u2 feed a threshold unit v with threshold = 1.5. Right: plot of the four inputs in the (u1, u2) plane with u1, u2 ∈ {-1, +1}; the point (1,1) is the only +1 output, separated from the -1 outputs by a linear hyperplane.]
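A quick check (my own code, with unit weights assumed for the two inputs) that a threshold unit with threshold 1.5 computes AND on ±1 inputs:

```python
import numpy as np

def threshold_unit(u, w, theta):
    """v = Theta(sum_j w_j u_j - theta), with Theta(x) = +1 if x >= 0 else -1."""
    return 1 if np.dot(w, u) - theta >= 0 else -1

w, theta = np.array([1.0, 1.0]), 1.5        # the AND perceptron from the slide
for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, threshold_unit(np.array([u1, u2]), w, theta))
# Only (1, 1) yields +1; all other inputs yield -1.
```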

Page 29:

Perceptron Learning Rule

Given inputs u and desired output vd, adjust w as follows:

1. Compute error signal e = (vd – v) where v is the current output

2. Change weights according to the error:

E.g., for positive inputs, this increases weights if error is

positive and decreases weights if error is negative (opposite

for negative inputs)

$\mathbf{w} \rightarrow \mathbf{w} + \epsilon\,(v^d - v)\,\mathbf{u}$   (A → B means: replace A with B)
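A minimal sketch of this rule in NumPy, again learning AND; the learning rate, random initialization, and the trick of updating the threshold as an extra bias-like weight are my own assumptions:

```python
import numpy as np

def predict(w, theta, u):
    return 1 if np.dot(w, u) - theta >= 0 else -1

# Training set for AND on +/-1 inputs.
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]

rng = np.random.default_rng(0)
w, theta, eps = rng.normal(size=2), 0.0, 0.1      # assumed initialization / learning rate

for _ in range(20):                               # a few passes over the data
    for u, v_d in data:
        u = np.asarray(u, dtype=float)
        v = predict(w, theta, u)
        w += eps * (v_d - v) * u                  # w <- w + eps (v_d - v) u
        theta -= eps * (v_d - v)                  # threshold treated as a weight on a -1 input

print([predict(w, theta, np.asarray(u, float)) for u, _ in data])   # -> [-1, -1, -1, 1]
```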

Page 30:

What about the XOR function?

[Plot: the four inputs in the (u1, u2) plane, with the +1 and -1 outputs marked]

u1   u2   XOR
-1   -1   +1
 1   -1   -1
-1    1   -1
 1    1   +1

Can a straight line separate the +1 outputs from the -1 outputs?

Page 31:

Linear Inseparability

Single-layer perceptron with threshold units fails if

classification task is not linearly separable

Example: XOR

No single line can separate the “yes” (+1)

outputs from the “no” (-1) outputs!

Minsky and Papert’s book

showing such negative results put

a damper on neural networks

research for over a decade!

[Plot: the four XOR inputs in the (u1, u2) plane; no straight line separates the +1 outputs at (-1,-1) and (1,1) from the -1 outputs]

Page 32:

How do we deal with linear inseparability?

Page 33:

Multilayer Perceptrons

Removes limitations of single-layer networks: can solve XOR

An example of a two-layer perceptron that computes XOR

Output is +1 if and only if x + y + 2·Θ(– x – y – 1.5) > – 1, where Θ(·) is the hidden unit's threshold function

[Network diagram: inputs x and y (each +1 or -1) connect with weights 1 and 1 to the output unit (threshold = -1) and with weights -1 and -1 to a hidden unit (threshold = 1.5); the hidden unit connects to the output unit with weight 2]
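A short check (my own code) that the two-layer network described above reproduces the ±1 truth table from the earlier XOR slide:

```python
def theta_fn(x):
    """Threshold function: +1 if x >= 0 else -1."""
    return 1 if x >= 0 else -1

def two_layer(x, y):
    h = theta_fn(-x - y - 1.5)            # hidden unit: weights -1, -1, threshold 1.5
    return theta_fn(x + y + 2 * h + 1)    # output unit: weights 1, 1, 2, threshold -1

for x in (-1, 1):
    for y in (-1, 1):
        print(x, y, two_layer(x, y))
# Matches the table on the XOR slide: +1 for (-1,-1) and (1,1), -1 for the mixed inputs.
```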

Page 34:

What if you want to approximate a

continuous function?

Can a network learn to drive?

Page 35:

Example Network

Input u = [u1 u2 … u960] = image pixels (the current image)

Desired Output: d = [d1 d2 … d30] = steering angle

Page 36:

Function Approximation

We want networks that can learn a function
Network maps real-valued inputs to real-valued outputs

Want to generalize to predict outputs for new inputs

Idea: Given input data, map input to desired output by

adapting weights

Page 37:

Example: Radial Basis Function (RBF) Networks

[Diagram: input nodes → one layer of hidden neurons → output neurons]

Page 38:

Radial Basis Function Networks

Hidden layer output (Gaussian bell-shaped function):

$h_j = \exp\left(-\frac{\sum_{i=1}^{n}(x_i - w_{i,j})^2}{2\sigma^2}\right)$

[Diagram: input nodes $x_i$ connect to hidden neurons $h_j$ through weights $w_{i,j}$, which in turn connect to output neurons]

Page 39:

Radial Basis Function Networks

Output of network: $\text{out}_j = \sum_i v_{j,i}\, h_i$

• Main Idea: Use a mixture of Gaussian functions $h_i$ to approximate the output
• Gaussians are called "basis functions"

[Diagram: hidden neurons $h_i$ connect to output neurons through weights $v_{j,i}$]
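A NumPy sketch of the forward pass defined on this and the previous slide; the centers, width, and output weights below are random placeholders rather than learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 5, 1

W = rng.normal(size=(n_in, n_hidden))     # w_{i,j}: mean (center) stored by hidden unit j
sigma = 1.0                               # shared width (could also be one per unit)
V = rng.normal(size=(n_out, n_hidden))    # v_{j,i}: output weights

def rbf_forward(x):
    # Hidden layer: h_j = exp(-sum_i (x_i - w_{i,j})^2 / (2 sigma^2))
    h = np.exp(-np.sum((x[:, None] - W) ** 2, axis=0) / (2 * sigma ** 2))
    # Output layer: out_j = sum_i v_{j,i} h_i
    return V @ h, h

out, h = rbf_forward(np.array([0.3, -0.7]))
print(h, out)
```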

Page 40:

RBF networks

Each hidden unit stores a mean (in its weights) and a

variance

Each hidden unit computes a Gaussian function of input x

Can derive learning rules for output weights vi, means wi,

and variances 2i by minimizing squared output error

function (via gradient descent learning)

See http://en.wikipedia.org/wiki/Radial_basis_function_network for more

details and links.
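As a sketch of what such gradient-descent learning could look like (my own example, not from the lecture): fitting only the output weights to a 1-D target function, with the centers fixed on a grid and the width, learning rate, and target all assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.linspace(-3, 3, 10)           # fixed hidden-unit means (assumption)
sigma, eta = 0.7, 0.05                     # width and learning rate (assumptions)
v = np.zeros(10)                           # output weights to be learned

def hidden(x):
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

# Train on (x, sin x) pairs by minimizing the squared output error.
for _ in range(5000):
    x = rng.uniform(-3, 3)
    h = hidden(x)
    err = np.sin(x) - v @ h                # desired output minus network output
    v += eta * err * h                     # gradient descent on 1/2 err^2 w.r.t. v

xs = np.linspace(-3, 3, 7)
print(np.round([v @ hidden(x) for x in xs], 2))   # network outputs
print(np.round(np.sin(xs), 2))                    # targets for comparison
```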

Page 41:

Next Class: Backpropagation and Reinforcement

Learning

Things to do: Read Chapter 9

Finish Homework 3 (due Friday, May 20)

Work on group project

I’ll be starring in

reinf. learning