Supervised Learning in Neural Networks


Page 1: Supervised Learning  in  Neural Networks

Supervised Learning in Neural Networks

Learning when all the correct answers are available.

Page 2: Supervised Learning  in  Neural Networks

Remembering -vs- Learning

• Remember: Store a given piece of information such that it can be retrieved and reused in the future in the same (or very similar) way that it was used earlier.

• Learn: Extract useful generalizations (or specializations) from information such that, in the future, it may be:

– Applied to new (previously unseen) situations

– More effectively applied to previously-seen cases

Page 3: Supervised Learning  in  Neural Networks

Supervised Learning

• Generalizing (and specializing) useful knowledge from data items, each of which contains both a situation (context) and the proper response (action) for that situation.

• In an educational setting, the teacher provides a problem and THE CORRECT ANSWER.

• In Reinforcement Learning (RL) the teacher only responds “Right” or “Wrong”.

Page 4: Supervised Learning  in  Neural Networks

Learning = Weight Adjustment

• Generalized Hebbian Weight Adjustment:
– The sign of the weight change = the sign of the correlation between x_i and z_j:

∆w_ji ∝ x_i z_j

– z_j is:
• x_j for Hopfield networks
• d_j − x_j for perceptrons (d_j = desired output)
• d_j − Σ_i x_i w_ji for adalines (d_j = desired output)

(Diagram: node i, with output x_i, feeds node j through weight w_j,i; z_j is computed at node j, which outputs x_j.)

Page 5: Supervised Learning  in  Neural Networks

Perceptrons

Perceptron: A machine that classifies input vectors by applying linear functions to them (Rosenblatt, 1958).

Perceptron Learning Algorithm: A stochastic gradient-descent method for finding a linear function that properly classifies a set of input vectors (Minsky & Papert, 1969).

(Diagram: a perceptron with inputs X and Y, weights w_x = -1 and w_y = 1, and threshold 5. It fires when:

w_x·x + w_y·y > 5, i.e., (-1)x + (1)y > 5

Equivalently, the threshold can be folded into a bias weight of -5 on a constant input of 1, with threshold 0:

(-1)x + (1)y + (-5)(1) > 0)

Page 6: Supervised Learning  in  Neural Networks

Perceptron Training Rule (Mitchell, 1997)

(Diagram: node i receives inputs x_j through weights w_ij, including a constant bias input of 1 with weight w_i0; it produces output o_i, and the error term is t_i − o_i.)

Intuitive: If sgn(error) = +, then o_i needs to increase. So if x_j > 0, then increase w_ij; else if x_j < 0, then decrease w_ij.

If sgn(error) = -, then o_i needs to decrease. So if x_j > 0, then decrease w_ij; else if x_j < 0, then increase w_ij.

If sgn(error) = 0, then leave w_ij alone.

If the input vectors are linearly separable, and the learning rate is sufficiently small, then repeated presentation of the training inputs and applications of this rule will lead to a set of weights that correctly classifies all training examples.

∆w_ij = η(t_i − o_i)x_j

where η is the learning rate, t_i is the expected output, and (t_i − o_i) is the error. Apply to all weights after each misclassified training example.
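The rule above can be sketched as a small program (the data set and learning rate here are hypothetical, not from the slides):

```python
# Perceptron training rule: delta_w_ij = eta * (t_i - o_i) * x_j,
# applied after each misclassified training example.

def sgn(x):
    return 1 if x >= 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100):
    """examples: list of (x_vector, target) with targets in {-1, +1}.
    A constant bias input of 1 is appended to every x_vector."""
    n = len(examples[0][0]) + 1          # +1 for the bias weight
    w = [0.0] * n
    for _ in range(epochs):
        errors = 0
        for x, t in examples:
            xb = list(x) + [1]           # bias input
            o = sgn(sum(wi * xi for wi, xi in zip(w, xb)))
            if o != t:                   # update only on misclassification
                errors += 1
                for j in range(n):
                    w[j] += eta * (t - o) * xb[j]
        if errors == 0:                  # converged (linearly separable data)
            break
    return w

# Hypothetical linearly separable problem: class is the sign of (y - x)
data = [((0, 1), 1), ((1, 0), -1), ((1, 2), 1), ((2, 1), -1)]
w = train_perceptron(data)
```

Since this toy data is linearly separable and η is small, the loop terminates with a weight vector that classifies every training example correctly, as the convergence statement above promises.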

Page 7: Supervised Learning  in  Neural Networks

Perceptrons & Adalines

• Descriptions of the 2 vary from book to book.
• I/O
– Originally, perceptrons used (0 & 1), adalines (-1 & 1). Now, perceptrons can use both.
– Both sum inputs and use a step function to compute a binary output.
• Training
– Perceptron training rule: ∆w_ij = η(t_i − o_i)x_j (η = learning rate)
– Delta rule: ∆w_ij = η(t_i − net_i)x_j
– net_i -vs- o_i is the only difference, but it is a big difference:
• The delta rule can asymptotically approach the error minimum of a non-linearly-separable problem; the perceptron rule cannot.
• The perceptron rule can converge to a perfect classifier of a linearly separable problem; the delta rule cannot always do so.
• Terminology Warning!!
– Some books use the term "multi-layer perceptron", but neither perceptrons nor adalines in their classic forms are appropriate for trainable multi-layer networks, since both use step transfer functions, which are non-continuous and hence non-differentiable.
– The sigmoid units used in most multi-layer feed-forward networks are similar to both perceptrons and adalines, but they use a continuous, differentiable sigmoidal transfer function.
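The net_i -vs- o_i difference can be sketched in one step (all numbers hypothetical):

```python
# The two rules differ only in the error term: the perceptron rule uses the
# thresholded output o_i; the delta rule uses the raw weighted sum net_i.

def sgn(x):
    return 1 if x >= 0 else -1

def perceptron_update(w, x, t, eta):
    o = sgn(sum(wi * xi for wi, xi in zip(w, x)))      # thresholded output
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

def delta_update(w, x, t, eta):
    net = sum(wi * xi for wi, xi in zip(w, x))         # raw weighted sum
    return [wi + eta * (t - net) * xi for wi, xi in zip(w, x)]

w, x, t, eta = [0.5, -0.5], [1.0, 1.0], 1, 0.1
wp = perceptron_update(w, x, t, eta)   # net = 0, so o = sgn(0) = 1 = t: no change
wd = delta_update(w, x, t, eta)        # net = 0, error (t - net) = 1: weights move
```

On this example the perceptron rule is satisfied (the class is already right) while the delta rule keeps pushing net toward the target value itself, which is exactly why it behaves differently on the two convergence cases listed above.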

Page 8: Supervised Learning  in  Neural Networks

Function Learning

• The main application of feed-forward ANNs is to learn a general function (mapping) between a particular domain (D) and range (R) when given a set of examples: {(d, r): d in D, r in R}.

• D and R may contain vectors or scalars

(Diagram: elements d1, d2, d3, d4 of D are mapped by F to elements r1, r2, r3 of R.)

Example set = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}

Goal: Build an ANN that can take ANY element d of D on its input layer and produce F(d) on its output layer.
Problem: The example set normally represents a very small fraction of the complete mapping set (which may be infinite).

Page 9: Supervised Learning  in  Neural Networks

Standard ANN Approach

• Feed-forward neural net with back-propagation learning.

Examples = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}

Examples = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}

(Diagram: a feed-forward network with an encoder at the input layer and a decoder at the output layer. Input d1 produces output r*; the error is E = r3 − r*.)

Use E to adjust weights based on their contributions to E, so that E is reduced.

Page 10: Supervised Learning  in  Neural Networks

Classification & Learning

(Diagram: a perceptron with inputs x, y and a constant bias input 1, weights w_x, w_y, w_z, and threshold 0.)

case    x    y   sum*  class
  1     1    3    -3    -1
  2    -5    2     2    +1
  3     1    3    -3    -1
  4     1    9     3    +1
  5    -2    4     1    +1
  6    -7    2     4    +1
  7     5    5    -5    -1

Classification: The perceptron should compute the proper class for each input x-y pair. For a single perceptron, this is only possible when the input vectors are linearly separable.

Learning: Find the proper values for weights w_x, w_y and w_z so that the perceptron properly classifies all input cases. This is a search problem in weight-vector space.

*Sum assumes the values (w_x, w_y, w_z) = (-1, 1, -5) for the weights.
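The table can be checked directly from the footnote's weight values; a minimal sketch:

```python
# Recompute the table with the stated weights (wx, wy, wz) = (-1, 1, -5):
# sum = wx*x + wy*y + wz*1, class = +1 if sum > 0 else -1.

wx, wy, wz = -1, 1, -5
cases = [(1, 3), (-5, 2), (1, 3), (1, 9), (-2, 4), (-7, 2), (5, 5)]
rows = []
for x, y in cases:
    s = wx * x + wy * y + wz * 1        # bias input is the constant 1
    rows.append((x, y, s, 1 if s > 0 else -1))
# rows reproduces the sum and class columns of the table
```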

Page 11: Supervised Learning  in  Neural Networks

Training & Testing Phases

(Diagram: the Data Set is split into a Training Set, run through the network N times, and a Test Set, run through 1 time.)

Training:
• Epoch = one processing round of the entire training set.
• Run training through many epochs.
• Record error.
• Use error to update weights and (eventually) achieve proper discrimination among the classes present in the training cases.

Testing:
• Run the test set through the ANN 1 time.
• Do not change weights.
• Record error.

Page 12: Supervised Learning  in  Neural Networks

Gradient Descent Weight Learning

(Diagram: a feed-forward network with inputs, weights w_a ... w_k ... w_m, outputs o1, o2, o3, expected outputs t1, t2, t3, and weight changes ∆w derived from the error.)

Base weight changes upon their contribution to the error, such that the updated weights will create LESS error on the same training cases.

Contribution = ∂Error/∂w_ij

Page 13: Supervised Learning  in  Neural Networks

Gradient Descent

(Diagram: the error surface E plotted as a function of two weights, w31 and w32.)

∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n] = gradient of E w.r.t. the weight vector

Page 14: Supervised Learning  in  Neural Networks

Gradient-Descent Training Rule

• As described in Machine Learning (Mitchell, 1997).
• Also called the delta rule, although not quite the same as the adaline delta rule.
• Compute E_i, the error at output node i over the set of training instances, D:

E_i = ½ Σ_{d∈D} (t_id − o_id)²    (D = training set)

∆w_ij = −η ∂E_i/∂w_ij = distance * direction (to move in error space)

Intuitive: Do what you can to reduce E_i, so:
– if increases in w_ij will increase E_i (i.e., ∂E_i/∂w_ij > 0), then reduce w_ij,
– but if increases in w_ij will decrease E_i (i.e., ∂E_i/∂w_ij < 0), then increase w_ij.

• Compute ∂E_i/∂w_ij (i.e. w_ij's contribution to the error at node i) for every input weight to node i.
• Gradient Descent Method: Updating all w_ij by the delta rule amounts to moving along the path of steepest descent on the error surface.
• Difficult part: computing ∂E_i/∂w_ij.
• Base weight updates on ∂E_i/∂w_ij.
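A minimal sketch of this rule for a linear unit, using a hypothetical two-example training set D (the data and learning rate are illustrative only):

```python
# Batch gradient descent for a linear unit with E_i = 1/2 * sum_d (t_id - o_id)^2.
# For a linear unit the gradient is dE/dw_j = -sum_d (t_d - o_d) * x_jd.

def error(w, data):
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def batch_update(w, data, eta):
    grad = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for j in range(len(w)):
            grad[j] += -(t - o) * x[j]        # accumulate dE/dw_j over D
    return [wi - eta * g for wi, g in zip(w, grad)]   # w <- w - eta * grad

data = [((1.0, 1.0), 2.0), ((2.0, 0.0), 1.0)]  # hypothetical training set D
w = [0.0, 0.0]
for _ in range(200):
    w = batch_update(w, data, 0.1)
# w approaches the exact solution (0.5, 1.5) and error(w, data) goes to 0
```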

Page 15: Supervised Learning  in  Neural Networks

Computing ∂E_i/∂w_ij

∂E_i/∂w_ij = ∂/∂w_ij [ ½ Σ_{d∈D} (t_id − o_id)² ]

= ½ Σ_{d∈D} 2(t_id − o_id) ∂/∂w_ij (t_id − o_id)

= −Σ_{d∈D} (t_id − o_id) ∂o_id/∂w_ij

= −Σ_{d∈D} (t_id − o_id) ∂/∂w_ij f_T(sum_id)    where sum_id = Σ_j w_ij x_jd

In general:

∂/∂w_ij f_T(sum_id) = (∂f_T/∂sum_id)(∂sum_id/∂w_ij) = (∂f_T/∂sum_id) x_jd

Page 16: Supervised Learning  in  Neural Networks

Computing ∂f_T(sum_id)/∂w_ij

Sigmoidal f_T:

f_T(sum_id) = 1 / (1 + e^(−sum_id))

∂f_T/∂sum_id = e^(−sum_id) / (1 + e^(−sum_id))² = f_T(sum_id)(1 − f_T(sum_id))

But since f_T(sum_id) = o_id:

∂/∂w_ij f_T(sum_id) = o_id(1 − o_id) x_jd

Identity f_T:

f_T(sum_id) = sum_id,  so ∂f_T/∂sum_id = ∂sum_id/∂sum_id = 1

∂/∂w_ij f_T(sum_id) = (∂f_T/∂sum_id)(∂sum_id/∂w_ij) = (1) x_jd = x_jd

If f_T is not continuous, and hence not differentiable everywhere, then we cannot use the Delta Rule.
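The sigmoid derivative derived above can be checked numerically; a small sketch:

```python
# Check that d(sigmoid)/d(sum) = o * (1 - o) via a central finite difference.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

for s in (-2.0, -0.5, 0.0, 1.0, 3.0):
    o = sigmoid(s)
    analytic = o * (1 - o)                                  # derived form
    h = 1e-6
    numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)   # central difference
    assert abs(analytic - numeric) < 1e-8
```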

Page 17: Supervised Learning  in  Neural Networks

Weight Updates for Simple Units

f_T = identity function:

∂E_i/∂w_ij = −Σ_{d∈D} (t_id − o_id) ∂/∂w_ij f_T(sum_id) = −Σ_{d∈D} (t_id − o_id) x_jd

∆w_ij = −η ∂E_i/∂w_ij = η Σ_{d∈D} (t_id − o_id) x_jd

(Diagram: node i with input x_jd via weight w_ij, output o_id, target t_id, and error E_id.)

Page 18: Supervised Learning  in  Neural Networks

Weight Updates for Sigmoidal Units

f_T = sigmoidal function:

∂E_i/∂w_ij = −Σ_{d∈D} (t_id − o_id) ∂/∂w_ij f_T(sum_id) = −Σ_{d∈D} (t_id − o_id) o_id(1 − o_id) x_jd

∆w_ij = −η ∂E_i/∂w_ij = η Σ_{d∈D} (t_id − o_id) o_id(1 − o_id) x_jd

(Diagram: node i with input x_jd via weight w_ij, output o_id, target t_id, and error E_id.)

Page 19: Supervised Learning  in  Neural Networks

Incremental Gradient Descent

After each training instance, d, update the weights by:

Simple Unit: ∆w_ij = η(t_id − o_id) x_jd    (*same as the perceptron rule)

Sigmoidal Unit: ∆w_ij = η(t_id − o_id) o_id(1 − o_id) x_jd

where (t_id − o_id) is the error term. (Diagram: node j feeds node i through weight w_ij; x_jd is the input, o_id the output, t_id the target.)
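A sketch of the incremental sigmoidal-unit rule (the targets, inputs and learning rate are hypothetical; targets are kept inside (0,1) since a sigmoid output can never reach 0 or 1 exactly):

```python
# Incremental (per-example) updates for a single sigmoidal unit:
# delta_w_ij = eta * (t_id - o_id) * o_id * (1 - o_id) * x_jd
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def incremental_epoch(w, data, eta):
    for x, t in data:                       # update after EACH instance d
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j in range(len(w)):
            w[j] += eta * (t - o) * o * (1 - o) * x[j]
    return w

# hypothetical data: the last input component is a constant bias input of 1
data = [((0.0, 1.0), 0.9), ((1.0, 1.0), 0.1)]
w = [0.0, 0.0]
for _ in range(5000):
    incremental_epoch(w, data, 0.5)
# the unit's outputs asymptotically approach the targets 0.9 and 0.1
```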

Page 20: Supervised Learning  in  Neural Networks

Backpropagation Learning in Multi-Layer ANNs

• Still use the gradient-descent (delta) rule: ∆w_ij = −η ∂E_i/∂w_ij
– But now the effects of an arc's weight change on the error need to be computed across all nodes along all arcs from the current arc to the output layer.
• Starting from the output layer and moving back through the hidden nodes, compute an error term δ_i for each node i.
• Then, for each arc going into node i, compute the contribution of that arc's weight w_ij to the total error as δ_i x_jd, where x_jd is the output value of node j on training example d.
• When an error term has been calculated for every node to which node j sends outputs, then node j's error contribution can be computed as the product of:
– a) the influence of j's input sum (net_j) upon j's output.
– b) the sum of the contributions of j's output to each of its downstream neighbors' error terms.
• Each such contribution is simply the weight along the arc times the error contribution of the node on the downstream end of that arc.

Page 21: Supervised Learning  in  Neural Networks

BackPropagation

For each input vector:
1. Propagate the inputs forward through the network.
2. Propagate the errors backward through the network.

Error term for output units:

δ_i = −∂E_d/∂sum_id = (t_id − o_id) o_id (1 − o_id)

Error term for hidden units:

δ_i = −∂E_d/∂sum_id = o_id (1 − o_id) Σ_{k∈Outputs} δ_k w_ki

3. Compute all weight changes:

∆w_ij = η δ_i x_j

(Diagram: node j feeds node i through weight w_ij with input x_j; node i feeds downstream nodes k, l, m through weights w_ki ... w_mi, which carry back the error terms δ_k, δ_l, δ_m.)
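The two error-term formulas can be sketched for a tiny 2-2-1 sigmoid network and checked against a numerical gradient of E_d = ½(t − o)² (all weights and data below are hypothetical):

```python
# Backprop deltas on a 2-2-1 sigmoid network, verified against a
# central-difference numerical gradient of E = 1/2 * (t - o)^2.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(wh, wo, x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in wh]
    o = sigmoid(sum(w * hi for w, hi in zip(wo, h)))
    return h, o

def backprop_grads(wh, wo, x, t):
    h, o = forward(wh, wo, x)
    delta_o = (t - o) * o * (1 - o)                  # output-node error term
    delta_h = [hi * (1 - hi) * delta_o * wo[i]       # hidden-node error terms
               for i, hi in enumerate(h)]
    g_wo = [-delta_o * hi for hi in h]               # dE/dw for output weights
    g_wh = [[-delta_h[i] * xj for xj in x] for i in range(len(wh))]
    return g_wh, g_wo

def numeric_grad(wh, wo, x, t, i, j):
    eps = 1e-6
    wh[i][j] += eps
    _, op = forward(wh, wo, x)
    wh[i][j] -= 2 * eps
    _, om = forward(wh, wo, x)
    wh[i][j] += eps                                  # restore the weight
    return (0.5 * (t - op) ** 2 - 0.5 * (t - om) ** 2) / (2 * eps)

wh = [[0.3, -0.2], [0.4, 0.1]]    # hidden weights (2 nodes x 2 inputs)
wo = [0.5, -0.6]                  # output weights
x, t = [1.0, 0.5], 0.8
g_wh, g_wo = backprop_grads(wh, wo, x, t)
# g_wh[i][j] matches the numerical dE/dw to high precision
```

The hidden-node gradient here is exactly δ_i x_jd with δ_i = o_i(1 − o_i) δ_o w_oi, matching the formulas above for a single output unit.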

Page 22: Supervised Learning  in  Neural Networks

Incremental -vs- Batch Processing in Backprop Learning

• In both cases, ∆w_ij needs to be computed after EACH training instance, since ∆w_ij is a function of x_jd (for all j, d), i.e., the output of each node for each particular training instance.

• But, we can delay the updating of w_ij until after ALL training instances have been processed (batch mode).

• Incremental mode:
– w_ij = w_ij + ∆w_ij    Do after each training instance.

• Batch mode:
– sum-∆w_ij = sum-∆w_ij + ∆w_ij    Do after each training instance.
– w_ij = w_ij + sum-∆w_ij    Do after the complete training set.
– So the same weight values will be used for each training instance in an epoch (= one processing round of an entire training set).

Page 23: Supervised Learning  in  Neural Networks

Explaining Error Terms

δ_i = −∂E_d/∂sum_id

• What is node i's effect on the total error, E_d? It affects E_d via its sum of inputs on case d, sum_id.
• The negative sign is merely for convenience when updating w_ij.

(Diagram: node j feeds node i through weight w_ij, carrying input x_jd.)

• For any input weight, w_ij, to node i, its influence on E_d is simply its effect on sum_id times sum_id's effect on E_d:

∂E_d/∂w_ij = (∂sum_id/∂w_ij)(∂E_d/∂sum_id)

• A weight's influence on the sum is simply x_jd:

∂sum_id/∂w_ij = ∂/∂w_ij Σ_k w_ik x_kd = x_jd

• So once we compute a node's influence upon E_d, we can use that value to compute the influences of each weight, and thus update each weight via the negative of its influence:

∆w_ij = −η ∂E_d/∂w_ij = η δ_i x_jd

Page 24: Supervised Learning  in  Neural Networks

Output Node Error Term

• If node i is an output node, then the contribution of sum_id to E_d is its contribution to the error on the output of node i, E_id.
• The standard error function is a quadratic:

δ_i = −∂E_id/∂sum_id = −∂/∂sum_id [½(t_id − o_id)²] = (t_id − o_id) ∂o_id/∂sum_id

• The influence of the sum on the output is simply the derivative of the transfer function with respect to the sum. For a sigmoid unit, we've already shown that value to be o_id(1 − o_id):

δ_i = (t_id − o_id) ∂f_T/∂sum_id = (t_id − o_id) o_id(1 − o_id)

Page 25: Supervised Learning  in  Neural Networks

Hidden Node Error Term

• If node i is a hidden node, then the contribution of sum_id to E_d is via its contributions to the error terms of each node k that i outputs to:

δ_i = −∂E_d/∂sum_id = −Σ_{k∈Outputs} (∂o_id/∂sum_id)(∂sum_kd/∂o_id)(∂E_d/∂sum_kd)

• The influence of output o_id on sum_kd is simply the weight w_ki. And the other 2 derivatives were computed earlier:

∂sum_kd/∂o_id = ∂/∂o_id Σ_{j∈Inputs} w_kj o_jd = w_ki

∂E_d/∂sum_kd = −δ_k

∂o_id/∂sum_id = o_id(1 − o_id)    (for a sigmoid unit)

• Putting it all together:

δ_i = o_id(1 − o_id) Σ_{k∈Outputs} δ_k w_ki

(Diagram: node i feeds output node k through weight w_ki, carrying o_id.)

Page 26: Supervised Learning  in  Neural Networks

Influence of Hidden Nodes on Ed

(Diagram: hidden node i, with output o_id = f_T(sum_id), feeds downstream nodes k, l, m through weights w_ki, w_li, w_mi. Each path contributes a factor ∂sum_kd/∂o_id (similarly for l and m), and each downstream sum carries ∂E_d/∂sum_kd = −δ_k (similarly −δ_l, −δ_m) into the total error E_d.)

∂E_d/∂sum_id = Σ_{j∈Outputs} (∂o_id/∂sum_id)(∂sum_jd/∂o_id)(∂E_d/∂sum_jd)

Page 27: Supervised Learning  in  Neural Networks

Learned XOR: Version I

(Diagram: a 2-2-1 network over inputs A and B with hidden nodes X, Y and output node Z; each node also has a constant bias input of 1. The extracted weight values are -5.7, 6.5, -5.4, -3.3, 10, 8.6, -3.2, 1.1 and 3.8; the exact arc assignments are unclear in the extraction.)

 A  B | X
 1  1 | -1
 1 -1 |  1
-1  1 | -1
-1 -1 | -1
X = A ∧ ¬B

 A  B | Y
 1  1 |  1
 1 -1 |  1
-1  1 | -1
-1 -1 |  1
Y = ¬(¬A ∧ B)

 X  Y | Z
 1  1 |  1
 1 -1 |  1
-1  1 | -1
-1 -1 |  1
Z = ¬(¬X ∧ Y) ≡ X ∨ ¬Y

≡ (A ∧ ¬B) ∨ (¬A ∧ B) ≡ XOR(A, B)

Slightly sketchy: For this example, backpropagation was used with the perceptron training rule instead of the delta rule. This is necessary because f_T of the perceptron is not differentiable, and thus not amenable to the delta rule.

Cleaner Approach: For perceptron nets, use the GA to find a good weight set.

Page 28: Supervised Learning  in  Neural Networks

Learned XOR: Version II

(Diagram: the same 2-2-1 network; the extracted weight values are 4.8, -4.5, .85, -3.7, 4.7, 3.3, -.5, -.35 and .8, again with bias inputs of 1.)

 A  B | X
 1  1 |  1
 1 -1 | -1
-1  1 |  1
-1 -1 |  1
X = ¬(A ∧ ¬B)

 A  B | Y
 1  1 |  1
 1 -1 |  1
-1  1 | -1
-1 -1 |  1
Y = ¬(¬A ∧ B)

 X  Y | Z
 1  1 | -1
 1 -1 |  1
-1  1 |  1
-1 -1 |  1
Z = ¬(X ∧ Y) ≡ ¬X ∨ ¬Y

≡ (A ∧ ¬B) ∨ (¬A ∧ B) ≡ XOR(A, B)

*The concepts that nodes X and Z represent are different in the two versions.

Page 29: Supervised Learning  in  Neural Networks

Learned XOR: Version III

(Diagram: the same 2-2-1 network; the extracted weight values include 174, 9, -77, -12.9, -11.2, -1.8, -1.1, -.7 and -.65, with bias inputs of 1; the exact arc assignments are unclear in the extraction.)

 A  B | X
 1  1 |  1
 1 -1 | -1
-1  1 | -1
-1 -1 | -1
X = A ∧ B

 A  B | Y
 1  1 | -1
 1 -1 | -1
-1  1 | -1
-1 -1 |  1
Y = ¬A ∧ ¬B ≡ ¬(A ∨ B)

 X  Y | Z
 1  1 | -1
 1 -1 | -1
-1  1 | -1
-1 -1 |  1
Z = ¬X ∧ ¬Y

≡ ¬(A ∧ B) ∧ (A ∨ B) ≡ XOR(A, B)

*Nodes X, Y and Z represent different logical concepts in this version than in the previous 2 versions.
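The Boolean reading of Version III can be verified mechanically with ±1 truth values; a minimal sketch:

```python
# Version III readings: X = A AND B, Y = (NOT A) AND (NOT B),
# Z = (NOT X) AND (NOT Y); Z should equal XOR(A, B) on +/-1 values.

def AND(a, b):
    return 1 if (a == 1 and b == 1) else -1

def NOT(a):
    return -a

for A in (1, -1):
    for B in (1, -1):
        X = AND(A, B)
        Y = AND(NOT(A), NOT(B))
        Z = AND(NOT(X), NOT(Y))
        assert Z == (1 if A != B else -1)    # Z is exactly XOR(A, B)
```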

Page 30: Supervised Learning  in  Neural Networks

Recurrent Networks

Goal: Include previous states of the network as input, thereby including history in decision making.

Applications: Time Series Prediction (e.g. stocks, weather forecasts); Control Systems (robots, internal environments).

(Diagram: inputs x1(t) and x2(t), plus the history y(t), y(t-1), y(t-2), feed a network that predicts y(t+1).)

• This works fine.
• Normal backprop applies.
• But the length of the history is hard-wired.

Page 31: Supervised Learning  in  Neural Networks

Recurrent Networks (2)

• v(t) means "value of v after its t-th update".
• Assume unthresholded neurons in this example.
• r(t) = m(t) + y(t)
• m(t+1) = p*r(t) + X(t+1)
  where X(t) = x1(t)w_m1 + x2(t)w_m2 and 0 <= p <= 1 (the decay/forgetting rate).

(Diagram: inputs x1(t) and x2(t) feed memory node m through weights w_m1 and w_m2; m and the output y(t) feed node r, which produces y(t+1).)

m(1) = p*r(0) + X(1) = X(1)
r(1) = m(1) + y(1) = X(1) + y(1)
m(2) = p*r(1) + X(2) = p[X(1) + y(1)] + X(2)
r(2) = m(2) + y(2) = p[X(1) + y(1)] + X(2) + y(2)
m(3) = p*r(2) + X(3) = p²[X(1) + y(1)] + p[X(2) + y(2)] + X(3)
...
m(k) = Σ_{i=1}^{k-1} p^i [X(k−i) + y(k−i)] + X(k)

=> Decaying importance of earlier inputs & outputs. High (low) p => slow (fast) decay.
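The closed form for m(k) can be checked by iterating the recurrence on hypothetical input and output sequences:

```python
# Verify: m(k) = sum_{i=1}^{k-1} p^i * [X(k-i) + y(k-i)] + X(k)
p = 0.7
X = {1: 0.5, 2: -0.2, 3: 0.9, 4: 0.1}    # hypothetical external input X(t)
y = {1: 0.3, 2: 0.6, 3: -0.4, 4: 0.8}    # hypothetical network output y(t)

# Iterate the recurrence: m(1) = X(1); r(t) = m(t) + y(t); m(t+1) = p*r(t) + X(t+1)
m = {1: X[1]}
for t in (1, 2, 3):
    r = m[t] + y[t]
    m[t + 1] = p * r + X[t + 1]

def closed_form(k):
    return sum(p ** i * (X[k - i] + y[k - i]) for i in range(1, k)) + X[k]

# closed_form(k) equals the iterated m[k] for every k
```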

Page 32: Supervised Learning  in  Neural Networks

Momentum

Gradient Descent can easily get stuck at a local minimum of the error landscape. Momentum allows the previous search direction to influence the current direction:

∆w_kj(t+1) = η δ_k x_j + α ∆w_kj(t)

where the first term is the standard delta-rule update, the second is the momentum term, and 0 <= η, α <= 1.

Momentum smoothes the error landscape by guiding search in the best average direction (i.e., that which will, on average, decrease the error most) for a region (that may be quite jagged).

(Diagram: the best local move combined with the momentum vector yields the resultant search direction.)
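A sketch of the update with a hypothetical, oscillating sequence of raw gradient terms, showing the smoothing effect:

```python
# Momentum update: dw(t+1) = eta * delta_k * x_j + alpha * dw(t).
# Here grad_term stands in for the product delta_k * x_j.

def momentum_step(grad_term, prev_dw, eta=0.1, alpha=0.9):
    return eta * grad_term + alpha * prev_dw

# Oscillating raw gradient terms, as on a jagged error surface:
grads = [1.0, -0.8, 1.0, -0.8, 1.0, -0.8]
dw = 0.0
steps = []
for g in grads:
    dw = momentum_step(g, dw)
    steps.append(dw)
# every step stays positive: momentum keeps a consistent average
# direction despite the sign flips in the raw gradient
```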

Page 33: Supervised Learning  in  Neural Networks

Design Issues for Learning ANNs

• Initial Weights

– Random -vs- Biased

– Width of init range
• Typically: [-1 1] or [-0.5 0.5]
• Too wide => large weights will drive many sigmoids to saturation => all output 1 => takes a lot of training to undo.

• Frequency of Weight Updates

– Incremental - after each input.
• In some cases, all training instances are not available at the same time, so the ANN must improve on-line.

• Sensitive to presentation order.

– Batch - after each epoch.
• Uses less computation time

• Insensitive to presentation order

Page 34: Supervised Learning  in  Neural Networks

Design Issues (2)

• Learning Rate

– Low value => slow learning

– High value => faster learning, but oscillation due to overshoot

– Typical range: [0.1 0.9] - Very problem specific!

– Dynamic learning rate (many heuristics):
• Gradually decrease over the epochs

• Increase (decrease) whenever performance improves (worsens)

• Use the 2nd derivative of the error function:
– d²E/dw² high => changing dE/dw => rough landscape => lower the learning rate
– d²E/dw² low => dE/dw ~ constant => smooth landscape => raise the learning rate

• Length of Training Period

– Too short => Poor discrimination/classification of inputs

– Too long => Overtraining => nearly perfect on training set, but not general enough to handle new (test) cases.

– Many nodes & long training period = recipe for overtraining

– Adding noise to training instances can help prevent overtraining:
• (x1, x2, ..., xn) => add noise => (x1+e1, x2+e2, ..., xn+en)

Page 35: Supervised Learning  in  Neural Networks

Design Issues (3)

• Size of Training Set
– Heuristic: |Training Set| > k·|Set of Weights|, where k > 5 or 10.

• Stopping Criteria
– Low error on training set
• Can lead to overtraining if the threshold is too low.
– Include an extra validation set (preliminary test set) and test the ANN on it after each epoch.
• Stop when validation error is low enough.

Page 36: Supervised Learning  in  Neural Networks

Supervised Learning ANN Applications

• Classification D: feature vectors => R: classes

– Medical Diagnosis: symptoms => disease

– Letter Recognition: pixel array => letter

• Control D: situation state vectors => R: responses/actions

– CMU’s ALVINN: road picture => steering angle

– Chemical plants: temperature, pressure, chemical concentrations in a container => valve settings for heat, chemical inputs/outputs

• Prediction D: time series of previous states s1, s2, ..., sn => R: next state sn+1

– Finance: price of a stock on days 1...n => price on day n+1
– Meteorology: cloud cover on days 1...n => cloud cover on day n+1

Page 37: Supervised Learning  in  Neural Networks

Supervised Learning in the Cerebellum

• The inferior olive gets sensory input from the skin, joints, muscle stretch receptors, etc. These indicate the stresses put on the body as a result of the motor action.

• Climbing fibers from the inferior olive provide feedback (i.e. an error signal) which causes LTD (long-term depression, i.e. weakening) of the parallel fiber-Purkinje synapses.

• Hence, learning in the cerebellum has a “supervised” flavor to it.

(Diagram: thought signals from the cerebral neocortex travel along mossy fibers to granule cells, whose parallel fibers synapse on Purkinje cells; climbing fibers from the inferior olive also contact the Purkinje cells, whose output goes to motor cortex (action!). Arrows denote signal direction.)

Page 38: Supervised Learning  in  Neural Networks

Biology & Backpropagation

(Diagram: a pre-synaptic neuron N whose axon makes synapses onto the dendrites of a post-synaptic neuron P; a depolarizing signal can spread backwards across the synapses.)

• When a post-synaptic neuron, P, fires an action potential, this can have electrical (and then chemical) effects upon P's dendrites.
• These effects may even spread to the pre-synaptic neuron, N, and even further back.
• BUT, this is only qualitatively similar to formal ANN backprop. It has no quantitative similarity to the delta rule; i.e. no error signal is being sent backwards as the basis for altering synapses in a manner that will reduce the error.
• The evidence for forms of Reinforcement Learning (RL) in the brain is more convincing.

Page 39: Supervised Learning  in  Neural Networks

Backpropagation in Brain Research

• Feedforward ANNs trained using backprop are useful in neurophysiology.

• Given: Pairs (Sensory Inputs, Behavioral Actions) {e.g. from psychological experiments}

• Backprop produces:

– A general mapping/function that both

• covers the I/O pairs and

• can be used to PREDICT the behavioral effects of untested inputs.

– A neural model of how the mapping might be wired in the brain.

• The model tells neurobiologists WHAT to look for and WHERE.

• So even though the means of producing the proper (trained) network is not biological, the RESULT can be very biological and useful.

• The brain has so many neurons and so many connections that biologists need LOTS of help to narrow the search, both in terms of:
– the structures of neural circuits in a region
– the functions embodied in those circuits.

• Backprop finds error minima; evolution may have also done so!

Page 40: Supervised Learning  in  Neural Networks

Backprop Brain Applications

• Vestibulo-Ocular Response (VOR): Discovering synaptic strengths in the circuitry that integrates signals from the ears (vestibular - balance and head velocity) and eyes (ocular - visual images) so that we can keep our eyes focused on an object when it and/or we are moving.

• LEECHNET: Tuning (hundreds of) weight parameters in the neural circuitry that governs the bending of leech body segments in response to touch stimuli.

• Stereo Vision: Tuning large networks for integrating images from the right and left eye so as to detect object depth.

Page 41: Supervised Learning  in  Neural Networks

Stereoscopic Vision

• From 2 2-d images (taken from points a few cm apart) the brain creates a 3-d picture.

• Integration of the 2-d images occurs in the visual cortex.

Page 42: Supervised Learning  in  Neural Networks

Fixation Cells

• Fixation Depth: distance from the nose to the intersection of the two lines of sight.

• Vergence Angle: deviation of the eyes from parallel.

• Fixation cells: neurons of the visual cortex that are only active when their location in the visual cortex is receiving the SAME signal from both eyes.
– Recognize objects at the fixation depth.

(Diagram: the two eyes converge with vergence angle θ on a point at fixation depth d; matching pixel values from the left- and right-eye images line up on the row of fixation cells.)

Page 43: Supervised Learning  in  Neural Networks

Near & Far Cells

Detect correspondences between left and right images that indicate objects in front of or behind the fixation depth.

• The corresponding connections are shifted one unit to the left as compared to the fixation cells.
• There are near cells for many different levels of nearness.
• For greater levels of nearness, shift more.
• Far cells work similarly, but with the opposite shift.

(Diagram: as for the fixation cells, with vergence angle θ and fixation depth d, but with the left- and right-image connections offset by one unit onto the row of near cells.)

Page 44: Supervised Learning  in  Neural Networks

ANN for Stereoscopic Vision

• Fusion.Net Paul Churchland (1992)

• Uses artificial neurons to model the combination of 2-d images in the visual cortex.

• Given: 2 2-d pictures

• Detect: Objects at the fixation depth and in the near & far zones.

• Use same type of neuron for fixation, near and far cells, just use connection patterns with different shifts.

• X is a fixation, near or far cell.
• It acts like a NOT-XOR.
• If both pixels are the same, then neither of the hidden nodes fires. Hence, the output node is dominated by the constant input, and it fires.
• If the pixels differ, then exactly one of the hidden nodes fires, and its output will then inhibit the output node.

(Diagram: the right-eye and left-eye pixels feed two hidden nodes through weights of +1 and -1; the hidden nodes inhibit output cell X, which also receives a constant input of 1.)
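A sketch of such a cell, assuming 0/1 pixel values and step-threshold units (the specific weights below are illustrative, not taken from the slide):

```python
# NOT-XOR cell: fires exactly when both eyes report the same pixel value.

def step(s):
    return 1 if s > 0 else 0

def fusion_cell(left, right):
    h1 = step(left - right)          # fires iff left pixel on, right off
    h2 = step(right - left)          # fires iff right pixel on, left off
    # the constant input keeps the output on unless a hidden node inhibits it
    return step(1 - 2 * h1 - 2 * h2)

# fusion_cell(0, 0) and fusion_cell(1, 1) fire; mismatched pixels do not
```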

Page 45: Supervised Learning  in  Neural Networks

Fusion.Net Example

The ANN detects patterns at 3 levels: fixation depth (0), nearness depth (-1), farness depth (+1).

When seen through stereo glasses, this is a stack of 3 boxes viewed from the top. Fusion.net detects the 3 boxes, one at each depth level. Each box is flat, since there are only 3 detection levels, and each is a plane.

Page 46: Supervised Learning  in  Neural Networks

How Realistic is Fusion.net?

• Similar neurons have been found, but the corresponding connections have not been verified

(Diagram: the same NOT-XOR cell, with its units annotated as cortical layers 3, 4 and 5.)

• Both the brain and Fusion.net need brightness variation to stimulate stereopsis

Needed Improvements to Fusion.net:• More depth levels• Remove constant neurons• Input layer cells shouldn’t have both excitatory and inhibitory effects (biologically unrealistic)