
Secrets of Neural Network Models

Ken Norman

Princeton University

July 24, 2003

Note: These slides have been provided online for the convenience of students attending the 2003 Merck summer school, and for individuals who have explicitly been given permission by Ken Norman. Please do not distribute these slides to third parties without permission from Ken (which is easy to get… just email Ken at [email protected]).

The Plan, and Acknowledgements

The Plan:

• I will teach you all of the secrets of neural network models in 2.5 hours

• Lecture for the first half

• Hands-on workshop for the second half

Acknowledgements:

• Randy O’Reilly

• my lab: Greg Detre, Ehren Newman, Adler Perotte, and Sean Polyn

The Big Question

• How does the gray glop in your head give rise to cognition?

• We know a lot about the brain, and we also know a lot about cognition

• The real challenge is to bridge between these two levels

Complexity and Levels of Analysis

• The brain is very complex: billions of neurons, trillions of synapses, all changing every nanosecond

• Each neuron is a very complex entity unto itself

• We need to abstract away from this complexity!

• Is there some simpler, higher level for describing what the brain does during cognition?

• We want to draw on neurobiology for ideas about how the brain performs a particular kind of task

• Our models should be consistent with what we know about how the brain performs the task

• But at the same time, we want to include only aspects of neurobiology that are essential for explaining task performance

Learning and Development

• Neural network models provide an explicit, mechanistic account of how the brain changes as a function of experience

• Goals of learning:

• To acquire an internal representation (a model) of the world that allows you to predict what will happen next, and to make inferences about “unseen” aspects of the environment

• The system must be robust to noise/degradation/damage

• Focus of workshop: Use neural networks to explore how the brain meets these goals

Outline of Lecture

• What is a neural network?

• Principles of learning in neural networks:

• Hebbian learning: Simple learning rules that are very good at extracting the statistical structure of the environment (i.e., what things are there in the world, and how are they related to one another)

• Shortcomings of Hebbian learning: It’s good at acquiring coarse category structure (prototypes) but it’s less good at learning about atypical stimuli and arbitrary associations

• Error-driven learning: Very powerful rules that allow networks to learn from their mistakes

Outline, Continued

• The problem of interference in neocortical networks, and how the hippocampus can help alleviate this problem

• Brief discussion of PFC and how networks can support active maintenance in the face of distracting information

• Background information for the “hands-on” portion of the workshop

Overall Philosophy

• The goal is to give you a good set of intuitions for how neural networks function

• I will simplify and gloss over lots of things.

• Please ask questions if you don’t understand what I’m saying...

What is a neural network?

• Neurons measure how much input they receive from other neurons; they “fire” (send a signal) if input exceeds a threshold value

• Input is a function of firing rate and connection strength

• Learning in neural networks involves adjusting connection strength

What is a neural network?

• Key simplifications:

• We reduce all of the complexity of neuronal firing to a single number, the activity of the neuron, that reflects how often the neuron is spiking

• We reduce all of the complexity of synaptic connections between neurons to a single number, the synaptic weight, that reflects how strong the connection is
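These two simplifications can be made concrete in a few lines. The sketch below is illustrative, not the workshop software: the sigmoid squashing function, gain, and threshold values are all assumptions.

```python
import math

def unit_activity(inputs, weights, threshold=1.0, gain=5.0):
    """Compute a unit's activity from sending activities and synaptic weights.

    Each input is a sending unit's activity (roughly, its firing rate);
    each weight is the strength of the corresponding synapse.
    """
    net_input = sum(x * w for x, w in zip(inputs, weights))
    # Sigmoid of (net input - threshold): near 0 below threshold,
    # rising toward 1 as the input exceeds it ("firing").
    return 1.0 / (1.0 + math.exp(-gain * (net_input - threshold)))

# A unit tuned to the conjunction of its first two inputs:
weights = [0.8, 0.8, 0.0]
print(unit_activity([1, 1, 0], weights))  # both features present: high activity
print(unit_activity([1, 0, 0], weights))  # only one feature: low activity
```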

Neurons are Detectors

• Each neuron is detecting some set of conditions (e.g., a smoke detector detects smoke). A neuron’s representation is the set of conditions that it detects.

Understanding Neural Components in Terms of the Detector Model

Detector Model

• Neurons feed on each other’s outputs; layers of ever more complicated detectors

• Things can get very complex in terms of content, but each neuron is still carrying out the basic detector function

Two-layer Attractor Networks

Input/Output Layer

Hidden Layer (Internal Representation)

• Model of processing in neocortex

• Circles = units (neurons); lines = connections (synapses)

• Unit brightness = activity; line thickness = synaptic weight

• Connections are symmetric

Two-layer Attractor Networks

Input/Output Layer

Hidden Layer (Internal Representation)

• Units within a layer compete to become active.

• Competition is enforced by inhibitory interneurons that sample the amount of activity in the layer and send back a proportional amount of inhibition

• Inhibitory interneurons prevent epilepsy in the network

• Inhibitory interneurons are not pictured in subsequent diagrams
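The inhibitory feedback loop can be sketched as follows. This is a toy illustration; the constant `k` and the averaging scheme are assumptions, not the exact inhibition function used in the simulator.

```python
def apply_inhibition(activities, k=0.7):
    """Simulate a pooled inhibitory interneuron: it samples the average
    activity in the layer and feeds back inhibition proportional to it."""
    inhibition = k * sum(activities) / len(activities)
    # Every unit receives the same shared inhibition (activity floored at 0),
    # so only the most strongly driven units survive the competition.
    return [max(0.0, a - inhibition) for a in activities]

layer = [0.9, 0.8, 0.3, 0.1]
print(apply_inhibition(layer))  # strong units stay active; weak units shut off
```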


Two-layer Attractor Networks

Input/Output Layer

Hidden Layer (Internal Representation)

• These networks are capable of sustaining a stable pattern of activity on their own.

• “Attractor” = a fancy word for “stable pattern of activity”

• Real networks are much larger than this; also, more than one unit is active in the hidden layer...


Properties of Two-Layer Attractor Networks

• I will show that these networks are capable of meeting the “learning goals” outlined earlier

• Given partial information (e.g., seeing something that has wings and feathers), the networks can make a “guess” about other properties of that thing (e.g., it probably flies)

• Networks show graceful degradation
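The settling process behind these properties can be illustrated with a tiny attractor network. This is a Hopfield-style sketch with binary units and hand-set symmetric weights; the real networks use graded activities and learned weights.

```python
def settle(state, weights, threshold=0.0, steps=10):
    """Iteratively update binary units until the pattern stops changing.
    The weight matrix must be symmetric; the stable pattern the network
    reaches is the 'attractor'."""
    n = len(state)
    for _ in range(steps):
        new = [1 if sum(weights[i][j] * state[j] for j in range(n) if j != i) > threshold
               else 0
               for i in range(n)]
        if new == state:
            break
        state = new
    return state

# Units: wings, beak, feathers, flies -- all mutually excitatory (a "bird" attractor)
W = [[0 if i == j else 0.5 for j in range(4)] for i in range(4)]
partial = [1, 1, 0, 0]      # we only see wings and beak
print(settle(partial, W))   # -> [1, 1, 1, 1]: the network fills in "feathers", "flies"
```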

“Pattern Completion” in two-layer networks

wings beak feathers flies

Networks are Robust to Damage, Noise

wings feathers flies

Learning: Overview

• Learning = changing connection weights

• Learning rules: How to adjust weights based on local information (presynaptic and postsynaptic activity) to produce appropriate network behavior

• Hebbian learning: building a statistical model of the world, without an explicit teacher...

• Error-driven learning: rules that detect undesirable states and change weights to eliminate these undesirable states...

Building a Statistical Model of the World

• The world is inhabited by things with relatively stable sets of features

• We want to wire detectors in our brains to detect these things. How can we do this?

• Answer: Leverage correlation

• The features of a particular thing tend to appear together, and to disappear together; a thing is nothing more than a correlated cluster of features

• Learning mechanisms that are sensitive to correlation will end up representing useful things

Hebbian Learning

• How does the brain learn about correlations?

• Donald Hebb proposed the following mechanism:

• When the pre-synaptic neuron and post-synaptic neuron are active at the same time, strengthen the connection between them

• “neurons that fire together, wire together”

Hebbian Learning

• Proposed by Donald Hebb

• When the pre-synaptic (sending) neuron and post-synaptic (receiving) neuron are active at the same time, strengthen the connection between them

• “neurons that fire together, wire together”

• When two neurons are connected, and one is active but the other is not, reduce the connections between them

• “neurons that fire apart, unwire”
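Both halves of Hebb’s rule can be captured in one line. A common formulation that does this is the CPCA Hebbian rule from Randy O’Reilly’s framework, in which each weight moves toward the sender’s activity whenever the receiver is active; the learning rate and activity values below are illustrative.

```python
def hebbian_update(w, pre, post, lr=0.1):
    """CPCA-style Hebbian update: when the receiving unit is active, the
    weight moves toward the sending unit's activity -- up if the sender was
    also active ("fire together, wire together"), down toward zero if the
    sender was silent ("fire apart, unwire")."""
    return w + lr * post * (pre - w)

w = 0.5
w = hebbian_update(w, pre=1.0, post=1.0)   # both active: weight grows
print(round(w, 3))   # 0.55
w = hebbian_update(w, pre=0.0, post=1.0)   # receiver active, sender silent: weight shrinks
print(round(w, 3))   # 0.495
```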

Biology of Hebbian Learning: NMDA-Mediated Long-Term Potentiation

Biology of Hebbian Learning: Long-Term Depression

• When the postsynaptic neuron is depolarized, but presynaptic activity is relatively weak, you get weakening of the synapse

What Does Hebbian Learning Do?

• Hebbian learning tunes units to represent correlated sets of input features.

• Here is why:

• Say that a unit has 1,000 inputs

• In this case, turning on and off a single input feature won’t have a big effect on the unit’s activity

• In contrast, turning on and off a large cluster of 900 input features will have a big effect on the unit’s activity
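A quick numeric illustration of this point (all numbers are made up: 1,000 equally weighted inputs feeding a sigmoid unit):

```python
import math

def activity(n_active, w=0.002, gain=4.0, threshold=1.0):
    """Sigmoid activity of a unit receiving n_active inputs of weight w each."""
    net = n_active * w
    return 1.0 / (1.0 + math.exp(-gain * (net - threshold)))

# Toggling one feature out of 1,000 barely moves the unit...
print(activity(900) - activity(899))
# ...but toggling a 900-feature cluster swings it from silent to fully active:
print(activity(900) - activity(0))
```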

Hebbian Learning

• Because small clusters of inputs do not reliably activate the receiving unit, the receiving unit does not learn much about these inputs

Hebbian Learning

Big clusters of inputs reliably activate the receiving unit, so the network learns more about big (vs. small) clusters (the “gang effect”).

What Does Hebbian Learning Do?

• Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more!

Hebbian Learning

scaly slithers wings beak feathers flies

What Does Hebbian Learning Do?

• Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more!

• The outcome of Hebbian learning is a function of how well different inputs activate the unit, and how frequently they are presented

Self-Organizing Learning

• One detector can only represent one thing (i.e., pattern of correlated features)

• Goal: We want to present input patterns to the network and have different units in the network “specialize” for different things, such that each thing is represented by at least one unit

• Random weights (different initial receptive fields) and competition are important for achieving this goal

• What happens without competition ...

No Competition

lives underwater scaly slithers wings beak feathers flies

Without competition, all units end up representing the same “gang” of features; other, smaller correlations get ignored

Competition is important

lives underwater scaly slithers wings beak feathers flies

striped orange sharp teeth furry yellow chirps lives underwater

When units have different initial “receptive fields” and they compete to represent input patterns, units end up representing different things

Hebbian Learning: Summary

• Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more

• When:

• There are multiple hidden units competing to represent input patterns

• Each hidden unit starts out with a distinct receptive field

Then:

• Hebbian learning will tune these units so that each thing in the world (i.e., each cluster of correlated features) is represented by at least one unit
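This When/Then claim can be demonstrated in a few lines. The sketch below is a toy self-organizing simulation, not the workshop software: hard winner-take-all stands in for inhibitory competition, and the patterns, seed, learning rate, and epoch count are all arbitrary choices.

```python
import random

def train_competitive(patterns, n_units=2, n_epochs=50, lr=0.2, seed=0):
    """Toy self-organizing learning: units with random initial weights
    (distinct receptive fields) compete for each input; only the winner's
    weights are updated, Hebbian-style, toward the current pattern."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(len(patterns[0]))] for _ in range(n_units)]
    for _ in range(n_epochs):
        for x in patterns:
            # Net input of each unit; the most activated unit wins the competition.
            nets = [sum(w * xi for w, xi in zip(wu, x)) for wu in weights]
            winner = nets.index(max(nets))
            # CPCA-style Hebbian update for the winner only.
            weights[winner] = [w + lr * (xi - w) for w, xi in zip(weights[winner], x)]
    return weights

bird  = [1, 1, 1, 1, 0, 0]   # wings, beak, feathers, flies
snake = [0, 0, 0, 0, 1, 1]   # scaly, slithers
weights = train_competitive([bird, snake])
# Each unit ends up tuned to one cluster of correlated features:
for wu in weights:
    print([round(w, 2) for w in wu])
```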

Problems with Penguins

slithers lives in Antarctica waddles wings beak feathers flies

Problems with Hebb, and Possible Solutions

• Self-organizing Hebbian learning is capable of discovering the “high-level” (coarse) categorical structure of the inputs

• However, it sometimes collapses across more subtle (but important) distinctions, and the learning rule does not have any provisions for fixing these errors once they happen

Problems with Hebb, and Possible Solutions

• In the penguin problem, if we want the network to remember that typical birds fly, but penguins don’t, then penguins and typical birds need to have distinct (non-identical) hidden representations

• Hebbian learning assigns the same hidden unit to penguins and typical birds

• We need to supplement Hebbian learning with another learning rule that is sensitive to when the network makes an error (e.g., saying that penguins fly) and corrects the error by pulling apart the hidden representations of penguins vs. typical birds.

What is an error, exactly?

• One common way of conceptualizing error is in terms of predictions and outcomes

• If you give the network a partial version of a studied pattern, the network will make a prediction as to the missing features of that pattern (e.g., given something that has “feathers”, the network will guess that it probably flies)

• Later, you learn what the missing features are (the outcome). If the network’s guess about the missing features is wrong, we want the network to be able to change its weights based on the difference between the prediction and the outcome.

• Today, I will present the GeneRec error-driven learning rule developed by Randy O’Reilly.

Error-Driven Learning

slithers lives in Antarctica waddles wings beak feathers flies

Prediction phase:

• Present a partial pattern

• The network makes a guess about the missing features.

Outcome phase:

• Present the full pattern

• Let the network settle

• We now need to compare these two activity patterns and figure out which weights to change.

Motivating the Learning Rule

• The goal of error-driven learning is to discover an internal representation for the item that activates the correct answer.

• Basically, we want to find hidden units that are associated with the correct answer (in this case, “waddles”).

• The best way to do this is to examine how activity changes when “waddles” is clamped on during the “outcome” phase.

• Hidden units that are associated with “waddles” should show an increase in activity in the outcome (vs. prediction) phase.

• Hidden units that are not associated with “waddles” should show a decrease in activity in the outcome phase (because of increased competition from other units that are associated with “waddles”).

Motivating the Learning Rule

• Hidden units that are associated with “waddles” should show an increase in activity in the outcome (vs. prediction) phase.

• Hidden units that are not associated with “waddles” should show a decrease in activity in the outcome phase

• Here is the learning rule:

• If a hidden unit shows increased activity (i.e., it’s associated with the correct answer), increase its weights to the input pattern

• If a hidden unit shows decreased activity (i.e., it’s not associated with the correct answer), reduce its weights to the input pattern
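This rule can be written compactly. The sketch below follows the form of the GeneRec weight update (O’Reilly, 1996): the weight change is the sending unit’s prediction-phase activity times the receiver’s activity difference between the two phases. The specific numbers are illustrative.

```python
def generec_update(w, pre_minus, post_minus, post_plus, lr=0.1):
    """GeneRec-style update: the weight from a sending unit changes in
    proportion to the sender's activity times the change in the receiving
    unit's activity from the prediction (minus) to the outcome (plus) phase."""
    return w + lr * pre_minus * (post_plus - post_minus)

# A hidden unit associated with the correct answer: its activity rises in the
# outcome phase, so its weights to the active inputs grow...
print(round(generec_update(w=0.4, pre_minus=1.0, post_minus=0.3, post_plus=0.8), 3))  # 0.45
# ...while a unit whose activity drops has its weights reduced:
print(round(generec_update(w=0.4, pre_minus=1.0, post_minus=0.8, post_plus=0.3), 3))  # 0.35
```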

Error-Driven Learning

slithers lives in Antarctica waddles wings beak feathers flies

• Hebb and error have opposite effects on weights here!

• Error increases the extent to which penguin is linked to the right-hand unit, whereas Hebb reinforced penguin’s tendency to activate the left-hand unit


Catastrophic Interference

• If you change the weights too strongly in response to “penguin”, then the network starts to behave as if all birds waddle. New learning interferes with stored knowledge...

• The best way to avoid this problem is to make small weight changes, and to interleave “penguin” learning trials with “typical bird” trials

• The “typical bird” trials serve to remind the network to retain the association between wings/feathers/beak and “flies”...
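The interference-vs-interleaving point can be demonstrated with a single connection weight trained by a delta rule. This is a deliberately minimal stand-in for the full network; the learning rates and trial schedules are arbitrary.

```python
def train(schedule, lr):
    """Track one connection weight encoding 'bird features -> flies'.
    Each trial nudges the weight toward that trial's target (a delta rule):
    target 1.0 for a typical bird (it flies), 0.0 for a penguin (it doesn't)."""
    w = 1.0                       # the network starts out knowing that birds fly
    for trial in schedule:
        target = 0.0 if trial == "penguin" else 1.0
        w += lr * (target - w)
    return w

# Massed penguin trials with a large learning rate: "birds fly" is obliterated.
print(round(train(["penguin"] * 10, lr=0.5), 3))
# Penguins interleaved among typical birds, small learning rate: mostly retained.
print(round(train(["typical bird", "typical bird", "penguin"] * 10, lr=0.1), 3))
```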

Interleaved Training

slithers lives in Antarctica waddles wings beak feathers flies

Gradual vs. One-Trial Learning

• Problem: It appears that the solution to the catastrophic interference problem is to learn slowly.

• But we also need to be able to learn quickly!

Gradual vs. One-Trial Learning

• Put another way: There appears to be a trade-off between learning rate and interference in the cortical network

• Our claim is that the brain avoids this trade-off by having two separate networks:

• A slow-learning cortical network that gradually develops internal representations that support generalization, prediction, categorization, etc.

• A fast-learning hippocampal network that is specialized for rapid memorization (but does not support generalization, categorization, etc.)

Diagram: hippocampal circuit (Entorhinal Cortex input → Dentate Gyrus → CA3 → CA1 → Entorhinal Cortex output)

Diagram: the hippocampus sits at the top of the hierarchy (lower-level cortex → neocortex → hippocampus)

Interactions Between Hippo and Cortex

• According to the Complementary Learning Systems theory (McClelland et al., 1995), hippocampus rapidly memorizes patterns of cortical activity.

• The hippocampus manages to learn rapidly without suffering catastrophic interference because it has a built-in tendency to assign distinct, minimally overlapping representations to input patterns, even when they are very similar. Of course this hurts its ability to categorize.
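One way to see how distinct codes reduce overlap is with conjunctive coding, a toy stand-in for the hippocampus’s pattern-separation machinery (the feature names are illustrative; the real mechanism involves sparse, high-dimensional codes in the dentate gyrus).

```python
from itertools import combinations

def conjunctive_code(features):
    """Toy pattern separation: represent an input by units that each detect a
    *pair* of co-active features. Conjunctive codes shrink the overlap between
    similar inputs, because a pair is shared only if BOTH its features are."""
    return set(combinations(sorted(features), 2))

def overlap(a, b):
    """Jaccard overlap between two sets of active units."""
    return len(a & b) / len(a | b)

bird    = {"wings", "beak", "feathers", "flies", "small"}
penguin = {"wings", "beak", "feathers", "waddles", "big"}

print(overlap(bird, penguin))                                      # raw feature overlap
print(overlap(conjunctive_code(bird), conjunctive_code(penguin)))  # separated codes: lower
```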

Interactions Between Hippo and Cortex

• The theory states that, when you are asleep, the hippocampus “plays back” stored patterns in an interleaved fashion, thereby allowing cortex to weave new facts and experiences into existing knowledge structures.

• Even if something just happens once in the real world, hippocampus can keep re-playing it to cortex, interleaved with other events, until it sinks in...

• Detailed theory:

• slow-wave sleep = hippo playback to cortex

• REM sleep = cortex randomly activates stored representations; this strengthens pre-existing knowledge and protects it against interference

Role of the Hippocampus

slithers lives in Antarctica waddles wings beak feathers flies

hippocampus

Error-Driven Learning: Summary

• Error-driven learning algorithms are very powerful: So long as the learning rate is small, and training patterns are presented in an interleaved fashion, algorithms like GeneRec can learn internal representations that support good “pattern completion” of missing features.

• Error-driven learning is not meant to be a replacement for Hebbian learning: The two algorithms can co-exist!

• Hebbian learning actually improves the performance of GeneRec by ensuring that hidden units represent meaningful clusters of features

Error-Driven Learning: Summary

• Theoretical issues to resolve with error-driven learning: The algorithm requires that the network “know” whether it is in a “prediction” phase or an “outcome” phase. How does the network know this?

• For that matter, the whole “phases” idea is sketchy

• GeneRec based on “prediction/outcome” differences is not the only way to do error-driven learning...

• Backpropagation

• Learning by reconstruction

• Adaptive Resonance Theory (Grossberg & Carpenter)

Learning by Reconstruction

• Instead of doing error-driven learning by comparing predictions and outcomes, you can also do error-driven learning as follows:

• First, you clamp the correct, full pattern onto the network and let it settle.

• Then, you erase the input pattern and see whether the network can reconstruct the input pattern based on its internal representation

• The algorithm is basically the same; you are still comparing two phases...

Learning by Reconstruction

slithers lives in Antarctica waddles wings beak feathers flies

• Clamp the to-be-learned pattern onto the input and let the network settle

• Next, wipe the input layer clean (but not the hidden layer) and let the network settle

• Compare hidden activity in the two phases and adjust weights accordingly (i.e., if activation was higher with the correct answer clamped, increase weights; if activation was lower, decrease weights)

Adaptive Resonance Theory

slithers lives in Antarctica waddles wings beak feathers flies

MISMATCH!

Spreading Activation vs. Active Maintenance

• Spreading activation is generally very useful... it lets us make predictions/inferences/etc.

• But sometimes you just want to hold on to a pattern of activation without letting activation spread (e.g., a phone number, or a person’s name).

• How do we maintain specific patterns of activity in the face of distraction?

Spreading Activation vs. Active Maintenance

• As you will see in the “hands-on” part of the workshop, the networks we have been discussing are not very robust to noise/distraction.

• Thus, there appears to be another tradeoff:

• Networks that are good at generalization/prediction are lousy at holding on to phone numbers/plans/ideas in the face of distraction

Spreading Activation vs. Active Maintenance

• Solution: We have evolved a network that is optimized for active maintenance: Prefrontal cortex! This complements the rest of cortex, which is good at generalization but not so good at active maintenance.

• PFC uses isolated representations to prevent spread of activity...

• Evidence for isolated stripes in PFC

Tripartite Functional Organization

PC = posterior perceptual & motor cortex

FC = prefrontal cortex

HC = hippocampus and related structures

Tripartite Functional Organization

PC = incremental learning about the structure of the environment

FC = active maintenance, cognitive control

HC = rapid memorization

Roles are defined by functional tradeoffs…

Key Trade-offs

• Extracting what is generally true (across events) vs. memorizing specific events

• Inference (spreading activation) vs. robust active maintenance

Hands-On Exercises

• The goal of the hands-on part of the workshop is to get a feel for the kinds of representations that are acquired by Hebbian vs. error-driven learning, and for network dynamics more generally.

• Here is the network that we will be using:

• Activity constraints: Only 10% of hidden units can be strongly active at once; in the input layer, only one unit per row

• Think of each row in the input as a feature dimension (e.g., shape) and the units in that row are mutually exclusive features along that dimension (square, circle, etc.)
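The 10% activity constraint is a k-winners-take-all (kWTA) rule, as used in O’Reilly’s framework. A minimal sketch (the real simulator computes inhibition more gradually; this hard cutoff is a simplification):

```python
def kwta(net_inputs, frac=0.10):
    """Enforce the activity constraint: only the top `frac` of units
    (ranked by net input) are allowed to be strongly active; the rest
    are silenced, as if by pooled inhibition."""
    k = max(1, int(len(net_inputs) * frac))
    cutoff = sorted(net_inputs, reverse=True)[k - 1]
    return [1.0 if n >= cutoff else 0.0 for n in net_inputs]

nets = [0.9, 0.1, 0.4, 0.8, 0.2, 0.3, 0.7, 0.5, 0.6, 0.05] * 2  # 20 hidden units
active = kwta(nets)
print(sum(active))   # -> 2.0 (10% of 20 units are strongly active)
```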

• This diagram illustrates the connectivity of the network:

• Each hidden unit is connected to 50% of the input units; there are also recurrent connections from each hidden unit to all of the other hidden units

• Weights are symmetric

• Initial weight values were set randomly

• I trained up the network on the following 8 patterns:

• In each pattern, the bottom 16 rows encode prototypical features that tend to be shared across patterns within a category; the top 8 rows encode item-specific features that are unique to each pattern.

• Each category has 3 “typical” items and one “atypical” item

• During training, the network studied typical patterns 90% of the time and it studied atypical patterns 10% of the time

• To save time, the networks you will be using have been pre-trained on the 8 patterns (by presenting them repeatedly, in an interleaved fashion)

• For some of the simulations, you will be using a network that was trained with (purely) Hebbian learning

• For other simulations, you will be using a network that was trained with a combination of error-driven (GeneRec) and Hebbian learning. Training of this network used a three-phase design:

• First, there was a “prediction” (minus) phase where a partial pattern was presented

• Second, there was an “outcome” (plus) phase where the full version of the pattern was presented

• Finally, there was a nothing phase where the input pattern was erased (but not the hidden pattern)

• Error-driven learning occurred based on the difference in activity between the minus and plus patterns, and based on the difference in activity between the plus and nothing patterns

• When you get to the computer room, the simulation should already be open on the computer (some of you may have to double up; I think there are slightly fewer computers than students) and there will be a handout on the desk explaining what to do

• You can proceed at your own pace

• I will be there to answer questions (about the lecture and about the computer exercises) and my two grad students Ehren Newman and Sean Polyn will also be there to answer questions.

Your Helpers

Ehren Sean me