
Chapter 6: Associative Models

Introduction

• Associating patterns which are
  – similar,
  – contrary,
  – in close proximity (spatial),
  – in close succession (temporal),
  – or in other relations

• Associative recall/retrieval – evoke associated patterns
  – recall a pattern by part of it: pattern completion
  – evoke/recall with noisy patterns: pattern correction

• Two types of associations. For two patterns s and t:
  – hetero-association (s != t): relating two different patterns
  – auto-association (s = t): relating parts of a pattern with other parts

• Example
  – Recall a stored pattern from a noisy input pattern
  – Using the weights that capture the association
  – Stored patterns are viewed as "attractors", each with its own "attraction basin"
  – This type of NN is often called "associative memory" (recall by association, not by explicit indexing/addressing)

• Architectures of NN associative memory
  – single layer: for auto- (and some hetero-) associations
  – two layers: for bidirectional associations

• Learning algorithms for AM
  – Hebbian learning rule and its variations
  – gradient descent
  – Non-iterative: one-shot learning for simple associations
  – Iterative: for better recalls

• Analysis
  – storage capacity (how many patterns can be remembered correctly in an AM; each is called a memory)
  – learning convergence

Training Algorithms for Simple AM

• Network structure: single layer
  – one output layer of non-linear units (y_1, ..., y_m) and one input layer (x_1, ..., x_n), connected by weights w_{j,i}
  [Figure: single-layer network mapping input pattern i_p = (i_{p,1}, ..., i_{p,n}) to desired output d_p = (d_{p,1}, ..., d_{p,m})]

• Goal of learning:
  – to obtain a set of weights W = {w_{j,i}}
  – from a set of training pattern pairs (i_p, d_p)
  – such that when i_p is applied to the input layer, d_p is computed at the output layer, i.e., for all training pairs (i_p, d_p):  d_p = W i_p  or  d_p = f(W i_p)

Hebbian Rule

• Algorithm (bipolar patterns):
  – Sign function for the output nodes
  – For each training sample (i_p, d_p):
      Δw_{j,k} = i_{p,k} * d_{p,j}
    w_{j,k} increases if i_{p,k} and d_{p,j} have the same sign
  – If w_{j,k} = 0 initially, then after updates for all P training patterns:
      w_{j,k} = Σ_{p=1..P} i_{p,k} * d_{p,j}
    = (# of training pairs in which i_{p,k} and d_{p,j} have the same sign) − (# of training pairs in which they have different signs)
  – Instead of obtaining W by iterative updates, it can be computed from the training set by summing the outer products of d_p and i_p over all P samples.

Associative Memories

• Compute W as the sum of outer products of all training pairs (i_p, d_p):

    W = Σ_{p=1..P} d_p i_p^T,   i.e.   w_{j,k} = Σ_{p=1..P} d_{p,j} i_{p,k}

  where d_p i_p^T is the m x n matrix whose (j, k) entry is d_{p,j} i_{p,k}.
  Note: the outer product of two vectors is a matrix; the j-th row of W is the weight vector for the j-th output node.

• Computing W involves 3 nested loops over p, j, k (the order of p is irrelevant):

    p = 1 to P    /* for every training pair */
      j = 1 to m    /* for every row in W */
        k = 1 to n    /* for every element k in row j */
          w_{j,k} := w_{j,k} + d_{p,j} * i_{p,k}
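A minimal NumPy sketch of this computation (the two training pairs and variable names are my own, for illustration): it builds W with the triple loop above and again as a sum of outer products, and checks that the two agree.

import numpy as np

# Hypothetical bipolar training pairs (i_p, d_p); any P pairs would do.
I = np.array([[ 1,  1, -1, -1],
              [-1,  1,  1, -1]])          # P x n input patterns
D = np.array([[ 1, -1],
              [-1,  1]])                  # P x m desired output patterns

P, n = I.shape
m = D.shape[1]

# Triple-loop version of the Hebbian rule: w_{j,k} += d_{p,j} * i_{p,k}
W_loop = np.zeros((m, n))
for p in range(P):
    for j in range(m):
        for k in range(n):
            W_loop[j, k] += D[p, j] * I[p, k]

# Equivalent one-shot computation: W = sum_p d_p i_p^T
W_outer = sum(np.outer(D[p], I[p]) for p in range(P))

assert np.array_equal(W_loop, W_outer)
print(W_outer)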

• Does this method provide a good association?
  – Recall with training samples (after the weights are learned or computed)
  – Apply i_l to one layer and hope that d_l appears on the other, e.g. sign(W i_l) = d_l
  – May not always succeed (each weight contains some information from all samples):

      W i_l = (Σ_{p=1..P} d_p i_p^T) i_l = Σ_{p=1..P} d_p (i_p^T i_l)
            = d_l (i_l^T i_l)  +  Σ_{p != l} d_p (i_p^T i_l)
              principal term      cross-talk term

• The principal term gives the association between i_l and d_l.
• The cross-talk term represents interference between (i_l, d_l) and the other training pairs. When cross-talk is large, i_l will recall something other than d_l.
• If all sample inputs i_p are orthogonal to each other, then i_p^T i_l = 0 for p != l, so no sample other than (i_l, d_l) contributes to the result (cross-talk = 0).
• There are at most n orthogonal vectors in an n-dimensional space.
• Cross-talk increases when P increases.
• How many arbitrary training pairs can be stored in an AM?
  – Can it be more than n (allowing some non-orthogonal patterns while keeping the cross-talk terms small)?
  – Storage capacity (more later)
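The decomposition can be checked numerically. The sketch below (random patterns; the sizes are arbitrary) splits W i_l into the principal and cross-talk terms and reports whether recall of d_l survives the interference.

import numpy as np

rng = np.random.default_rng(0)
n, m, P = 16, 8, 4
# Hypothetical random bipolar training pairs.
I = rng.choice([-1, 1], size=(P, n))
D = rng.choice([-1, 1], size=(P, m))

W = sum(np.outer(D[p], I[p]) for p in range(P))

l = 0
principal = D[l] * (I[l] @ I[l])                            # ||i_l||^2 d_l
crosstalk = sum(D[p] * (I[p] @ I[l]) for p in range(P) if p != l)

assert np.array_equal(W @ I[l], principal + crosstalk)
# Recall succeeds iff the cross-talk never flips the sign of the principal term.
print("correct recall:", np.array_equal(np.sign(W @ I[l]), D[l]))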

Example of Hetero-Associative Memory

• Binary pattern pairs i : d with |i| = 4 and |d| = 2.
• Net weighted input to the output units: y_j = Σ_k x_k w_{j,k}
• Activation function: threshold
    S(y_j) = 1 if y_j > 0,  0 if y_j <= 0
• Weights are computed by the Hebbian rule (sum of outer products of all training pairs):
    W = Σ_{p=1..4} d_p i_p^T
• 4 training samples:

    Training samples     i_p           d_p
    p = 1                (1 0 0 0)     (1, 0)
    p = 2                (1 1 0 0)     (1, 0)
    p = 3                (0 0 0 1)     (0, 1)
    p = 4                (0 0 1 1)     (0, 1)

• Computing the weights:

    d_1 i_1^T = [1 0 0 0]     d_2 i_2^T = [1 1 0 0]
                [0 0 0 0]                 [0 0 0 0]

    d_3 i_3^T = [0 0 0 0]     d_4 i_4^T = [0 0 0 0]
                [0 0 0 1]                 [0 0 1 1]

    W = [2 1 0 0]
        [0 0 1 2]

• All 4 training inputs have correct recall. For example, x = (1 0 0 0):

    y = W x = [2 1 0 0] (1 0 0 0)^T = (2, 0)
              [0 0 1 2]
    S(y) = (1, 0), a correct recall

• x = (0 1 0 0) (similar to i_1 and i_2):

    y = W x = (1, 0)
    S(y) = (1, 0), recalls d_1

• x = (0 1 1 0) (not sufficiently similar to any training input):

    y = W x = (1, 1)
    S(y) = (1, 1), not a stored pattern
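A short NumPy sketch reproducing this example (the recall helper is my own naming):

import numpy as np

# Training pairs from the example: |i| = 4, |d| = 2, binary patterns.
I = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]])
D = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# Hebbian rule: W = sum_p d_p i_p^T
W = sum(np.outer(d, i) for i, d in zip(I, D))
print(W)                        # [[2 1 0 0], [0 0 1 2]]

def recall(x):
    y = W @ x                   # net input to the two output units
    return (y > 0).astype(int)  # threshold activation S

print(recall(np.array([1, 0, 0, 0])))   # (1, 0)  correct recall
print(recall(np.array([0, 1, 0, 0])))   # (1, 0)  recalls d_1
print(recall(np.array([0, 1, 1, 0])))   # (1, 1)  not a stored pattern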

Example of Auto-Associative Memory

• Same as hetero-associative nets, except d_p = i_p for all p = 1, ..., P
• Used to recall a pattern from a noisy or incomplete version of it (pattern completion / pattern recovery).
• A single pattern i = (1, 1, 1, -1) is stored (weights computed by the Hebbian rule – outer product):

    W = [ 1  1  1 -1]
        [ 1  1  1 -1]
        [ 1  1  1 -1]
        [-1 -1 -1  1]

• Recall by W:

    training pattern   ( 1  1  1 -1) W = ( 4  4  4 -4) → ( 1  1  1 -1)
    noisy pattern      (-1  1  1 -1) W = ( 2  2  2 -2) → ( 1  1  1 -1)
    missing info       ( 0  0  1 -1) W = ( 2  2  2 -2) → ( 1  1  1 -1)
    more noisy         (-1 -1  1 -1) W = ( 0  0  0  0)   not recognized

• W is always a symmetric matrix.
• The diagonal elements (Σ_p (i_{p,k})^2) will dominate the computation when a large number of patterns are stored.
• When P is large, W is close to an identity matrix. This causes recall output = recall input, which may not be any stored pattern: the pattern correction power is lost.
• Replace the diagonal elements by zero:

    W' = [ 0  1  1 -1]
         [ 1  0  1 -1]
         [ 1  1  0 -1]
         [-1 -1 -1  0]

    ( 1  1  1 -1) W' = ( 3  3  3 -3) → ( 1  1  1 -1)
    (-1  1  1 -1) W' = ( 3  1  1 -1) → ( 1  1  1 -1)
    ( 0  0  1 -1) W' = ( 2  2  1 -1) → ( 1  1  1 -1)
    (-1 -1  1 -1) W' = ( 1  1 -1  1) → ???
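A small sketch of this auto-associative recall, comparing the full outer-product weights W with the zero-diagonal variant; the noisy and incomplete probes are the ones used above.

import numpy as np

i = np.array([1, 1, 1, -1])
W = np.outer(i, i)                # full Hebbian outer product
W0 = W - np.diag(np.diag(W))      # variant with zero diagonal

for x in ([1, 1, 1, -1],          # the stored pattern
          [-1, 1, 1, -1],         # a noisy version
          [0, 0, 1, -1]):         # incomplete version (missing information)
    x = np.array(x)
    print(x, np.sign(W @ x), np.sign(W0 @ x))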

Storage Capacity

• The number of patterns that can be correctly stored and recalled by a network.
• More patterns can be stored if they are not similar to each other (e.g., orthogonal).

• Non-orthogonal example: storing (1 -1 1 1) and (1 1 1 1) (zero-diagonal Hebbian weights):

    W = [0 0 2 2]
        [0 0 0 0]
        [2 0 0 2]
        [2 0 2 0]

    (1 -1 1 1) W = (4 0 4 4) → (1 0 1 1): the pattern is not stored correctly.

• Orthogonal example: storing three mutually orthogonal patterns, e.g. (1 1 -1 -1), (-1 1 1 -1), (-1 1 -1 1):

    W = [ 0 -1 -1 -1]
        [-1  0 -1 -1]
        [-1 -1  0 -1]
        [-1 -1 -1  0]

    All three patterns can be correctly recalled.

• Adding one more orthogonal pattern, (1 1 1 1), the weight matrix becomes:

    W = [0 0 0 0]
        [0 0 0 0]
        [0 0 0 0]
        [0 0 0 0]

    The memory is completely destroyed!!!

• Theorem: an n x n network is able to store up to n − 1 mutually orthogonal (M.O.) bipolar vectors of dimension n, but not n such vectors.
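A sketch of this effect, assuming the three mutually orthogonal patterns listed above (any other valid choice behaves the same way): all three are recalled correctly, and adding the fourth orthogonal pattern (1 1 1 1) zeroes the weight matrix.

import numpy as np

def weights(patterns):
    """Hebbian weights with zero diagonal."""
    W = sum(np.outer(p, p) for p in patterns)
    np.fill_diagonal(W, 0)
    return W

# One valid choice of three mutually orthogonal bipolar patterns in 4 dimensions.
pats = [np.array([ 1,  1, -1, -1]),
        np.array([-1,  1,  1, -1]),
        np.array([-1,  1, -1,  1])]

W3 = weights(pats)
for p in pats:                                   # all three recalled correctly
    print(p, np.array_equal(np.sign(W3 @ p), p))

W4 = weights(pats + [np.array([1, 1, 1, 1])])    # add the 4th orthogonal pattern
print(W4)                                        # all zeros: the memory is destroyed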

Delta Rule

• Suppose the output node function S is differentiable.
• Minimize the square error
    E = Σ_{p=1..P} (d_p − S(y_p))^2
• Derive the weight update rule by the gradient descent approach:
    Δw_{j,k} = −∂E/∂w_{j,k} = (d_{p,j} − S(y_{p,j})) S'(y_{p,j}) i_{p,k}
• This works for arbitrary pattern mappings
  – Similar to Adaline
  – May have better performance than the strict Hebbian rule
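A sketch of the delta rule with tanh standing in for the differentiable output function S; the learning rate, iteration count, and toy patterns are assumptions.

import numpy as np

# Hypothetical bipolar training pairs; tanh plays the role of the differentiable S.
I = np.array([[ 1, -1,  1, -1],
              [ 1,  1, -1, -1]], dtype=float)
D = np.array([[ 1, -1],
              [-1,  1]], dtype=float)
eta, epochs = 0.1, 200                       # assumed learning rate / iterations

W = np.zeros((D.shape[1], I.shape[1]))
for _ in range(epochs):
    for i_p, d_p in zip(I, D):
        y = W @ i_p                          # net input of each output node
        # delta rule: dw_{j,k} = eta * (d_{p,j} - S(y_j)) * S'(y_j) * i_{p,k}
        grad = eta * (d_p - np.tanh(y)) * (1.0 - np.tanh(y) ** 2)
        W += np.outer(grad, i_p)

print(np.sign(W @ I.T).T)                    # rows should match the rows of D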

Least Squares (Widrow-Hoff) Rule

• Also minimizes the square error, but with step/sign node functions.
• Directly computes the weight matrix Ω from
  – I : the matrix whose columns are the input patterns i_p
  – D : the matrix whose columns are the desired output patterns d_p
• Since E is a quadratic function, it can be minimized by the Ω that solves the system of equations
    Ω (I I^T) = D I^T
• This leads to the normalized Hebbian solution Ω = D I^T (I I^T)^(-1).
• When I I^T is invertible, E will be minimized.
• If I I^T is not invertible, it always has a unique pseudo-inverse (I I^T)^+, and the weight matrix can then be computed as
    Ω = D I^T (I I^T)^+
• When all sample input patterns are orthogonal, this reduces (up to scaling) to the Hebbian solution W = D I^T.
• Does not work for auto-association: since D = I, Ω = I I^T (I I^T)^(-1) becomes the identity matrix.
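A sketch of this computation using NumPy's pseudo-inverse; the two training pairs are illustrative.

import numpy as np

# Columns of I are the input patterns i_p; columns of D the desired outputs d_p.
I = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1],
              [-1, -1]], dtype=float)        # n = 4, P = 2
D = np.array([[ 1, -1],
              [-1,  1]], dtype=float)        # m = 2, P = 2

# Least-squares weights: Omega = D I^T (I I^T)^+  (the pseudo-inverse covers the
# case where I I^T is singular, as it is here since P < n).
Omega = D @ I.T @ np.linalg.pinv(I @ I.T)

print(np.sign(Omega @ I))                    # columns should match the columns of D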

• Follow-up questions:
  – What would be the capacity of an AM if the stored patterns are not mutually orthogonal (say, random)?
  – Ability of pattern recovery and completion: how far off a pattern can be from a stored pattern and still recall the correct/stored pattern?
  – Suppose x is a stored pattern, the input x' is close to x, and x'' = S(Wx') is even closer to x than x'. What should we do?
• Feed back x'', and hope that iterations of feedback will lead to x?

Iterative Autoassociative Networks

• Example: stored pattern x = (1, 1, 1, -1), zero-diagonal weights

    W = [ 0  1  1 -1]
        [ 1  0  1 -1]
        [ 1  1  0 -1]
        [-1 -1 -1  0]

  Output units are threshold units.

  An incomplete recall input: x' = (1, 0, 0, 0)
    x' W = (0, 1, 1, -1) = x''
    x'' W = (3, 2, 2, -2) → (1, 1, 1, -1) = x

• In general: use the current output as the input of the next iteration
    x(0) = initial recall input
    x(I) = S(W x(I-1)),  I = 1, 2, ...
    until x(N) = x(K) for some K < N

• Dynamic system: state vector x(I)
  – If K = N-1, x(N) is a stable state (fixed point): f(W x(N)) = f(W x(N-1)) = x(N)
    • If x(K) is one of the stored patterns, then x(K) is called a genuine memory
    • Otherwise, x(K) is a spurious memory (caused by cross-talk/interference between genuine memories)
    • Each fixed point (genuine or spurious memory) is an attractor (with its own attraction basin)
  – If K != N-1, the system is in a limit cycle
    • The network will repeat x(K), x(K+1), ..., x(N) = x(K) when iteration continues.
    • Iteration will eventually stop because the total number of distinct states is finite (3^n) if threshold units are used.
  – If the patterns are continuous, the system may continue to evolve forever (chaos) if no such K exists.
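A sketch of this iterative recall procedure, using the stored pattern and zero-diagonal weights from the example above; the cycle/fixed-point test simply watches for a repeated state.

import numpy as np

def iterate(W, x0, max_iter=100):
    """Iterative recall x(I) = sign(W x(I-1)) until a repeated state appears."""
    seen, x = [tuple(x0)], np.array(x0)
    for _ in range(max_iter):
        x = np.sign(W @ x)
        if tuple(x) in seen:                 # fixed point or limit cycle reached
            return x, seen
        seen.append(tuple(x))
    return x, seen

i = np.array([1, 1, 1, -1])                  # the stored pattern from the example
W = np.outer(i, i)
np.fill_diagonal(W, 0)

x, trajectory = iterate(W, np.array([1, 0, 0, 0]))   # incomplete recall input
print(trajectory, "->", x)                   # (1,0,0,0) -> (0,1,1,-1) -> (1,1,1,-1)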

My Own Work: Turning a BP Net into an Auto-AM

• One possible reason for the small capacity of the HM is that it does not have hidden nodes.
• Train a feed-forward network (with hidden layers) by BP to establish pattern auto-association.
• Recall: feed the output back to the input layer, making it a dynamic system.
  [Figure: auto-association network (input – hidden – output) with the output fed back to the input layer]
• Shown that 1) it will converge, and 2) the stored patterns become genuine attractors.
• It can store many more patterns (seems O(2^n)).
• Its pattern completion/recovery capability decreases when n increases (the number of spurious attractors seems to increase exponentially).
• Call this model BPRR.
  [Figure: hetero-association version with two stages: input1 – hidden1 – output1 and input2 – hidden2 – output2]

• Example
  – n = 10, network is (10 – 20 – 10)
  – Varying the number of stored memories (8 – 128)
  – Using all 1024 patterns for recall; a recall is correct if one of the stored memories is recalled
  – Two versions of preparing the training samples:
    • (X, X), where X is one of the stored memories
    • supplemented with (X', X), where X' is a noisy version of X

    # stored     correct recall     correct recall     # spurious
    memories     w/o relaxation     with relaxation    attractors
    8            202 (835)          861 (1024)         6 (0)
    16           49 (454)           341 (980)          60 (5)
    32           39 (371)           191 (928)          197 (17)
    64           65 (314)           126 (662)          515 (144)
    128          128 (351)          136 (561)          825 (254)

    Numbers in parentheses are for learning with the supplementary samples (X', X).

Hopfield Models

• A single-layer network with full connection
  – each node serves as both an input and an output unit
  – node values are iteratively updated, based on the weighted inputs from all other nodes, until stabilized
• More than an AM
  – Other applications, e.g., combinatorial optimization
• Different forms: discrete and continuous
• Major contribution of John Hopfield to NN
  – Treating a network as a dynamic system
  – Introduced the notions of energy function and attractors into NN research

Discrete Hopfield Model (DHM) as AM

• Architecture:
  – single layer (units serve as both input and output)
  – nodes are threshold units (binary or bipolar)
  – weights: fully connected, symmetric, often with zero diagonal:
      w_{ij} = w_{ji},  w_{ii} = 0
  – external inputs I_i, which may be transient or permanent

• Weights: to store patterns i_p, p = 1, 2, ..., P
  – bipolar:
      w_{jk} = Σ_p i_{p,j} i_{p,k}  for j != k,   w_{kk} = 0
    (same as the Hebbian rule, with zero diagonal)
  – binary:
      w_{jk} = Σ_p (2 i_{p,j} − 1)(2 i_{p,k} − 1)  for j != k,   w_{kk} = 0
    i.e., convert i_p to bipolar when constructing W.

• Recall
  – Use an input vector to recall a stored vector
  – Each time, randomly select a unit for update
  – Periodically check for convergence

• Notes:
  1. Theoretically, to guarantee convergence of the recall process (avoid oscillation), only one unit is allowed to update its activation at a time during the computation (asynchronous model).
  2. However, the system may converge faster if all units are allowed to update their activations at the same time (synchronous model).
  3. Each unit should have an equal probability of being selected.
  4. Convergence test: x_k(current) = x_k(previous) for all k.
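A sketch of asynchronous recall in a DHM, assuming the external input is transient (used only as the initial state) and breaking ties at net = 0 toward +1 (one possible choice):

import numpy as np

def hopfield_weights(patterns):
    """Hebbian rule with zero diagonal (bipolar patterns)."""
    W = sum(np.outer(p, p) for p in patterns)
    np.fill_diagonal(W, 0)
    return W

def recall(W, x0, rng, max_sweeps=100):
    x = np.array(x0, dtype=int)
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(len(x)):    # asynchronous: one unit at a time
            net = W[k] @ x
            new = 1 if net >= 0 else -1      # tie-breaker net = 0 -> +1 (one choice)
            if new != x[k]:
                x[k], changed = new, True
        if not changed:                      # convergence: a full sweep with no change
            return x
    return x

rng = np.random.default_rng(1)
W = hopfield_weights([np.array([1, 1, 1, 1]), np.array([-1, -1, -1, -1])])
print(recall(W, [1, 1, 1, -1], rng))         # corrupted input -> (1 1 1 1)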

• Example:
  – A 4-node network stores 2 patterns, (1 1 1 1) and (-1 -1 -1 -1)
  – Weights: computed from the two patterns by the Hebbian rule

• Corrupted input pattern: (1 1 1 -1)
    node selected                                   output pattern
    node 2: no change                               (1 1 1 -1)
    node 4: change state from -1 to 1               (1 1 1 1)
  No more change of state will occur; the correct pattern is recovered.

• Equidistant input: (1 1 -1 -1)
    node 2: net = 0, no change                      (1 1 -1 -1)
    node 3: net = 0, change state from -1 to 1      (1 1 1 -1)
    node 4: net = 0, change state from -1 to 1      (1 1 1 1)
  No more change of state will occur; the correct pattern is recovered.
  If a different tie-breaker is used (if net = 0, output -1), the stored pattern (-1 -1 -1 -1) will be recalled instead.
  In more complicated situations, different orders of node selection may lead the system to converge to different attractors.

• Missing input element: (1 0 -1 -1)
    node 2: change state to -1                      (1 -1 -1 -1)
    node 1: net = -3, change state to -1            (-1 -1 -1 -1)
  No more change of state will occur; the correct pattern is recovered.

• Missing input elements: (0 0 0 -1)
  The correct pattern (-1 -1 -1 -1) is recovered. This is because the AM has only 2 attractors, (1 1 1 1) and (-1 -1 -1 -1). When spurious attractors exist (with more memories), pattern completion may be incorrect (the state may be attracted to a wrong, spurious attractor).

Convergence Analysis of DHM

• Two questions:
  1. Will a Hopfield AM converge (stop) for any given recall input?
  2. Will a Hopfield AM converge to the stored pattern that is closest to the recall input?
• Hopfield provides an answer to the first question
  – by introducing an energy function into this model
  – No satisfactory answer to the second question so far.

• Energy function:
  – A notion from thermodynamic physical systems: the system has a tendency to move toward a lower energy state.
  – Also known as a Lyapunov function in mathematics, after the Lyapunov theorem for the stability of a system of differential equations.

• In general, the energy function E(x(t)), where x(t) is the state of the system at step (time) t, must satisfy two conditions:
  1. E(t) is bounded from below: E(t) >= c
  2. E(t) is monotonically non-increasing with time:
       ΔE(t) = E(t+1) − E(t) <= 0   (in the continuous version: E'(t) <= 0)
  – Therefore, if the system's state change is associated with such an energy function, its energy will continuously be reduced until it reaches a state with a (locally) minimum energy.
  – Each (locally) minimum energy state is an attractor.
  – Hopfield shows his model has such an energy function; the memories (patterns) stored in the DHM are attractors (other attractors are spurious).
  – Related to the gradient descent approach.

• The energy function for the DHM:
  – Assume the input vector is close to one of the attractors.

      E(t) = −a Σ_j Σ_k w_{jk} x_j(t) x_k(t) − b Σ_k I_k x_k(t),   often with a = 1/2, b = 1

    Since all values in E are finite, E is finite and thus bounded from below.

• Convergence
  – Let the k-th node be updated at time t; the system's energy change is

      ΔE(t) = E(t+1) − E(t) = −(2a Σ_j w_{kj} x_j(t) + b I_k) Δx_k(t)

  – When choosing a = 1/2, b = 1:

      ΔE(t) = −net_k(t) Δx_k(t)

  – Since the state change (from -1 to +1 or from +1 to -1) depends on whether net_k >= 0 or net_k < 0, Δx_k(t) always has the same sign as net_k(t), and therefore ΔE(t) <= 0 (strictly negative whenever the state actually changes).
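A sketch that checks this numerically: it stores a few random patterns, performs one asynchronous sweep, and asserts that the energy (with a = 1/2, b = 1 and zero external input) never increases. The sizes and seed are arbitrary.

import numpy as np

def energy(W, x, I_ext):
    return -0.5 * x @ W @ x - I_ext @ x      # a = 1/2, b = 1

rng = np.random.default_rng(2)
n, P = 20, 3
patterns = [rng.choice([-1, 1], size=n) for _ in range(P)]
W = sum(np.outer(p, p) for p in patterns)
np.fill_diagonal(W, 0)

I_ext = np.zeros(n)                          # external input treated as transient/zero
x = rng.choice([-1, 1], size=n)              # arbitrary recall input
E_prev = energy(W, x, I_ext)
for k in rng.permutation(n):                 # one asynchronous sweep
    net = W[k] @ x + I_ext[k]
    x[k] = 1 if net >= 0 else -1
    E = energy(W, x, I_ext)
    assert E <= E_prev + 1e-9                # the energy never increases
    E_prev = E
print("final energy:", E_prev)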

• Example:
  – A 4-node network stores 3 patterns: (1 1 -1 -1), (1 1 1 1) and (-1 -1 1 1)
  – Weights (Hebbian rule, averaged over the 3 patterns, zero diagonal):

      W = [   0     1   -1/3  -1/3]
          [   1     0   -1/3  -1/3]
          [-1/3  -1/3     0     1 ]
          [-1/3  -1/3     1     0 ]

• Corrupted input pattern: (-1 -1 -1 -1)
  – If node 4 is selected:
      (-1/3 -1/3 1 0) · (-1 -1 -1 -1) + (-1) = 1/3 + 1/3 − 1 − 1 = −4/3
    → no change of state for node 4
  – Same for all other nodes; the net stabilizes at (-1 -1 -1 -1)
  – A spurious state/attractor is recalled

• For input pattern (-1 -1 -1 0):
  – If node 4 is selected first:
      (-1/3 -1/3 1 0) · (-1 -1 -1 0) + 0 = 1/3 + 1/3 − 1 − 0 = −1/3
    → node 4 changes state to -1; then, as in the previous example, the network stabilizes at (-1 -1 -1 -1)
  – If node 3 is selected before node 4 and the input is transient, the net will stabilize at the state (-1 -1 1 1), a genuine attractor:
      (-1 -1 -1 0) → (-1 -1 1 0) → (-1 -1 1 1) → (-1 -1 1 1) → ...

• Comments:
  1. Why it converges:
     • Each time, E either stays unchanged or decreases by some amount.
     • E is bounded from below.
     • There is a limit to how far E may decrease, so after a finite number of steps E will stop decreasing, no matter which unit is selected for update.
  2. The state the system converges to is a stable state: it will return to this state after a small perturbation. It is called an attractor (with its own attraction basin).
  3. The error function of BP learning is another example of an energy/Lyapunov function, because
     • it is bounded from below (E >= 0), and
     • it is monotonically non-increasing (W is updated along the gradient descent of E).

Capacity Analysis of DHM

• P: the maximum number of random patterns of dimension n that can be stored in a DHM of n nodes.
• Hopfield's observation: P ≈ 0.15 n, i.e., P/n ≈ 0.15
• Theoretical analysis:
    P ≈ n / (2 log2 n);  capacity C = P/n is on the order of 1 / (4 ln n)
• P/n decreases when n increases, because a larger n leads to more interference between the stored patterns (stronger cross-talk).
• There is some work on modifying the HM to increase its capacity to close to n, in which W is trained (not computed by the Hebbian rule).
• Another limitation: full connectivity leads to excessive numbers of connections for patterns of large dimension.

Continuous Hopfield Model (CHM)

• A different (the original) formulation from the text.
• Architecture:
  – continuous node outputs, and continuous time
  – fully connected with symmetric weights: w_{ij} = w_{ji}, w_{ii} = 0
  – internal activation u_i with
      du_i(t)/dt = net_i(t) = Σ_{j=1..n} w_{ij} x_j(t)
  – output (state)
      x_i(t) = f(u_i(t))
    where f is a sigmoid function to ensure binary/bipolar output; e.g., for bipolar output use the hyperbolic tangent function
      f(x) = tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

• Computation: all units change their outputs (states) at the same time, based on the states of all the others.
  – Compute the net input:
      net_i(t) = Σ_{j=1..n} w_{ij} x_j(t)
  – Compute the internal activation u_i(t) by a first-order Taylor expansion:
      u_i(t + dt) = u_i(t) + (du_i(t)/dt) dt + ... ≈ u_i(t) + net_i(t) dt
  – Compute the output:
      x_i(t) = f(u_i(t))
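A sketch of this update scheme with Euler integration and a tanh output function; the time step, initial activations, and stored pattern are assumptions.

import numpy as np

rng = np.random.default_rng(4)
pattern = np.array([1, 1, -1, -1, 1, -1], dtype=float)
W = np.outer(pattern, pattern)
np.fill_diagonal(W, 0)

dt = 0.05                                     # assumed integration step
u = rng.normal(scale=0.1, size=len(pattern))  # small random internal activations
for _ in range(200):
    x = np.tanh(u)                            # output x_i = f(u_i)
    net = W @ x                               # net_i = sum_j w_ij x_j
    u = u + net * dt                          # u_i(t+dt) ~ u_i(t) + net_i(t) dt
print(np.round(np.tanh(u)))                   # settles at the stored pattern or its negation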

• Convergence:
  – define an energy function
  – show that if the state update rule is followed, the system's energy always decreases, by showing dE/dt <= 0

      E = −(1/2) Σ_i Σ_j x_i w_{ij} x_j

  – By the chain rule (u_i depends on t):

      dE/dt = Σ_i (∂E/∂x_i)(dx_i/du_i)(du_i/dt)

      ∂E/∂x_i = −Σ_j w_{ij} x_j = −net_i     (x_i appears twice in E)
      dx_i/du_i = f'(u_i) >= 0
      du_i/dt = net_i

    Therefore,
      dE/dt = −Σ_i f'(u_i(t)) (net_i(t))^2 <= 0

  – dE/dt asymptotically approaches zero when f(u_i) approaches 1 or 0 (−1 for bipolar output) for all i.
  – The system reaches a local minimum energy state.
  – Gradient descent: du_i/dt = −∂E/∂x_i
  – Instead of jumping from corner to corner of a hypercube as the discrete HM does, the continuous HM moves in the interior of the hypercube, along the gradient descent trajectory of the energy function, to a local minimum energy state.

Bidirectional AM (BAM)

• Architecture:
  – Two layers of non-linear units: an X(1)-layer and an X(2)-layer, of different dimensions
  – Units: discrete threshold, or continuous sigmoid (and can be either binary or bipolar)
• Weights:
  – Hebbian rule (sum of outer products of the training pairs):
      W = Σ_{p=1..P} x_p^(1) (x_p^(2))^T
• Recall: bidirectional; present a pattern to either layer, then pass activations back and forth through W and W^T until both layers stabilize.
• Analysis (discrete case)
  – Energy function (also a Lyapunov function):
      L = −0.5 ( X^(1)T W X^(2) + X^(2)T W^T X^(1) )
        = −X^(1)T W X^(2)
        = −Σ_{k=1..n} Σ_{j=1..m} x_k^(1) w_{kj} x_j^(2)
  • The proof is similar to that for the DHM.
  • It holds for both synchronous and asynchronous update (for the DHM it holds only with asynchronous update, due to the lateral connections).
  – Storage capacity: O(max(n, m))
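A sketch of bidirectional recall in a discrete BAM, using the Hebbian weights above; the pattern pairs, the fixed number of back-and-forth steps, and the zero-input tie handling are my own choices.

import numpy as np

def threshold(net, prev):
    """Bipolar threshold; keep the previous state when the net input is zero."""
    return np.where(net > 0, 1, np.where(net < 0, -1, prev))

# Hypothetical pattern pairs: X(1)-layer has 6 units, X(2)-layer has 4 units.
A = np.array([[ 1, -1,  1, -1,  1, -1],
              [ 1,  1, -1, -1,  1,  1]])
B = np.array([[ 1,  1, -1, -1],
              [-1,  1,  1, -1]])

W = sum(np.outer(a, b) for a, b in zip(A, B))     # W = sum_p x_p^(1) x_p^(2)T

def bam_recall(x1, x2, steps=10):
    for _ in range(steps):
        x2 = threshold(W.T @ x1, x2)              # X(1) -> X(2)
        x1 = threshold(W @ x2, x1)                # X(2) -> X(1)
    return x1, x2

noisy = np.array([1, -1, 1, -1, 1, 1])            # A[0] with its last bit flipped
x1, x2 = bam_recall(noisy, np.zeros(4, dtype=int))
print(x1, x2)                                     # expected: A[0] and B[0]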