Self Organization: Hebbian Learning CS/CMPE 333 – Neural Networks



Introduction

So far, we have studied neural networks that learn from their environment in a supervised manner

Neural networks can also learn in an unsupervised manner. This is also known as self-organized learning

Self-organized learning discovers significant features or patterns in the input data through general rules that operate locally

Self-organizing networks typically consist of two layers with feedforward connections and elements to facilitate ‘local’ learning


Self-Organization

“Global order can arise from local interactions” – Turing (1952)

An input signal produces certain activity patterns in the network, and these patterns in turn modify the weights (a feedback loop)

Principles of self-organization

1. Modifications in weights tend to self-amplify

2. Limitation of resources leads to competition and to selection of the most active synapses and disregard of the less active ones

3. Modifications in weights tend to cooperate


Hebbian Learning

A self-organizing principle was proposed by Hebb in 1949 in the context of biological neurons

Hebb’s principle: when a neuron repeatedly excites another neuron, the threshold of the latter neuron is decreased, or the synaptic weight between the neurons is increased, in effect increasing the likelihood that the second neuron will fire

Hebbian learning rule

Δw_ji = η y_j x_i

There is no desired or target signal required in the Hebbian rule, hence it is unsupervised learning. The update rule is local to the weight
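
As a minimal sketch, the rule for a single linear neuron can be written in NumPy as follows (the learning rate and data here are illustrative, not from the slides):

```python
import numpy as np

def hebbian_update(w, x, eta=0.01):
    """One Hebbian step for a single linear neuron: Δw_i = η y x_i."""
    y = np.dot(w, x)          # post-synaptic activity y = w^T x
    return w + eta * y * x    # change proportional to pre- and post-synaptic activity

rng = np.random.default_rng(0)
w = rng.normal(size=3)        # illustrative 3-input neuron
x = rng.normal(size=3)
w = hebbian_update(w, x)      # note: no desired/target signal is needed anywhere
```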


Hebbian Update

Consider the update of a single weight w (x and y are the pre- and post-synaptic activities)

w(n + 1) = w(n) + η x(n) y(n)

For a linear activation function, y(n) = w(n) x(n), so

w(n + 1) = w(n)[1 + η x^2(n)]

The weights increase without bound. If the initial weight is negative, it grows ever more negative; if it is positive, it grows ever more positive

Hebbian learning is intrinsically unstable, unlike error-correction learning with the BP algorithm
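
A quick numerical illustration of this instability for a single weight, with an arbitrary learning rate and random inputs:

```python
import numpy as np

eta, w = 0.1, -0.5                    # a negative initial weight
x_samples = np.random.default_rng(1).normal(size=200)

for x in x_samples:
    w = w * (1.0 + eta * x**2)        # linear neuron: y = w x, so w grows by (1 + η x²) each step

print(w)                              # large negative number: the sign is kept, the magnitude diverges
```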


Geometric Interpretation of Hebbian Learning

Consider a single linear neuron with p inputs

y = w^T x = x^T w

and

Δw = η[x_1 y  x_2 y  …  x_p y]^T

The dot product can be written as

y = |w||x| cos(α), where α is the angle between the vectors x and w

If α is zero (x and w are ‘close’), y is large. If α is 90° (x and w are ‘far’), y is zero


Similarity Measure

A network trained with Hebbian learning creates a similarity measure (the inner product) in its input space according to the information contained in the weights. The weights capture (memorize) the information in the data during training

During operation, when the weights are fixed, a large output y signifies that the present input is "similar" to the inputs x that created the weights during training

Similarity measures: Hamming distance, correlation


Hebbian Learning as Correlation Learning

Hebbian learning (pattern-by-pattern mode)

Δw(n) = η x(n) y(n) = η x(n) x^T(n) w(n)

Using batch mode

Δw = η[Σ_{n=1}^{N} x(n) x^T(n)] w(0)

The term Σ_{n=1}^{N} x(n) x^T(n) is a sample approximation of the auto-correlation of the input data. Thus Hebbian learning can be thought of as learning the auto-correlation of the input space

Correlation is a well-known operation in signal processing and statistics. In particular, it completely describes signals defined by Gaussian distributions. Applications in signal processing
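
A short NumPy sketch of this identity, using made-up data and an arbitrary η; the batch update is just the summed auto-correlation applied to the initial weights:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))              # N = 500 input patterns, p = 3 inputs
w0 = rng.normal(size=3)                    # initial weight vector w(0)
eta = 0.01

R = X.T @ X / len(X)                       # sample auto-correlation matrix of the input
delta_w = eta * (X.T @ X) @ w0             # batch Hebbian update: η [Σ_n x(n) x^T(n)] w(0)
```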


Oja’s Rule

The simple Hebbian rule causes the weights to increase (or decrease) without bounds

The weights need to be normalized to one as

w_ji(n + 1) = [w_ji(n) + η x_i(n) y_j(n)] / √(Σ_i [w_ji(n) + η x_i(n) y_j(n)]^2)

This equation effectively imposes the constraint that the sum of squared weights at each neuron equals 1 (unit norm)

Oja approximated the normalization (for small η) as:

w_ji(n + 1) = w_ji(n) + η y_j(n)[x_i(n) - y_j(n) w_ji(n)]

This is Oja’s rule, a normalized form of the Hebbian rule. It involves a ‘forgetting term’ that prevents the weights from growing without bounds
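
A minimal sketch of Oja’s rule for a single neuron, on synthetic data with an illustrative learning rate:

```python
import numpy as np

def oja_update(w, x, eta=0.005):
    """One step of Oja's rule for a single linear neuron."""
    y = np.dot(w, x)                   # y = w^T x
    return w + eta * y * (x - y * w)   # Hebbian term plus the 'forgetting' term -η y² w

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 3)) @ np.diag([2.0, 1.0, 0.5])   # anisotropic synthetic inputs
w = 0.1 * rng.normal(size=3)

for x in X:
    w = oja_update(w, x)

print(np.linalg.norm(w))               # stays bounded, approaching 1
```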


Oja’s Rule – Geometric Interpretation

The simple Hebbian rule aligns the weight vector with the direction of largest variance in the input data. However, the magnitude of the weight vector increases without bound

Oja’s rule has a similar interpretation; the normalization only changes the magnitude, while the direction of the weight vector is the same. The magnitude is equal to one

Oja’s rule converges asymptotically, unlike the Hebbian rule, which is unstable


The Maximum Eigenfilter

A linear neuron trained with Oja’s rule produces a weight vector that converges to the principal eigenvector of the input auto-correlation matrix, and the variance of its output equals the largest eigenvalue

A linear neuron trained with Oja’s rule therefore solves the following eigenvalue problem

R e_1 = λ_1 e_1

R = auto-correlation matrix of the input data
e_1 = eigenvector corresponding to the largest eigenvalue, which corresponds to the weight vector w obtained by Oja’s rule
λ_1 = largest eigenvalue, which corresponds to the variance of the network’s output
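
An illustrative check of this claim, assuming synthetic zero-mean data and the same single-neuron Oja update as above:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 3)) @ np.diag([2.0, 1.0, 0.5])   # synthetic zero-mean inputs

w = 0.1 * rng.normal(size=3)
eta = 0.005
for x in X:
    y = w @ x
    w = w + eta * y * (x - y * w)       # Oja's rule

R = X.T @ X / len(X)                    # sample auto-correlation matrix
vals, vecs = np.linalg.eigh(R)          # eigenvalues in ascending order
e1 = vecs[:, -1]                        # principal eigenvector e_1

print(abs(w @ e1))                      # close to 1: w is (anti)parallel to e_1
print(np.var(X @ w), vals[-1])          # output variance is approximately λ_1
```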


Principal Component Analysis (1)

Oja’s rule, when applied to a single neuron, creates a principal component in the input space in the form of the weight vector

How can we find other components in the input space with significant variance?

In statistics, PCA is used to obtain the significant components of data in the form of orthogonal principal axes. PCA is also known as K-L (Karhunen-Loève) filtering in signal processing. It was first proposed in 1901, with later developments in the 1930s, 1940s and 1960s.

A Hebbian network with Oja’s rule can perform PCA


Principal Component Analysis (2)

PCA: Consider a set of vectors x with zero mean and unit variance. There exists an orthogonal transformation y = Q^T x such that the covariance matrix of y is Λ = E[y y^T]

Λ_ij = λ_i if i = j and Λ_ij = 0 otherwise (diagonal matrix)

λ_1 > λ_2 > … > λ_p are the eigenvalues of the covariance matrix of x, C = E[x x^T]

The columns of Q are the corresponding eigenvectors

The components of y are the principal components; the first has the maximum variance, followed by the others in decreasing order
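
A minimal NumPy sketch of this transformation on a synthetic data set; the names (X, Q, vals) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)                    # zero-mean the data

C = X.T @ X / len(X)                      # sample covariance matrix C = E[x x^T]
vals, Q = np.linalg.eigh(C)               # eigenvalues ascending, columns of Q are eigenvectors
order = np.argsort(vals)[::-1]            # reorder so that λ_1 > λ_2 > ... > λ_p
vals, Q = vals[order], Q[:, order]

Y = X @ Q                                 # y = Q^T x for every sample
print(np.var(Y, axis=0))                  # component variances ≈ λ_1, λ_2, λ_3
```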


PCA – Example




Hebbian Network for PCA

Procedure

1. Use Oja’s rule to find the principal component
2. Project the data orthogonal to the principal component
3. Use Oja’s rule on the projected data to find the next major component
4. Repeat the above for m <= p (m = desired components; p = input space dimensionality)

How to find the projection onto the orthogonal direction? Deflation method: subtract the principal component from the input (see the sketch below)

Oja’s rule can be modified to perform this operation; this gives Sanger’s rule
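
A sketch of the deflation step, with the principal direction taken from an eigendecomposition as a stand-in for an Oja-trained weight vector:

```python
import numpy as np

def deflate(X, w):
    """Subtract the component of every sample along the unit vector w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)       # x <- x - (w^T x) w

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.2])
C = X.T @ X / len(X)
w1 = np.linalg.eigh(C)[1][:, -1]        # stand-in for the Oja-trained principal direction

X2 = deflate(X, w1)                     # projected data: Oja's rule on X2 finds the next component
print(np.allclose(X2 @ w1, 0))          # True: no variance left along w1
```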


Sanger’s Rule

Sanger’s rule is a modification of Oja’s rule that implements the deflation method for PCA. Classical PCA involves matrix operations; Sanger’s rule implements PCA in an iterative fashion for neural networks

Consider p inputs and m outputs, where m < p

y_j(n) = Σ_{i=1}^{p} w_ji(n) x_i(n),  j = 1, …, m

and the update (Sanger’s rule)

Δw_ji(n) = η[y_j(n) x_i(n) - y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n)]
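
A compact NumPy sketch of Sanger’s rule in matrix form, with W stored as an m × p matrix (one row per output neuron); the data and learning rate are illustrative:

```python
import numpy as np

def sanger_update(W, x, eta=0.005):
    y = W @ x                                  # y_j = Σ_i w_ji x_i
    # lower-triangular mask keeps only the terms k = 1..j for each output j
    LT = np.tril(np.ones((len(y), len(y))))
    return W + eta * (np.outer(y, x) - (LT * np.outer(y, y)) @ W)

rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.3])
W = 0.1 * rng.normal(size=(2, 4))              # extract m = 2 components of p = 4

for x in X:
    W = sanger_update(W, x)

print(W @ W.T)                                 # approximately the identity: rows are orthonormal
```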


PCA for Feature Extraction

PCA is the optimal linear feature extractor. This means that there is no other linear system that is able to provide better features for reconstruction. PCA may or may not be the best preprocessing for pattern classification or recognition. Classification requires good discrimination, which PCA might not be able to provide.

Feature extraction: transform the p-dimensional input space to an m-dimensional space (m < p), such that the m dimensions capture the information with minimal loss

The error e in the reconstruction is given by:

e^2 = Σ_{i=m+1}^{p} λ_i
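
An illustrative numerical check of this error formula on synthetic data, keeping m = 2 of p = 4 components:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(20000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.3])
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
vals, Q = np.linalg.eigh(C)                  # eigenvalues ascending
vals, Q = vals[::-1], Q[:, ::-1]             # sort descending: λ_1 > ... > λ_p

m = 2
Qm = Q[:, :m]                                # keep the top m principal directions
X_hat = (X @ Qm) @ Qm.T                      # project and reconstruct
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(mse, vals[m:].sum())                   # the mean squared error matches Σ of discarded λ_i
```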


PCA for Data Compression

PCA identifies an orthogonal coordinate system for the input data such that the variance of the projection on the principal axis is largest, followed by the next major axis, and so on

By discarding some of the minor components, PCA can be used for data compression, where a p-dimensional input is encoded in an m-dimensional space (m < p)

Weights are computed by Sanger’s rule on typical inputs

The de-compressor (receiver) must know the weights of the network to reconstruct the original signal: x’ = W^T y
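
A sketch of the compress/decompress cycle, with the eigenvectors of the sample covariance standing in for Sanger-trained weights:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 8)) @ np.diag([4, 3, 2, 1, .5, .4, .3, .2])
C = X.T @ X / len(X)
W = np.linalg.eigh(C)[1][:, ::-1][:, :3].T   # stand-in for Sanger-trained weights, m = 3, p = 8

y = W @ X[0]                          # compressed code: 3 numbers instead of 8
x_rec = W.T @ y                       # receiver's reconstruction x' = W^T y
print(np.linalg.norm(X[0] - x_rec))   # small residual from the discarded minor components
```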


PCA for Classification (1)

Can PCA enhance classification? In general, no. PCA is good for reconstruction, not for feature discrimination or classification


PCA for Classification (2)


PCA – Some Remarks

Practical uses of PCA: data compression, cluster analysis, feature extraction, and preprocessing for classification/recognition (e.g. preprocessing for MLP training)

Biological basis: it is unlikely that the processing performed by biological neurons in, say, perception involves PCA only. More complex feature extraction processes are involved.


Anti-Hebbian Learning

Modifying the Hebbian rule as

Δw_ji(n) = -η x_i(n) y_j(n)

The anti-Hebbian rule finds the direction in space that has the minimum variance. In other words, it is the complement of the Hebbian rule

Anti-Hebbian learning performs de-correlation: it de-correlates the output from the input

The Hebbian rule is unstable, since it tries to maximize the variance. The anti-Hebbian rule, on the other hand, is stable and converges
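
A sketch of the anti-Hebbian update, with one added assumption: the weight vector is renormalized to unit length after every step (without this constraint, the pure rule simply drives w toward zero). Under that assumption it settles on the minimum-variance direction:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(5000, 3)) @ np.diag([3.0, 1.0, 0.3])

w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 0.002
for x in X:
    y = w @ x
    w -= eta * y * x             # anti-Hebbian step: Δw = -η x y
    w /= np.linalg.norm(w)       # added assumption: keep |w| = 1

vals, vecs = np.linalg.eigh(X.T @ X / len(X))
print(abs(w @ vecs[:, 0]))       # close to 1: w aligns with the minimum-variance direction
```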