
Page 1:

Modular Neural Networks II

Presented by: David Brydon, Karl Martens, David Pereira

CPSC 533 - Artificial Intelligence, Winter 2000. Instructor: C. Jacob. Date: 16-March-2000

Page 2:

Presentation Agenda

- A Reiteration of Modular Neural Networks
- Hybrid Neural Networks
- Maximum Entropy
- Counterpropagation Networks
- Spline Networks
- Radial Basis Functions

Note: The information contained in this presentation has been obtained from Neural Networks: A Systematic Introduction by R. Rojas.

Page 3:

A Reiteration of Modular Neural Networks

There are many different types of neural networks - linear, recurrent, supervised, unsupervised, self-organizing, etc. Each of these neural networks has a different theoretical and practical approach.

However, each of these different models can be combined.

How? Each of the aforementioned neural networks can be transformed into a module that can be freely intermixed with modules of other types of neural networks.

Thus, we have Modular Neural Networks.

Page 4:

A Reiteration of Modular Neural Networks

But WHY do we have Modular Neural Network Systems?

- To Reduce Model Complexity
- To Incorporate Knowledge
- To Fuse Data and Predict Averages
- To Combine Techniques
- To Learn Different Tasks Simultaneously
- To Incrementally Increase Robustness
- To Emulate Its Biological Counterpart

Page 5:

Hybrid Neural Networks

A very well-known and promising family of architectures was developed by Stephen Grossberg.

It is called ART - Adaptive Resonance Theory. It is closer to the biological paradigm than feed-forward networks or standard associative memories. The dynamics of the networks resemble learning in humans. One-shot learning can be recreated with this model.

There are three different architectures in this family:
- ART-1: uses Boolean values
- ART-2: uses real values
- ART-3: uses differential equations

Page 6:

Hybrid Neural Networks

Each category in the input space is represented by a vector.

The ART networks classify a stochastic series of vectors into clusters.

All vectors located inside the cone around each weight vector are considered members of a specific cluster.

Each unit fires only for vectors located inside its associated ‘cone’ of radius ‘r’.

The value ‘r’ is inversely proportional to the attention parameter of the unit.

Small ‘r’ means the classification of the input space is fine (many narrow clusters).

Large ‘r’ means the classification of the input space is coarse (a few broad clusters).
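As a concrete illustration of this cone test, the short sketch below (not from the slides; the function name and the angular form of the test are my own assumptions) checks whether an input vector falls inside a unit's cone of radius r:

```python
import numpy as np

def inside_cone(x, w, r):
    """Hypothetical cone test: the unit 'fires' if the angle between the
    input vector x and the unit's weight vector w is at most r."""
    cos_angle = np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0)) <= r
```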

Page 7:

Hybrid Neural Networks

Fig. 1. Vector clusters and attention parameters

Page 8:

Hybrid Neural Networks

Once the weight vectors have been found, the network computes whether new data can or cannot be classified by the existing clusters.

If not, a new cluster is created with a new associated weight vector.

ART networks have two major advantages:
- Plasticity: the network can always react to unknown inputs, by creating a new cluster with a new weight vector if the given input cannot be classified by the existing clusters.
- Stability: existing clusters are not deleted by the introduction of new inputs; new clusters are simply created in addition to the old ones.

However, enough potential weight vectors must be provided.

Page 9:

Hybrid Neural Networks

Fig. 2. The ART-1 Architecture

Page 10:

Hybrid Neural Networks

The Structure of ART-1 (Part 1 of 2):

There are two basic layers of computing units.

Layer F1 receives binary input vectors from the input sites.

As soon as an input vector arrives it is passed to layer F1 and from there to layer F2.

Layer F2 contains elements which fire according to the “winner-takes-all” method. (Only the element receiving the maximal scalar product of its weight vector and input vector fires).

When a unit in layer F2 has fired, the negative weight turns off the attention unit. The winning unit in layer F2 also sends back a 1 through its connections to layer F1.

Each unit in layer F1 now receives as input the corresponding component of the input vector x and of the weight vector w.

Page 11:

Hybrid Neural Networks

The Structure of ART-1 (Part 2 of 2):

The i-th unit in F1 compares x_i with w_i and outputs the product x_i w_i.

The reset unit receives this information and also the components of x, weighted by p (the attention parameter), so that its own computation is

p(x1 + x2 + ... + xn) - x·w > 0

which is the same as

(x·w) / (x1 + x2 + ... + xn) < p

The reset unit fires only if the input lies outside the attention cone of the winning unit. A reset signal is sent to layer F2, but only the winning unit is inhibited.

This in turn activates the attention unit and a new round of computation begins. Hence, there is resonance.

Page 12:

Hybrid Neural Networks

The Structure of ART-1 (Some Final Details):

The weight vectors in layer F2 are initialized with all components equal to 1, and p is selected to satisfy 0 < p < 1. This ensures that eventually an unused vector will be recruited to represent a new cluster.

The selected weight vector w is updated by pulling it in the direction of x. This is done in ART-1 by turning off all components of w that are zero in x.

The purpose of the reset signal is to inhibit all units that do not resonate with the input. A unit in layer F2, which is still unused, can be selected for the new cluster containing x. In this way, sufficiently different input data can create a new cluster. By modifying the value of the attention parameter p, we can control the number of clusters and how wide they are.
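Putting the preceding slides together, here is a minimal sketch of this ART-1 behaviour, assuming binary NumPy input vectors and using the reset test from the previous slide. It is illustrative only and omits the actual F1/F2 resonance dynamics; the choice function dividing by (alpha + |w|) is the standard ART-1 normalization, added so that unused all-ones units do not always win the competition.

```python
import numpy as np

def art1_cluster(inputs, p, max_clusters, alpha=0.5):
    """Illustrative ART-1 style clustering (binary NumPy vectors).
    p is the attention (vigilance) parameter, 0 < p < 1."""
    n = len(inputs[0])
    # Unused weight vectors start with all components equal to 1.
    weights = [np.ones(n, dtype=int) for _ in range(max_clusters)]
    labels = []
    for x in inputs:
        # Winner-takes-all search, best-matching unit first.
        scores = [np.dot(w, x) / (alpha + w.sum()) for w in weights]
        for j in np.argsort(scores)[::-1]:
            w = weights[j]
            # Reset test: resonate only if x lies inside the attention cone,
            # i.e. (x . w) / (x1 + ... + xn) >= p.
            if np.dot(w, x) / max(x.sum(), 1) >= p:
                weights[j] = w & x   # turn off components of w that are 0 in x
                labels.append(int(j))
                break
        else:
            labels.append(None)      # not enough potential weight vectors
    return labels, weights
```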

Page 13:

Hybrid Neural Networks

The Structure of ART-2 and ART-3

ART-2 uses vectors that have real-valued components instead of Boolean components.

The dynamics of the ART-2 and ART-3 models are governed by differential equations.

However, simulating these differential equations in software is too time-consuming.

Consequently, implementations using analog hardware or a combination of optical and electronic elements are more suited to this kind of model.

Page 14:

Hybrid Neural Networks

Maximum Entropy

So what's the problem with ART? It tries to build clusters of the same size, independently of the distribution of the data.

So, is there a better solution? Yes: allow the clusters to have varying radii with a technique called the "Maximum Entropy Method".

What is "entropy"? The entropy H of a data set of N points assigned to k different clusters c1, c2, ..., ck is given by

H = -( p(c1) log(p(c1)) + p(c2) log(p(c2)) + ... + p(ck) log(p(ck)) )

where p(ci) denotes the probability of hitting the i-th cluster, when an element of the data set is picked at random.

Since the probabilities add up to 1, the clustering that maximizes the entropy is the one for which all cluster probabilities are identical. This means that the clusters will tend to cover the same number of points.
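For concreteness, here is a small snippet (my own, using natural logarithms) that computes the entropy of a clustering from the number of points in each cluster:

```python
import math

def clustering_entropy(counts):
    """Entropy H of a clustering, given the number of points per cluster."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

# H is maximal when every cluster covers the same number of points:
print(clustering_entropy([50, 50]))   # ~0.693 (= log 2)
print(clustering_entropy([90, 10]))   # ~0.325
```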

Page 15:

Hybrid Neural Networks

Maximum Entropy

However, there is still a problem whenever the number of elements of each class in the data set is different. Consider the case of unlabeled speech data: some phonemes are more frequent than others, and if a maximum entropy method is used, the boundaries between clusters will deviate from the natural solution and classify some data erroneously.

So how do we solve this problem? With the "Bootstrapped Iterative Algorithm":

- cluster: Compute a maximum entropy clustering with the training data. Label the original data according to this clustering.
- select: Build a new training set by selecting from each class the same number of points (random selection with replacement). Go to the previous step.
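A rough sketch of this cluster/select loop is shown below. It uses plain k-means as a stand-in for the maximum entropy clustering step and invents its own function names, so treat it as an outline of the idea rather than the algorithm from the slides.

```python
import numpy as np

def kmeans(points, k, rng, iters=20):
    """Plain k-means, used here only as a stand-in clustering routine."""
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = points[assign == c].mean(axis=0)
    return centers

def bootstrapped_clustering(data, k, rounds=10, seed=0):
    """Outline of the bootstrapped iterative algorithm."""
    rng = np.random.default_rng(seed)
    train = data
    for _ in range(rounds):
        # cluster: compute a clustering of the training data and label
        # the ORIGINAL data according to this clustering.
        centers = kmeans(train, k, rng)
        labels = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # select: build a new training set with the same number of points
        # from each class (random selection with replacement).
        per_class = len(data) // k
        train = np.vstack([
            data[rng.choice(np.flatnonzero(labels == c), per_class, replace=True)]
            for c in range(k) if np.any(labels == c)])
    return centers, labels
```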

Page 16:

Hybrid Neural Networks

Counterpropagation Network

Are there any other hybrid network models? Yes, the counterpropagation network as proposed by Hecht-Nielsen.

So what are counterpropagation networks designed for? To approximate a continuous mapping f and its inverse f^-1.

A counterpropagation network consists of an n-dimensional input vector which is fed to a hidden layer consisting of h cluster vectors. The output is generated by a single linear associator unit. The weights in the network are adjusted using supervised learning.

The above network can successfully approximate functions of the form f: R^n -> R.

Page 17:

Hybrid Neural Networks

Fig. 3 Simplified counterpropagation network

Page 18:

Hybrid Neural Networks

Counterpropagation Network

The training phase is completed in two parts:

- Training of the hidden layer into a clustering of the input space that corresponds to an n-dimensional Voronoi tiling. The hidden layer's output needs to be controlled so that only the element with the highest activation fires.
- The z_i weights are then adjusted to represent the value of the approximation for the cluster region.

This network can be extended to handle multiple output units.
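A minimal sketch of these two phases, assuming a single real-valued output and using a simple k-means-style loop for the hidden-layer clustering (the variable and function names are my own):

```python
import numpy as np

def train_counterprop(x, y, h, seed=0, iters=20):
    """Two-phase training sketch for a simplified counterpropagation
    network approximating f: R^n -> R with h hidden cluster units."""
    rng = np.random.default_rng(seed)
    # Phase 1: cluster the input space into a Voronoi tiling.
    centers = x[rng.choice(len(x), size=h, replace=False)].copy()
    for _ in range(iters):
        assign = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(h):
            if np.any(assign == c):
                centers[c] = x[assign == c].mean(axis=0)
    # Phase 2: set each z weight to the average target value in its region.
    z = np.array([y[assign == c].mean() if np.any(assign == c) else 0.0
                  for c in range(h)])
    return centers, z

def predict(centers, z, x_new):
    """Only the hidden unit with the highest activation fires; the output
    is that unit's z weight (a piecewise-constant approximation)."""
    winner = ((x_new - centers) ** 2).sum(-1).argmin()
    return z[winner]
```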

Page 19:

Hybrid Neural Networks

Fig. 4 Function approximation with a counterpropagation network.

Page 20:

Hybrid Neural Networks

Spline Networks

Can the approximation created by a counterpropagation network be improved on? Yes.

In the counterpropagation network the Voronoi tiling is composed of a series of horizontal tiles, each of which represents an average of the function in that region.

The spline network addresses this by extending the hidden layer of the counterpropagation network: each cluster unit is paired with a linear associator, and the cluster unit is used to inhibit or activate the linear associator, which is connected to all inputs.

This modification allows the resulting set of tiles to be oriented differently with respect to each other, creating an approximation with a smaller quadratic error and a better solution to the problem.

Training proceeds as before, except that the newly added linear associators are trained using backpropagation.
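The sketch below illustrates the idea of pairing each cluster unit with its own linear associator. For simplicity the local linear models are fit by least squares over each Voronoi region, whereas the slides train them with backpropagation; the names and details are assumptions.

```python
import numpy as np

def spline_net_fit(x, y, centers):
    """One linear associator per cluster region (local linear models)."""
    assign = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    models = []
    for c in range(len(centers)):
        xc, yc = x[assign == c], y[assign == c]
        if len(xc) == 0:
            models.append(np.zeros(x.shape[1] + 1))   # empty region: zero model
            continue
        xc1 = np.hstack([xc, np.ones((len(xc), 1))])  # add a bias column
        coeffs, *_ = np.linalg.lstsq(xc1, yc, rcond=None)
        models.append(coeffs)
    return models

def spline_net_predict(x_new, centers, models):
    """The winning cluster unit activates only its own linear associator."""
    winner = ((x_new - centers) ** 2).sum(-1).argmin()
    return np.hstack([x_new, 1.0]) @ models[winner]
```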

Page 21:

Hybrid Neural Networks

Fig. 5 Function approximation with linear associators

Page 22:

Hybrid Neural Networks

Radial Basis Functions

Radial basis function networks have a structure similar to that of the counterpropagation network. The difference is that the activation function used for each unit is Gaussian instead of sigmoidal.

The Gaussian approach uses locally concentrated functions.

The sigmoidal approach uses smooth step functions.

Which is better depends on the specific problem at hand. If the target function looks like a smooth step, the Gaussian approach will require more units; if the target function looks Gaussian, the sigmoidal approach will require more units.
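To make the contrast concrete, here is a minimal radial basis function sketch in which the hidden units use Gaussian (locally concentrated) activations and the output weights are fit by least squares; the fixed width and the least-squares fit are my own simplifications, not details from the slides.

```python
import numpy as np

def gaussian(x, center, width):
    """Locally concentrated activation (radial basis function)."""
    return np.exp(-np.sum((x - center) ** 2, axis=-1) / (2 * width ** 2))

def sigmoid(t):
    """Smooth-step activation used by sigmoidal units, shown for comparison."""
    return 1.0 / (1.0 + np.exp(-t))

def rbf_fit(x, y, centers, width):
    """Fit the output weights of an RBF network with fixed centers and width."""
    phi = np.stack([gaussian(x, c, width) for c in centers], axis=1)
    weights, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return weights

def rbf_predict(x_new, centers, width, weights):
    phi = np.array([gaussian(x_new, c, width) for c in centers])
    return phi @ weights
```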