
Page 1

Chapter 6
Neural Network Implementations

Page 2

Neural Network Implementations

Back-propagation networks

Learning vector quantizer networks

Kohonen self-organizing feature map networks

Evolutionary multi-layer perceptron networks

Page 3

The Iris Data Set

Consists of 150 four-dimensional vectors (50 plants of each of three Iris species): $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}),\; i = 1, \ldots, 150$

Features are: sepal length, sepal width, petal length and petal width

We are working with scaled values in the range [0,1]

Examples of patterns:
0.637500 0.437500 0.175000 0.025000   1 0 0
0.875000 0.400000 0.587500 0.175000   0 1 0
0.787500 0.412500 0.750000 0.312500   0 0 1
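The scaling mentioned above is done per feature before training. Below is a minimal C sketch of one common approach (min-max scaling of each column into [0,1]); the function and constant names are illustrative, not part of the book's software:

#include <float.h>

#define NUM_PATTERNS 150
#define NUM_INPUTS   4

/* Scale each feature column into [0,1] using its observed min and max. */
void scale_features(float x[NUM_PATTERNS][NUM_INPUTS])
{
    for (int i = 0; i < NUM_INPUTS; i++)
    {
        float lo = FLT_MAX, hi = -FLT_MAX;
        for (int k = 0; k < NUM_PATTERNS; k++)
        {
            if (x[k][i] < lo) lo = x[k][i];
            if (x[k][i] > hi) hi = x[k][i];
        }
        for (int k = 0; k < NUM_PATTERNS; k++)
            x[k][i] = (hi > lo) ? (x[k][i] - lo) / (hi - lo) : 0.0f;
    }
}

Any consistent mapping into [0,1] works, as long as the same mapping is applied to the training and test patterns.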

Page 4

Implementation Issues

•Topology

•Network initialization and normalization

•Feedforward calculations

•Supervised adaptation versus unsupervised adaptation

•Issues in evolving neural networks

Page 5

Topology

•Pattern of PEs and interconnections

•Direction of data flow

•PE activation functions

Back-propagation uses at least three layers; LVQ and SOFM use two.

Page 6

Definition: Neural Network Architecture

Specifications sufficient to build, train, test, and operate a neural network

Page 7

Back-propagation Networks

•Software on web site

•Topology

•Network input

•Feedforward calculations

•Training

•Choosing network parameters

•Running the implementation

Page 8

Elements of an artificial neuron (PE)

•Set of connection weights

•Linear combiner

•Activation function

Page 9

Back-propagation Network Structure

Page 10

Back-propagation network input

•Number of inputs depends on application

•Don’t combine parameters unnecessarily

•Inputs usually over range [0,1], continuous valued

•Type float in C/C++ (IEEE 754 single precision): 24-bit significand, 8-bit exponent; about 7 significant decimal digits

•Scaling usually used as a preprocessing tool

•Usually scale on like groups of channels
  •Amplitude
  •Time

Page 11

Feedforward Calculations

•Input PEs distribute signal forward along multiple paths

•Fully connected, in general

•No feedback loop, not even self-feedback

•Additive sigmoid PE is used in our implementation

Activation of the ith hidden PE for pattern k:

$y_{ki} = f_n\!\left(\sum_{h=0}^{n} v_{ih}\, x_{kh}\right)$

where $f_n(\cdot)$ is the sigmoid function and index 0 denotes the bias PE.

Page 12

Sigmoid Activation Function

$\text{output} = \dfrac{1}{1 + e^{-\text{input}}}$

Page 13

Feedforward calculations, cont’d.

•Sigmoid function performs job similar to electronic amplifier (gain is slope)

•Once hidden layer activations are calculated, outputs are calculated:

$z_{kj} = f_n\!\left(\sum_{i=0}^{h} w_{ji}\, y_{ki}\right)$

where $f_n(\cdot)$ is the sigmoid function.
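A minimal sketch of these feedforward calculations in C, using the notation above (v = input-to-hidden weights, w = hidden-to-output weights, index 0 = bias); the names and fixed sizes are illustrative, not the layout of the book's bp software:

#include <math.h>

#define N_IN   4   /* input PEs  */
#define N_HID  4   /* hidden PEs */
#define N_OUT  3   /* output PEs */

static float sigmoid(float net) { return 1.0f / (1.0f + expf(-net)); }

/* x: one input pattern; y: hidden activations; z: output activations */
void feedforward(const float x[N_IN],
                 const float v[N_HID][N_IN + 1],
                 const float w[N_OUT][N_HID + 1],
                 float y[N_HID], float z[N_OUT])
{
    for (int i = 0; i < N_HID; i++)
    {
        float net = v[i][0];                      /* weight from the bias PE */
        for (int h = 0; h < N_IN; h++)
            net += v[i][h + 1] * x[h];
        y[i] = sigmoid(net);
    }
    for (int j = 0; j < N_OUT; j++)
    {
        float net = w[j][0];                      /* weight from the bias PE */
        for (int i = 0; i < N_HID; i++)
            net += w[j][i + 1] * y[i];
        z[j] = sigmoid(net);
    }
}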

Page 14

Training by Error Back-propagation

Error per pattern:

$E_k = 0.5 \sum_{j=1}^{q} \left(b_{kj} - z_{kj}\right)^2$

Error signal for output PE j:

$\delta_{kj} = \left(b_{kj} - z_{kj}\right) f_n'(r_{kj}) = \left(b_{kj} - z_{kj}\right) z_{kj}\left(1 - z_{kj}\right)$

We derived this using the chain rule.

Page 15

Backpropagation Training, Cont’d.

•But we have to have weights initialized in order to update them.

•Often (usually) randomize [-0.3, 0.3]

•Two ways to update weights:
  Online, or “single pattern,” adaptation
  Off-line, or epoch, adaptation (we use this in our back-prop)

Page 16

Updating Output Weights

Basic weight update method:

$w_{ji}^{new} = w_{ji}^{old} + \eta \sum_{k} \delta_{kj}\, y_{ki}$

But this tends to get caught in local minima.

So, introduce a momentum term $\alpha \in [0, 1]$:

$w_{ji}^{new} = w_{ji}^{old} + \sum_{k}\left(\eta\, \delta_{kj}\, y_{ki} + \alpha\, \Delta w_{ji}^{old}\right)$

(includes bias weights)
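A sketch of epoch (off-line) adaptation of the output weights, continuing the feedforward sketch above; delta_out holds the error signals and grad_w accumulates the per-pattern terms over the epoch. Names are illustrative:

/* Accumulate the output-weight terms for one pattern. */
void accumulate_output_grad(const float b[N_OUT], const float z[N_OUT],
                            const float y[N_HID],
                            float delta_out[N_OUT],
                            float grad_w[N_OUT][N_HID + 1])
{
    for (int j = 0; j < N_OUT; j++)
    {
        delta_out[j] = (b[j] - z[j]) * z[j] * (1.0f - z[j]);  /* error signal */
        grad_w[j][0] += delta_out[j];                         /* bias input = 1 */
        for (int i = 0; i < N_HID; i++)
            grad_w[j][i + 1] += delta_out[j] * y[i];
    }
}

/* Apply the accumulated update once per epoch, with momentum. */
void update_output_weights(float w[N_OUT][N_HID + 1],
                           float dw_old[N_OUT][N_HID + 1],
                           float grad_w[N_OUT][N_HID + 1],
                           float eta, float alpha)
{
    for (int j = 0; j < N_OUT; j++)
        for (int i = 0; i <= N_HID; i++)
        {
            float dw = eta * grad_w[j][i] + alpha * dw_old[j][i];
            w[j][i] += dw;
            dw_old[j][i] = dw;     /* remembered for the momentum term  */
            grad_w[j][i] = 0.0f;   /* reset accumulator for next epoch  */
        }
}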

Page 17

Updating Hidden Weights

As derived previously:

$\delta_{ki} = y_{ki}\left(1 - y_{ki}\right) \sum_{j=1}^{q} \delta_{kj}\, w_{ji}$

So,

$v_{ih}^{new} = v_{ih}^{old} + \sum_{k}\left(\eta\, \delta_{ki}\, x_{kh} + \alpha\, \Delta v_{ih}^{old}\right)$

Note: δ’s are calculated one pattern at a time, and are calculated using “old” weights.
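The corresponding sketch for the hidden-layer error signals and accumulated updates, continuing the same illustrative sketch; note that the old output weights w are used, as stated above:

/* Hidden-layer error signals and accumulated updates for one pattern. */
void accumulate_hidden_grad(const float x[N_IN], const float y[N_HID],
                            const float delta_out[N_OUT],
                            const float w[N_OUT][N_HID + 1],
                            float grad_v[N_HID][N_IN + 1])
{
    for (int i = 0; i < N_HID; i++)
    {
        float sum = 0.0f;
        for (int j = 0; j < N_OUT; j++)
            sum += delta_out[j] * w[j][i + 1];     /* back-propagated error */
        float delta_hid = y[i] * (1.0f - y[i]) * sum;
        grad_v[i][0] += delta_hid;                 /* bias input = 1 */
        for (int h = 0; h < N_IN; h++)
            grad_v[i][h + 1] += delta_hid * x[h];
    }
}
/* grad_v is then applied once per epoch exactly as in update_output_weights(). */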

Page 18

Keep in mind…

In offline training: The deltas are calculated pattern by pattern, while the weights are updated once per epoch.

The values for η and α are usually assigned to the entire network, and left constant after good values are found.

When the δ’s are calculated for the hidden layer, the old (existing) weights are used.

Page 19

Kohonen Networks

Probably second only to back-propagation in number of applications

Rigorous mathematical derivation has not occurred

Seem to be more biologically oriented than most paradigms

Reduce dimensionality of inputs

We’ll consider LVQI, LVQII, and Self-Organizing Feature Maps

Page 20

Initial Weight Settings

1. Randomize weights in [0, 1].

2. Normalize weights:

$w_{ji}^{norm} = \dfrac{w_{ji}^{random}}{\sqrt{\sum_{i=1}^{p}\left(w_{ji}^{random}\right)^{2}}}$

•Note: Randomization often occurs in the centroid area of the problem space.
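A minimal C sketch of the normalization step in item 2; names are illustrative:

#include <math.h>

/* Normalize one PE's p-component weight vector to unit Euclidean length. */
void normalize_weight_vector(float w[], int p)
{
    float len = 0.0f;
    for (int i = 0; i < p; i++)
        len += w[i] * w[i];
    len = sqrtf(len);
    if (len > 0.0f)
        for (int i = 0; i < p; i++)
            w[i] /= len;
}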

Page 21

Preprocessing Alternatives

1. Transform each variable onto [-1, 1]
2. Then normalize by:

a. Dividing each vector component by total length:

$a'_{ki} = \dfrac{a_{ki}}{l_k}, \quad \text{where } l_k = \sqrt{\sum_{h} a_{kh}^{2}}$

or by

b. “Z-axis” normalization with a “synthetic” variable:

$a'_{ik} = f\, a_{ik}, \quad s_k = f \sqrt{n - l_k^{2}}, \quad f = \dfrac{1}{\sqrt{n}}$

or by

c. Assigning a fixed interval (perhaps 0.1 or 1/n, whichever is smaller) to a synthetic variable that serves as the scale factor in a., scaled to the fixed interval.
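A sketch of option b, assuming the usual Z-axis formulation (scale every component by f = 1/sqrt(n) and append a synthetic component that restores unit length); this is an illustrative interpretation, not the book's preprocessing code:

#include <math.h>

/* Z-axis normalization: components assumed already in [-1,1].
   out[] must have room for n + 1 values. */
void z_axis_normalize(const float a[], int n, float out[])
{
    float f = 1.0f / sqrtf((float)n);
    float len2 = 0.0f;
    for (int i = 0; i < n; i++)
        len2 += a[i] * a[i];                 /* squared length l^2     */
    for (int i = 0; i < n; i++)
        out[i] = f * a[i];                   /* scaled original values */
    out[n] = f * sqrtf((float)n - len2);     /* synthetic variable s   */
}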

Page 22

Euclidean Distance

$d_j(t) = \sqrt{\sum_{i=1}^{p}\left[a_{ki}(t) - w_{ji}(t)\right]^{2}}$

for the jth PE and the kth pattern

Page 23

Distance Measures

$d_j^{(l)} = \left[\sum_{i=1}^{n}\left|a_{ki} - w_{ji}\right|^{l}\right]^{1/l}$

l = 1: Hamming distance
l = 2: Euclidean distance
l = 3: ???
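The general distance of order l as a small C sketch (illustrative names):

#include <math.h>

/* Distance of order l between pattern a and weight vector w (n components);
   l = 1 and l = 2 give the two named cases above. */
float distance_of_order_l(const float a[], const float w[], int n, float l)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += powf(fabsf(a[i] - w[i]), l);
    return powf(sum, 1.0f / l);
}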

Page 24

Weight Updating

Weights are adjusted in the neighborhood only:

$w_{ji}(t+1) = w_{ji}(t) + \eta(t)\left[a_{ki}(t) - w_{ji}(t)\right] \quad \text{for } j \in N$

Sometimes,

$\eta(t) = 0.2\left(1 - \dfrac{t}{z}\right), \quad \text{where } z = \text{total no. of iterations}$

Rule of thumb: No. of training iterations should be about 500 times the number of output PEs.
* Some people start out with eta = 1 or near 1.
* The initial neighborhood should include most or all of the output PE field.
* Options exist for configuration of the output slab: ring, cylindrical surface, cube, etc.

Page 25

Error Measurement

*Unsupervised, so no “right” or “wrong”

*Two approaches – pick or mix

* Define error as mean error vector length

* Define error as max error vector length (adding PE when this is large could improve performance)

* Convergence metric: max_error_vector_length / eta

(best when epoch training is used)

Page 26

Learning Vector Quantizers: Outline

•Introduction

•Topology

•Network initialization and input

•Unsupervised training calculations

•Giving the network a conscience

•LVQII

•The LVQI implementation

Page 27

Learning Vector Quantization: Introduction

•Related to SOFM

•Several versions exist, both supervised and unsupervised

•LVQI is unsupervised; LVQII is supervised (I & II do not correspond to Kohonen’s notation)

•Related to perceptrons and the delta rule; however:
  * Only one (winner) PE’s weights are updated
  * Depending on version, updating is done for correct and/or incorrect classification
  * Weight updating method is analogous to the metric used to pick the winning PE for updating
  * Network weight vectors approximate the density function of the input

Page 28

LVQ-I Network Topology

Page 29

LVQI Network Initialization and Input

•LVQI clusters input data

•More common to input raw data (preprocessed)

•Usually normalize input vectors, but sometimes better not to

•Initial normalization of weight vectors almost always done, but in various ways

•In the implementation, for p PEs in the output layer, the first p patterns are chosen randomly to initialize the weights

Page 30

Weight and Input Vector Initialization

(a) before and (b) after input vector normalization

Page 31

LVQ Version I - Unsupervised Training

•Present one pattern at a time, and select winning output PE based on minimum Euclidean distance

•Update weights:

$w_{ji}^{new} = w_{ji}^{old} + \eta(t)\left(a_{ki} - w_{ji}^{old}\right) \quad \text{for the winner only}$

$w_{ji}^{new} = w_{ji}^{old} \quad \text{for all others}$

•Continue until weight changes are acceptably small or max. iterations occur

•Ideally, output will reflect probability distribution of input

•But, what if we want to more accurately characterize the decision hypersurface?

•Important to have training patterns near decision hypersurface
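One LVQ-I training step as a C sketch (flat weight array w, indexed w[j*p + i]; names are illustrative):

#include <float.h>

/* Select the winner by minimum (squared) Euclidean distance and move
   only its weight vector toward the input pattern a. */
void lvq1_step(const float a[], float w[], int p, int n_out, float eta)
{
    int winner = 0;
    float best = FLT_MAX;
    for (int j = 0; j < n_out; j++)
    {
        float d = 0.0f;
        for (int i = 0; i < p; i++)
        {
            float diff = a[i] - w[j * p + i];
            d += diff * diff;
        }
        if (d < best) { best = d; winner = j; }
    }
    for (int i = 0; i < p; i++)
        w[winner * p + i] += eta * (a[i] - w[winner * p + i]);  /* winner only */
}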

Page 32

Giving the Network a Conscience

•The optimal 1/n representation by each output PE is unlikely (without some “help”)

•This is especially serious when initial weights don’t reflect the probability distribution of the input patterns

•DeSieno developed a method for adding a conscience to the network

In the example: with no conscience, given a uniform distribution of input patterns, w7 will win about half of the time, and the other weights about 1/12 of the time each.

Page 33

Conscience Equations

(a) $y_j = 1$ for the winning PE, i.e., the PE with minimum $\left(d_j - b_j\right)$; $y_j = 0$ for all other PEs

(b) $f_j^{new} = f_j^{old} + \beta\left(y_j - f_j^{old}\right)$

(c) $b_j = \gamma\left(\dfrac{1}{n} - f_j\right)$
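The three equations as a C sketch (illustrative names; d[] holds the Euclidean distances already computed for the current pattern):

/* Conscience update: pick the winner by minimum (d[j] - b[j]), then
   refresh the conscience factors f[] and biases b[]. */
void conscience_update(const float d[], float f[], float b[],
                       int n_out, float beta, float gamma)
{
    int winner = 0;
    for (int j = 1; j < n_out; j++)
        if (d[j] - b[j] < d[winner] - b[winner])
            winner = j;

    for (int j = 0; j < n_out; j++)
    {
        float y = (j == winner) ? 1.0f : 0.0f;    /* equation (a) */
        f[j] += beta * (y - f[j]);                /* equation (b) */
        b[j] = gamma * (1.0f / n_out - f[j]);     /* equation (c) */
    }
}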

Page 34

Conscience Parameters

•Conscience factor fj with initial value = 1/n (so initial bias values are all 0)

•Bias factor γ set approximately to 10

•Constant β set to about 0.0001 (set β so that the conscience factors don’t reflect noise in the data)

Page 35

Example of Conscience

If there are 5 output PEs, then 1/n = 0.2 = all initial f_j values.

Biases are 0 initially, and the first winner is selected based on the minimum Euclidean distance.

Conscience factors are now updated:
Winner’s f_j = 0.2 + 0.0001(1.0 - 0.2) = 0.20008
All others’ f_j = 0.2 - 0.00002 = 0.19998

Winner’s b_j = -0.0008; all others’ b_j = 0.0002

Page 36

Probability Density Function

Shows regions of equal area

Page 37

Learning: No Conscience

A = 0.03 for 16,000 iterations

Page 38

Learning: With Conscience

A = 0.03 for 16,000 iterations

Page 39

With Conscience, Better Weight Allocation

Page 40

LVQ Version II - Supervised

* Instantiate the first p a_k vectors to the weights w_ji

* Relative numbers of weights assigned by class must correspond to a priori probabilities of classes

* Assume pattern A_k belongs to class C_r and that the winning PE’s weight vector belongs to class C_s; then for the winning PE:

$w_{ji}^{new} = w_{ji}^{old} + \eta(t)\left(a_{ki} - w_{ji}^{old}\right) \quad \text{if } C_r = C_s$

$w_{ji}^{new} = w_{ji}^{old} - \eta(t)\left(a_{ki} - w_{ji}^{old}\right) \quad \text{if } C_r \neq C_s$

For all other PEs, no weight changes are done.
* This LVQ version reduces misclassifications.
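A C sketch of the supervised update for the winning PE (illustrative names; the winner has already been selected by minimum distance):

/* Move the winner toward the input if its class matches the pattern's
   class (C_r = C_s), away from it otherwise. Other PEs are unchanged. */
void lvq2_update_winner(const float a[], float w[], int p, int winner,
                        int pattern_class, int winner_class, float eta)
{
    float sign = (pattern_class == winner_class) ? 1.0f : -1.0f;
    for (int i = 0; i < p; i++)
        w[winner * p + i] += sign * eta * (a[i] - w[winner * p + i]);
}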

Page 41

Evolving Neural Networks: Outline

•Introduction and definitions

•Artificial neural networks

•Adaptation and computational intelligence

•Advantages and disadvantages of previous approaches

•Using particle swarm optimization (PSO)

•An example application

•Conclusions

Page 42

Introduction

•Neural networks are very good at some problems, such as mapping input vectors to outputs

•Evolutionary algorithms are very good at other problems, such as optimization

•Hybrid tools are possible that are better than either approach by itself

•Review articles on evolving neural networks: Schaffer, Whitley, and Eshelman (1992); Yao (1995); and Fogel (1998)

•Evolutionary algorithms usually used to evolve network weights, but sometimes used to evolve structures and/or learning algorithms

Page 43

Typical Neural Network

OUTPUTS

INPUTS

Page 44

More Complex Neural Network

Page 45

Evolutionary Algorithms (EAs) Applied to Neural Network Attributes

•Network connection weights

•Network topology (structure)

•Network PE transfer function

•Network learning algorithms

Page 46

Early Approaches to Evolve Weights

•Bremermann (1968) suggested optimizing weights in multilayer neural networks.

•Whitley (1989) used GA to learn weights in feedforward network; used for relatively small problems.

•Montana and Davis (1989) used “steady state” GA to train 500-weight neural network.

•Schaffer (1990) evolved a neural network with better generalization performance than one designed by human.

Page 47

Evolution of Network Architecture

•Most work has focused on evolving network topological structure

•Less has been done on evolving processing element (PE) transfer functions

•Very little has been done on evolving topological structure and PE transfer functions simultaneously

Page 48

Examples of Approaches

•Indirect coding schemes
  Evolve parameters that specify network topology
  Evolve number of PEs and/or number of hidden layers

•Evolve developmental rules to construct network topology

•Stork et al. (1990) evolved both network topology and PE transfer functions (Hodgkin-Huxley equation) for neuron in tail-flip circuitry of crayfish (only 7 PEs)

•Koza and Rice (1991) used genetic programming to find weights and topology. They encoded a tree structure of Lisp S-expressions in the chromosome.

Page 49

Examples of Approaches, Cont’d.

•Optimization of EA operators used to evolve neural networks (optimize hill-climbing capabilities of GAs)

•Summary:
  •Few quantitative comparisons with other approaches typically given (speed of computation, performance, generalization, etc.)
  •Comparisons should be between best available approaches (fast EAs versus fast NNs, for example)

Page 50

Advantages of Previous Approaches

•EAs can be used to train neural networks with non-differentiable PE transfer functions.

•Not all PE transfer functions in a network need to be the same.

•EAs can be used when error gradient or other error information is not available.

•EAs can perform a global search in a problem space.

•The fitness of a network evolved by an EA can be defined in a way appropriate for the problem. (The fitness function does not have to be continuous or differentiable.)

Page 51

Disadvantages of Previous Approaches

•GAs do not generally seem to be better than best gradient methods such as quickprop in training weights

•Evolution of network topology is often done in ways that result in discontinuities in the search space (e.g., removing and inserting connections and PEs). Networks must therefore be retrained, which is computationally intensive.

•Representation of weights in a chromosome is difficult.
  •Order of weights?
  •Encoding method?
  •Custom designed genetic operators?

Page 52

Disadvantages of Previous Approaches, Cont’d.

Permutation problem (also known as competing conventions problem or isomorphism problem ): Multiple chromosome configurations can represent equivalent optimum solutions. Example: various permutations of hidden PEs can represent equivalent networks.

We believe, as does Hancock (1992), that this problem is not as severe as reported. (In fact, it may be an advantage.)

Page 53

Evolving Neural Networks with Particle Swarm Optimization

•Evolve neural network capable of being universal approximator, such as backpropagation or radial basis function network.

•In backpropagation, the most common PE transfer function is the sigmoidal function: output = 1/(1 + e^(-input))

•Eberhart, Dobbins, and Simpson (1996) first used PSO to evolve network weights (replaced backpropagation learning algorithm)

•PSO can also be used to indirectly evolve the structure of a network. An added benefit is that the preprocessing of input data is made unnecessary.

Page 54

Evolving Neural Networks with Particle Swarm Optimization, Cont’d.

•Evolve both the network weights and the slopes of sigmoidal transfer functions of hidden and output PEs.

•If the transfer function now is output = 1/(1 + e^(-k*input)), then we are evolving k in addition to evolving the weights.

•The method is general, and can be applied to other topologies and other transfer functions.

•Flexibility is gained by allowing slopes to be positive or negative. A change in sign for the slope is equivalent to a change in signs of all input weights.

Page 55

Evolving the Network Structure with PSO

•If evolved slope is sufficiently small, sigmoidal output can be clamped to 0.5, and hidden PE can be removed. Weights from bias PE to each PE in next layer are increased by one-half the value of the weight from the PE being removed to the next-layer PE. PEs are thus pruned, reducing network complexity.

•If evolved slope is sufficiently high, sigmoid transfer function can be replaced by step transfer function. This works with large negative or positive slopes. Network computational complexity is thus reduced.
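A sketch of the pruning step described in the first bullet, continuing the earlier illustrative constants (N_HID; w[j][0] is the bias weight of output PE j); this is an interpretation of the rule, not the book's code:

/* Remove hidden PE `dead` whose sigmoid output is clamped at 0.5:
   fold 0.5 times its outgoing weight into each next-layer bias weight
   (the bias PE outputs 1.0), so each next-layer net input is unchanged. */
void prune_hidden_pe(float w[][N_HID + 1], int n_out, int dead)
{
    for (int j = 0; j < n_out; j++)
    {
        w[j][0] += 0.5f * w[j][dead + 1];
        w[j][dead + 1] = 0.0f;    /* connection effectively removed */
    }
}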

Page 56

Evolving the Network Structure with PSO, Cont’d.

•Since slopes can evolve to large values, input normalization is generally not needed. This simplifies applications process and shortens development time.

•The PSO process is continuous, so neural network evolution is also continuous. No sudden discontinuities exist such as those that plague other approaches.

Page 57

Example Application: the Iris Data Set

•Introduced by Anderson (1935), popularized by Fisher (1936)

•150 records total; 50 of each of 3 varieties of iris flowers

•Four attributes in each record
  •sepal length
  •sepal width
  •petal length
  •petal width

•We used both normalized and unnormalized versions of the data set; all 150 patterns were used to evolve a neural network. Issue of generalization was thus not addressed.

Page 58

Example Application, Continued

•Values of -k*input > 100 resulted in clamping PE transfer output to zero, to avoid computational overflow.

•Normalized version of data set first used to test concept of evolving both weights and slopes. Next we looked at threshold value for slope at which the sigmoidal transfer function could be transitioned into a step function without significant loss in performance.

Page 59

Performance Variations with Slope Thresholds

For each threshold value, 40 runs of 1000 generations were made on the 150-pattern data set.

Slope threshold s    Total number         Average number     Variance
(absolute value)     correct in 40 runs   correct per run
None                 5914                 147.85             1.57
80                   5914                 147.85             1.57
40                   5911                 147.78             1.77
20                   5904                 147.60             1.94
10                   5894                 147.35             2.08
4                    5814                 145.35             62.75

Discussion of Example Application

•Average number of errors was 2.15 out of 150 with no slope threshold. (This is a good result for this data set.)

•Accuracy degrades gracefully until slope threshold decreases to 4.

•Preliminary indication is that slopes can be evolved, and that a slope threshold of about 10 to 20 would be reasonable for this problem.

•Other data sets are being examined.

•More situations with slopes near zero are being tested.

Page 60

Un-normalized Data Set Results

One set of runs; 40 runs of 1000 generations

Number correct:                    149  148  147  146  145  144  100   99
Number of runs with this number:    11   16    6    3    1    1    1    1

Good solution obtained in 38 of 40 runs. Average number correct was 145.45. Ignoring two worst solutions, average of only 2 mistakes.

Page 61

Examples of Recent Applications

•Scheduling (Integrated automated container terminal)

•Manufacturing (Product content combination optimization)

•Figure of merit for electric vehicle battery pack

•Optimizing reactive power and voltage control

•Medical analysis/diagnosis (Parkinson’s disease and essential tremor)

•Human performance prediction (cognitive and physical)

Page 62

Conclusions

•Brief review of applying EC techniques to evolving neural networks was presented. Advantages and disadvantages were summarized.

•A new methodology for using particle swarm optimization to evolve network weights and structures was presented.

•The methodology seems to overcome the first four disadvantages discussed.

•We believe that multimodality is a help rather than a hindrance with EAs (including PSO).

•Iris Data Set was used as an example of new approach.

Page 63

The BP Software

An implementation of a fully-connected feed-forward network.

main() routine
BP_Start_Up() reads parameters from the input (run) file and allocates memory
BP_Clean_Up() stores results in the output file and de-allocates memory

bp_state_handler() is the most important part of the BP state machine

Output PEs can be linear or sigmoid; hidden are always sigmoid.

Number of layers and number of PEs per layer can be specified.

Page 64

Back-prop. State Transition Diagram

Page 65

BP Software, Cont’d.

Enumeration data types used for:

•NN operating mode (train or recall)

•PE function type

•Nature of the layer (input, hidden, output)

•Training mode (offline or online)

•States in the state machine

Page 66

Enumeration Data Types for All NNs

Page 67

Enumeration Data Types for Back-prop.

Page 68

BP Software, Cont’d.

Structure data types used for:

•PE configuration

•Network configuration

•Environment and training parameters

•Network architecture

•Pattern configuration

Page 69

Structure Data Type Example

Structure data type BP_Arch_Type defines the network architecture:

Number of layers

Pointer to layers

Pointer to number of PEs in hidden layers

Page 70

BP State Handler

•Total of 15 states

•Most important part of the state machine

•Routes program to proper state

Page 71

Running the BP Software

To run, you need bp.exe and a run file, such as iris_bp.run

First train, then test.

For example:

To train, run: bp iris_bpr.run
You will get: bp_res.txt (weights of the trained net)
You will see (or you can >filename1): error values for each iteration

To test, run: bp iris_bps.run
You will get: bp_test.txt (summary of correct patterns)
You will see (or >filename2): detailed results
(I run bp iris_bps.run >irisres.txt)

Page 72

Sample BP Run File

0         0=train, 1=test
0         if train: 0=batch, 1=sequential
0.075     learning rate
0.15      momentum rate
0.01      error termination criterion (not implemented)
10000     max number of generations
99        number of training patterns
3         number of layers (3 -> one hidden layer)
4         number of PEs in hidden layer
150       total number of patterns in pattern file
4         dimension of input
3         dimension of output
iris.dat  data file

Page 73

Choosing BP Network Parameters

How many hidden PEs?

Guess/estimate:

number of hidden PEs $\approx C\sqrt{n_i^{2} + n_l^{2}}$, where C is in [1, 2] and $n_i$, $n_l$ are the numbers of input and output PEs

(This is only a “rule of thumb.”)

Page 74

Choosing BP Network Parameters

•Too few hidden PEs, and network won’t generalize or won’t train

•Too many hidden PEs, and the net will “memorize”

•Assign one output PE per class

•Probably best to start with low values for η and α

•Avoid getting stuck on an error value that’s too high, maybe .06 or .08 SSE/pattern/PE

•I often try values of η between 0.02 and 0.20, and α = [0.01, 0.10]

Page 75

The Kohonen Network Implementations

Learning vector quantization (LVQ) software implementation is presented first.

The self-organizing feature map (SOFM) is presented next.

Page 76

LVQ Software General definitions (in BP section) are still valid. New data types are defined in enumeration and structure data type code.

Enumeration types: Network can be trained randomly or sequentially, and can use (or not use) a conscience (described later).

Structure types: Establish PE type, define environment parameters such as training parameters, flag for conscience, and the number of clusters, which is the number of output PEs.

Page 77

LVQ Software, Cont’d.

Page 78

LVQ Software, Cont’d.

main() routine
LVQ_Start_Up() reads parameters from the input (run) file and allocates memory
LVQ_Main_Loop() is the primary part of the implementation
LVQ_Clean_Up() stores results in the output file and de-allocates memory

The LVQ implementation has 13 states.

Page 79

LVQ State Diagram for Training Mode

Page 80

LVQ Software, Cont’d.

Output PEs are linear. Weights (from all inputs to an output) are normalized.

Euclidean distance calculated between input vector and each weight vector.

The output PE with the smallest distance between input and weight vectors is selected as winner.

Weight vector of winning PE is updated, then the learning rate is updated.

If conscience is used, the conscience factor is updated.

Page 81

LVQ Run File

0         0=train, 1=test
0         0=random pattern selection, 1=sequential
0.3       initial learning rate
0.999     learning rate shrinking factor
10        bias factor (gamma)
0.0001    beta
0.001     training termination criterion
500       max number of iterations
99        number of training patterns
1         1=conscience
6         max number of clusters
150       total number of patterns
4         input dimension
3         output dimension
iris.dat  data file


Page 82

LVQ Results File Example

0.789628 0.573990 0.213485 0.038044   <- weights to first output PE (first cluster)
0.696514 0.335583 0.592744 0.225625
0.727000 0.299744 0.589254 0.185483
0.808415 0.529362 0.254345 0.039350
0.207525 0.075463 0.130591 0.966532
0.760180 0.348239 0.524717 0.159773   <- weights to sixth output PE (sixth cluster)

Page 83

LVQ Test File Example

Cluster   Class 0   Class 1   Class 2
-------------------------------------
   0         0         0        26
   1         0        25         0
   2         0        22         6
   3        29         0         0
   4        21         0         0
   5         0         3        18

Class 0: clusters 3 and 4

Class 1: clusters 1 and 2

Class 2: clusters 0 and 5

141 out of 150 clustered “correctly”

Page 84

Self-Organizing Feature Maps

An extension of LVQ; use LVQ features such as the conscience

Also developed by Teuvo Kohonen

Utilize slabs of PEs

Incorporate the concept of a neighborhood

Primary features of the input cause corresponding local responses in the output PE field.

Are non-linear mappings of the input space onto the output PE space (field).

Page 85

SOFM Slab of PEs

•PEs in a slab have similar attributes.

•The slab has a fixed topology.

•Most slabs are two-dimensional.

Page 86

Hexagonal Slab of PEs

Page 87

SOFM Network Model

More likely to use raw data as input to SOFM.

Kohonen often initializes weight vectors to be between 0.4 and 0.6 in length.

Winning output PE has minimum Euclidean distance between input and weight vectors.

$d_{kj} = \sum_{i=1}^{n}\left(a_{ki} - w_{ji}\right)^{2}$

(Can use conscience)

Page 88

SOFM Weight Updating

$w_{ji}(t+1) = w_{ji}(t) + \eta(t)\left[a_{ki}(t) - w_{ji}(t)\right]$

Sometimes,

$\eta(t) = 0.2\left(1 - \dfrac{t}{z}\right)$

where z = total number of iterations, and t is the iteration index.

Weight updates made to winning PE and its neighborhood.Learning coefficient and neighborhood both shrink over time.
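A C sketch of one neighborhood update on a square slab (PE j sits at row r, column c, with j = r*side + c; names are illustrative):

#include <stdlib.h>

/* Adapt the winner and every PE within `radius` rows/columns of it. */
void sofm_update(const float a[], float w[], int p, int side,
                 int win_row, int win_col, int radius, float eta)
{
    for (int r = 0; r < side; r++)
        for (int c = 0; c < side; c++)
        {
            if (abs(r - win_row) > radius || abs(c - win_col) > radius)
                continue;                       /* outside the neighborhood */
            int j = r * side + c;
            for (int i = 0; i < p; i++)
                w[j * p + i] += eta * (a[i] - w[j * p + i]);
        }
}

(An equally weighted square neighborhood is assumed here, corresponding roughly to the chef's-hat option in the run file; the other hat shapes vary the update with distance from the winner.)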

Page 89

SOFM Neighborhood Types

Page 90

Hats

Sombrero, stovepipe hat, chef’s hat

Page 91

SOFM Phases of Learning

Two phases of learning in the Kohonen SOFM:

1. Topological ordering, where the weight vectors order themselves.

2. Convergence, in which fine tuning occurs.

Page 92

SOFM Hints

Rule of thumb: No. of training iterations should be about 500 times the number of output PEs.

Some people start out with eta near 1.0.

The initial neighborhood should include most or all of the output PE slab.

Options exist for the configuration of the output slab: ring, cylindrical surface, cube, etc.

Page 93

SOFM Error Measurement

Unsupervised, so no right or wrong

Two approaches – pick or mix
• Define error as mean error vector length
• Define error as max error vector length (adding a PE when this is large could improve performance)

Convergence metric could be: max_error_vector_length / eta

(best when epoch training is used)

Page 94

SOFM Advantages

•Can do real-time non-parametric pattern classification

•Don’t need to know classes a priori

•Does nearest neighbor-like classifications

•Relatively simple paradigm

•Can deal with many classes

•Can handle high-dimensionality inputs

Page 95

SOFM Disadvantages

•Long training time

•Can’t add new classes without retraining

•Hard to figure out how to implement

•Not good with parameterized data

•Must normalize input patterns (?)

Page 96

SOFM Applications

•Speech processing

•Image processing

•Data compression

•Combinatorial optimization

•Robot control

•Sensory mapping

•Preprocessing

Page 97

SOFM Run File

0         Training/recall: 0 = train, 1 = recall
0         Training mode if training: 0 = random
0.3       Learning rate
0.999     Shrinking coefficient
10        Bias factor
0.0001    Beta
0.001     Training error criterion for termination
500       Maximum number of generations
99        Number of patterns used for training
1         1 = conscience; 0 = no conscience
1         Initial width of neighborhood
1         Initial height of neighborhood
4         Output slab height
4         Output slab width
0         Neighborhood function type (0 = chef hat)
150       Total number of patterns
4         Input dimension
3         Output dimension
iris.dat  Data file for patterns

Page 98

SOFM Weights File

0.762695 0.409230 0.477594 0.150768   <- weights from inputs to first output PE
0.776556 0.443671 0.428612 0.128095
0.778668 0.421259 0.446406 0.130147
0.765185 0.391521 0.488735 0.149472
0.769893 0.418876 0.460760 0.139670
0.784809 0.461558 0.398094 0.112071
0.785969 0.437032 0.421340 0.117167
0.771124 0.401549 0.473482 0.141214
0.744240 0.379303 0.521246 0.174752
0.757758 0.397594 0.492467 0.158740
0.758743 0.376317 0.507493 0.158574
0.748811 0.363523 0.527234 0.170756
0.731007 0.357475 0.549875 0.188358
0.745425 0.374062 0.523810 0.173326
0.752147 0.362909 0.524794 0.164813
0.736854 0.345378 0.551698 0.182727   <- weights from inputs to last output PE

(Figure: the 16 output PEs arranged as a 4 x 4 slab, with the first and last PEs indicated.)

Page 99

SOFM Test Results

           Class 0   Class 1   Class 2
---------------------------------------
00 00         0         0         0
00 01         0         0         0
00 02         0         0         0
00 03         0         1         0
01 00         0         0         0
01 01        50         0         0
01 02         0         0         0
01 03         0         1         0
02 00         0         3         0
02 01         0         1         0
02 02         0         4         0
02 03         0         1         0
03 00         0         7        25
03 01         0         3         0
03 02         0        14         0
03 03         0        15        25

Also output is cluster assignment for each pattern.

Page 100

Attributes Needed to Specify a Kohonen SOFM

Number and configuration of input PEs
Number and configuration of output PEs
Dimensionality of output slab (1, 2, 3, etc.)
Geometry of output slab (square or hexagonal neighborhood, wraparound or not)
Neighborhood definition as a function of time
Learning coefficient as a function of time and space
Initialization of weights
Preprocessing (normalization) and presentation (random or not) of inputs
Method to select winner (Euclidean distance or dot product)

Page 101

Summary of SOFM Process

Allocate storage
Read weights and patterns
Loop through iterations
  Loop through patterns
    Compute activations
    Find winning PE
    Adapt weights of winner and its neighborhood
  Shrink neighborhood size
  Reduce learning coefficient eta
  If eta <= 0, break
Write final weights
Write activation values
Free storage
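The same process as a C skeleton, reusing the sofm_update() sketch given with the SOFM weight-updating slide; file I/O and storage allocation are omitted, and all names are illustrative:

#include <float.h>

void sofm_update(const float a[], float w[], int p, int side,
                 int win_row, int win_col, int radius, float eta);

void sofm_train(const float *patterns, float *w, int p, int side,
                int num_patterns, int max_iterations)
{
    int initial_radius = side / 2;
    for (int t = 0; t < max_iterations; t++)
    {
        /* learning coefficient and neighborhood both shrink over time */
        float eta = 0.3f * (1.0f - (float)t / max_iterations);
        int radius = (int)(initial_radius * (1.0f - (float)t / max_iterations));
        if (eta <= 0.0f)
            break;
        for (int k = 0; k < num_patterns; k++)
        {
            const float *a = patterns + k * p;
            int win = 0;
            float best = FLT_MAX;
            for (int j = 0; j < side * side; j++)      /* find winning PE */
            {
                float d = 0.0f;
                for (int i = 0; i < p; i++)
                {
                    float diff = a[i] - w[j * p + i];
                    d += diff * diff;
                }
                if (d < best) { best = d; win = j; }
            }
            /* adapt winner and its neighborhood */
            sofm_update(a, w, p, side, win / side, win % side, radius, eta);
        }
    }
}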

Page 102

Evolutionary Back-Propagation Implementation

•A merger of the back-propagation implementation and the PSO implementation

•PSO is used only to evolve weights (not slopes of sigmoid functions)

•BP is used only in recall mode; the outputs are used to evaluate fitness for each particle (candidate set of weights)

Page 103

Evolutionary BP, Cont’d.

•Both BP and PSO start-up and clean-up routines are included

•Length of individual particles is calculated from dimensions in input file

•Particle elements correspond to individual weights

•BP recall is run for each particle after each iteration of PSO to evaluate fitness (error)

•The BP network is the “problem” for PSO to solve
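A sketch of how one particle can be scored; every function name here is a hypothetical placeholder, not a routine from the book's pso_nn software:

/* Hypothetical helpers: load a weight vector into the network and run
   one pattern through the network in recall mode. */
void copy_weights_from_particle(const float *position, int num_weights);
int  bp_recall_classify(const float *patterns, int pattern_index);

/* Fitness of one particle: number of training patterns classified
   correctly when the particle's position is used as the weight set. */
float evaluate_particle(const float *position, int num_weights,
                        const float *patterns, const int *targets,
                        int num_patterns)
{
    int correct = 0;
    copy_weights_from_particle(position, num_weights);
    for (int k = 0; k < num_patterns; k++)
        if (bp_recall_classify(patterns, k) == targets[k])
            correct++;
    return (float)correct;   /* PSO maximizes this value */
}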

Page 104

Main Routine for Evolutionary Back-Prop

#include <stdio.h>
#include <stdlib.h>

static void main_start_up(char *psoDataFile, char *bpDataFile);
static void main_clean_up(void);

int main(int argc, char *argv[])
{
    // check command line
    if (argc != 3)
    {
        printf("Usage: exe_file pso_run_file bp_run_file\n");
        exit(1);
    }

    // initialize
    main_start_up(argv[1], argv[2]);
    PSO_Main_Loop();
    main_clean_up();
    return 0;
}

static void main_start_up(char *psoDataFile, char *bpDataFile)
{
    BP_Start_Up(bpDataFile);
    PSO_Start_Up(psoDataFile);
}

static void main_clean_up(void)
{
    PSO_Clean_Up();
    BP_Clean_Up();
}

Page 105

Running the Evolutionary BP Network Implementation

•Need the executable file pso_nn.exe

•Need two run files, such as pso.run and bp.run

•PSO run file same as for single PSO, except that length of particle not specified

•BP run file is short; only information for recall needed

Example bp.run:
3         # of layers
4         # hidden PEs
150       # patterns
4         # inputs
3         # outputs
iris.dat  data file

Page 106

PSO Run File

1        // num of PSOs
0        // pso_update_pbest_each_cycle_flag
1        // total cycles of running PSOs
1        // optimization type: min or max - max. no. correct
17       // evaluation function - 17 calls BP weights from PSO
1        // inertia weight update method
0        // initialization type: sym/asym
-10.0    // left initialization range
10.0     // right initialization range
5        // maximum velocity
10       // maximum position
200      // max number of generations
30       // population size
0.9      // initial inertia weight
0        // boundary flag
         // boundaries if boundary flag is 1

Page 107

BP_RES.TXT Output File

Weights from inputs to first hidden PE (bias first)
…
Weights from inputs to last hidden PE (bias first)
Weights from first hidden PE to first output PE (bias first)
…
Weights from last hidden PE to last output PE (bias first)

Page 108

BP_RES.TXT Output File

-2.555491 -3.560039  2.198371  8.452043 -0.000573    <- weights to first hidden PE (bias first)
-4.703630  6.440988  8.627151 -3.195024  0.699212
-1.443098 -6.584295  0.430629  2.237892  0.960514
-5.099212 -3.314713  0.362337 -8.708467 -3.981537    <- weights to 4th hidden PE (bias first)
-5.676066  2.128347 -1.152100  5.140296 -3.994824    <- weights to first output PE (bias first)
 4.449585 -2.012187  0.222005 -3.648189 -1.876380
 7.973076  6.194356 -0.598305 -6.768669 -11.408623   <- weights to 3rd output PE (bias first)
