Chapter 3 Mapping Networks
National Yunlin University of Science and Technology, Graduate School of Computer Science and Information Engineering
Chuan-Yu Chang (張傳育), Ph.D. Office: ES 709
TEL: 05-5342601 ext. 4337
E-mail: [email protected]
Chuan-Yu Chang Ph.D.
Introduction
Mapping Networks:
- Associative memory networks
- Feed-forward multilayer perceptrons
- Counter-propagation networks
- Radial basis function networks

[Figure: a mapping network realizes y = ANN(x), associating input vectors x with output vectors y.]
Associative Memory Networks
Memory capability: for an information system, it is important both to remember information and to reduce the amount of information that must be stored.
The stored information must be placed appropriately in the network's memory; that is, given a key input (stimulus), the associative memory retrieves a memorized pattern and produces it as the output in an appropriate form.
Associative Memory Networks (cont.)
In neurobiological systems, the concept of memory is related to the neural changes that result from the interaction between an organism and its environment. Without such changes, no memory exists.
For a memory to be useful it must be accessible to the nervous system, so that learning occurs and information can be retrieved.
A pattern is stored in memory through a learning process.
Associative Memory Networks (cont.)
Depending on the retention time, there are two forms of memory:
- Long-term memory
- Short-term memory
There are two basic types of associative memory:
- Autoassociative memory: a key input vector is associated with itself; the input and output space dimensions are the same.
- Heteroassociative memory: key input vectors are associated with arbitrary memorized vectors; the output space dimension may differ from the input space dimension.
General Linear Distributed Associative Memory
During the learning process, the network is presented with a key input pattern (vector); the memory then transforms this vector into a stored (memorized) pattern.
General Linear Distributed Associative Memory (cont.)
Input (key input pattern):

x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T   (3.1)

Output (memorized pattern):

y_k = [y_{k1}, y_{k2}, \ldots, y_{kn}]^T   (3.2)

For an n-dimensional network structure, h patterns can be associated, where h <= n; in practice h < n.
General Linear Distributed Associative Memory (cont.)
The linear mapping between the key vector x_k and the memorized vector y_k can be written as

y_k = W(k)\,x_k   (3.3)

where W(k) is a weight matrix. From these weight matrices, the memory matrix M can be constructed; it describes the sum of the weight matrices for every input/output pair:

M = \sum_{k=1}^{h} W(k)   (3.4)

or, recursively,

M_k = M_{k-1} + W(k),  k = 1, 2, \ldots, h,  with M_0 = 0   (3.5)

The memory matrix can be thought of as representing the collective experience.
Correlation Matrix Memory
Estimation of the memory matrix by the outer product rule:

\hat{M} = \sum_{k=1}^{h} y_k x_k^T   (3.6)

or, recursively,

\hat{M}_k = \hat{M}_{k-1} + y_k x_k^T,  k = 1, 2, \ldots, h   (3.7)

Each outer product matrix y_k x_k^T is an estimate of the weight matrix W(k) that maps the key input pattern x_k onto the output pattern y_k.
An associative memory designed with the outer product rule is called a correlation matrix memory.
Correlation Matrix Memory (cont.)
When a key pattern is presented, the memory matrix should recall the appropriate memorized pattern. Suppose the estimated memory matrix \hat{M} has learned all h associations via (3.6). Presenting the q-th key input x_q, the recalled output is

y = \hat{M} x_q = \sum_{k=1}^{h} y_k x_k^T x_q = \sum_{k=1}^{h} (x_k^T x_q)\, y_k   (3.8)

Substituting (3.6) into (3.8), this can be rewritten as

y = (x_q^T x_q)\, y_q + \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k   (3.9)

Assume each key input vector has normalized unit length:

x_k^T x_k = 1,  k = 1, 2, \ldots, h   (3.10)
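The outer product rule (3.6) and the recall step (3.8) can be sketched directly in numpy; the function names here are illustrative, not from the text:

```python
import numpy as np

def build_memory(keys, patterns):
    """Correlation matrix memory via the outer product rule (3.6):
    M_hat = sum_k y_k x_k^T, with each key normalized to unit length (3.10)."""
    M = np.zeros((patterns[0].size, keys[0].size))
    for x, y in zip(keys, patterns):
        x = x / np.linalg.norm(x)      # unit-length key, eq. (3.11)
        M += np.outer(y, x)            # outer product rule, eq. (3.6)
    return M

def recall(M, x):
    """Recall a memorized pattern from a key input, eq. (3.8)."""
    return M @ (x / np.linalg.norm(x))

# With mutually orthogonal keys the crosstalk term (3.13) vanishes,
# so recall is exact:
keys = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
pats = [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]
M = build_memory(keys, pats)
print(recall(M, keys[0]))   # → [1. 2.]
```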
Correlation Matrix Memory (cont.)
The normalized unit length can be obtained by dividing each key by its Euclidean norm:

x_k \leftarrow \frac{x_k}{\|x_k\|_2}   (3.11)

Therefore, (3.9) can be rewritten as

y = y_q + \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k = y_q + z_q   (3.12)

where y_q is the desired response and

z_q = \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k   (3.13)

is the noise, or crosstalk, term.
Correlation Matrix Memory (cont.)
From (3.13), when the network is presented with an associated key pattern, y = y_q and z_q = 0; that is, the input x_q recalls its corresponding output y_q exactly. If z_q is nonzero, noise or crosstalk is present.
If the key inputs are mutually orthogonal, the crosstalk is zero and the memorized pattern is recalled perfectly.
If the key inputs are merely linearly independent, Gram-Schmidt orthogonalization can be applied before computing the memory matrix to produce orthogonal key vectors:

g_k = x_k - \sum_{i=1}^{k-1} \frac{g_i^T x_k}{g_i^T g_i}\, g_i,  with g_1 = x_1   (3.14)
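Equation (3.14) can be sketched as follows (a minimal implementation; names are illustrative):

```python
import numpy as np

def gram_schmidt(keys):
    """Gram-Schmidt orthogonalization, eq. (3.14):
    g_k = x_k - sum_i (g_i^T x_k / g_i^T g_i) g_i, with g_1 = x_1."""
    ortho = []
    for x in keys:
        g = x.astype(float).copy()
        for gi in ortho:
            g -= (gi @ x) / (gi @ gi) * gi   # remove the component along g_i
        ortho.append(g)
    return ortho

# Linearly independent but non-orthogonal keys become orthogonal:
g = gram_schmidt([np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])])
print(g[0] @ g[1])   # → 0.0 (up to round-off)
```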
Correlation Matrix Memory (cont.)
The storage capacity of the associative memory is

c = \operatorname{rank}(\hat{M}) \le n

that is, the storage limit depends on the rank of the memory matrix.
Correlation Matrix Memory (cont.)
Example 3.1: Autoassociative memory. The memory is trained with three key vectors:

x_1 = [0.3333, -0.7778, 0.5329]^T,  x_2 = [-0.4444, 0.5556, 0.7027]^T,  x_3 = [-0.4969, -0.6667, 0.5556]^T   (3.15)

Each of these vectors has unit length.
Correlation Matrix Memory (cont.)
The angles between these vectors are

\theta_{12} = \cos^{-1}\!\left(\frac{x_1^T x_2}{\|x_1\|_2 \|x_2\|_2}\right) = 101.9°   (3.16)

\theta_{13} = \cos^{-1}\!\left(\frac{x_1^T x_3}{\|x_1\|_2 \|x_3\|_2}\right) = 49.5°   (3.17)

\theta_{23} = \cos^{-1}\!\left(\frac{x_2^T x_3}{\|x_2\|_2 \|x_3\|_2}\right) = 76.1°   (3.18)

These three vectors are far from being mutually orthogonal.
Correlation Matrix Memory (cont.)
The memory matrix of the autoassociative network is

\hat{M} = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T =
[  0.5555  -0.1749  -0.4107
  -0.1749   1.3582  -0.3945
  -0.4107  -0.3945   1.0865 ]   (3.19)

Using (3.19), the estimates of the input key patterns are

\hat{x}_1 = \hat{M} x_1 = [0.1023, -1.3249, 0.7489]^T,
\hat{x}_2 = \hat{M} x_2 = [-0.6326, 0.5551, 0.7268]^T,
\hat{x}_3 = \hat{M} x_3 = [-0.3876, -1.0378, 1.0707]^T   (3.20)
Correlation Matrix Memory (cont.)
The Euclidean distances of each response vector from the key vectors are

\|\hat{x}_1 - x_1\|_2 = 0.6319,  \|\hat{x}_1 - x_2\|_2 = 1.9589,  \|\hat{x}_1 - x_3\|_2 = 0.9108   (3.21)
\|\hat{x}_2 - x_1\|_2 = 1.6575,  \|\hat{x}_2 - x_2\|_2 = 0.1898,  \|\hat{x}_2 - x_3\|_2 = 1.2412   (3.22)
\|\hat{x}_3 - x_1\|_2 = 0.9363,  \|\hat{x}_3 - x_2\|_2 = 1.6363,  \|\hat{x}_3 - x_3\|_2 = 0.6442   (3.23)

In each case the response has the smallest error with respect to its own key vector.
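The example can be reproduced numerically; this sketch assumes the signed key-vector values consistent with the angles and distances above:

```python
import numpy as np

# Key vectors of Example 3.1 (unit length, far from orthogonal)
x1 = np.array([0.3333, -0.7778, 0.5329])
x2 = np.array([-0.4444, 0.5556, 0.7027])
x3 = np.array([-0.4969, -0.6667, 0.5556])

# Autoassociative correlation matrix memory, eq. (3.19)
M = np.outer(x1, x1) + np.outer(x2, x2) + np.outer(x3, x3)

# Recall each key and measure the distance to every stored key, eqs. (3.20)-(3.23)
for x in (x1, x2, x3):
    xhat = M @ x
    dists = [np.linalg.norm(xhat - xk) for xk in (x1, x2, x3)]
    print([round(d, 4) for d in dists])
# Each recalled vector lies closest to its own key, but the
# crosstalk term (3.13) keeps the recall errors from being zero.
```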
Correlation Matrix Memory (cont.)
Consider another set of key vectors, each of unit length:

x_1 = [0.1309, 0.9779, -0.1629]^T,  x_2 = [-0.7548, -0.0587, -0.6533]^T,  x_3 = [-0.6354, 0.2370, 0.7349]^T   (3.24)

The angles between these vectors are

\theta_{12} = \cos^{-1}\!\left(\frac{x_1^T x_2}{\|x_1\|_2 \|x_2\|_2}\right) = 92.9°   (3.25)

\theta_{13} = \cos^{-1}\!\left(\frac{x_1^T x_3}{\|x_1\|_2 \|x_3\|_2}\right) = 88.3°   (3.26)

\theta_{23} = \cos^{-1}\!\left(\frac{x_2^T x_3}{\|x_2\|_2 \|x_3\|_2}\right) = 90.8°   (3.27)

These vectors are nearly mutually orthogonal.
Correlation Matrix Memory (cont.)
The memory matrix is

\hat{M} = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T =
[ 0.9906  0.0217  0.0048
  0.0217  1.0159  0.0532
  0.0048  0.0532  0.9934 ]   (3.28)

and the estimates of the input key patterns are

\hat{x}_1 = \hat{M} x_1 = [0.1501, 0.9876, -0.1092]^T,
\hat{x}_2 = \hat{M} x_2 = [-0.7521, -0.1108, -0.6558]^T,
\hat{x}_3 = \hat{M} x_3 = [-0.6207, 0.2661, 0.7396]^T   (3.29)
Correlation Matrix Memory (cont.)
The Euclidean distances of each response vector from the key vectors are

\|\hat{x}_1 - x_1\|_2 = 0.0579,  \|\hat{x}_1 - x_2\|_2 = 1.4865,  \|\hat{x}_1 - x_3\|_2 = 1.3758   (3.30)
\|\hat{x}_2 - x_1\|_2 = 1.4859,  \|\hat{x}_2 - x_2\|_2 = 0.0522,  \|\hat{x}_2 - x_3\|_2 = 1.4382   (3.31)
\|\hat{x}_3 - x_1\|_2 = 1.3734,  \|\hat{x}_3 - x_2\|_2 = 1.4365,  \|\hat{x}_3 - x_3\|_2 = 0.0329   (3.32)

Compared with (3.21)-(3.23), the errors are much lower, because the keys are nearly orthogonal.
Correlation Matrix Memory (cont.)
An error correction approach for correlation matrix memories:
The drawback of associative memory is the relatively large number of errors that can occur during recall.
The simplest correlation matrix memory has no provision for correcting errors, because it lacks feedback from the output to the input.
Correlation Matrix Memory (cont.)
The objective of the error correction approach is to have the associative memory reconstruct the memorized pattern in an optimal sense.
The error vector is defined as

e_k = y_k - \hat{M}(k)\,x_k   (3.33)

where y_k is the desired pattern associated with the key input pattern x_k. The discrete-time learning rule based on steepest descent is

\hat{M}(k+1) = \hat{M}(k) - \mu\,\frac{\partial \mathcal{E}(\hat{M})}{\partial \hat{M}}   (3.34)
Correlation Matrix Memory (cont.)
The energy function is defined as

\mathcal{E}(\hat{M}) = \frac{1}{2}\|e_k\|_2^2 = \frac{1}{2}\|y_k - \hat{M}x_k\|_2^2   (3.35)

Computing the gradient of (3.35) gives

\frac{\partial \mathcal{E}(\hat{M})}{\partial \hat{M}} = -(y_k - \hat{M}x_k)\,x_k^T   (3.36)

Substituting (3.36) into (3.34),

\hat{M}(k+1) = \hat{M}(k) + \mu\,[y_k - \hat{M}(k)x_k]\,x_k^T   (3.37)

where the bracketed quantity is the error vector.
Correlation Matrix Memory (cont.)
The bracketed quantity in the second term of (3.37) is the error vector (cf. (3.33)), so the rule has an error-correcting effect.
Expanding (3.37) gives

\hat{M}(k+1) = \hat{M}(k) + \mu\, y_k x_k^T - \mu\,\hat{M}(k)\,x_k x_k^T   (3.38)

Compared with (3.7), there is an additional third term. The supervised learning rule (3.37), based on error correction, is obtained by repeatedly training on each of the h associations

(x_k, y_k),  k = 1, 2, \ldots, h   (3.39)
Correlation Matrix Memory (cont.)
Selecting the learning rate parameter: either a fixed learning rate or a learning rate adjusted over time can be used.
For each association, the iterative adjustments to the memory matrix (3.37) continue until the error vector e_k (3.33) becomes negligibly small.
The initial memory matrix is M(0) = 0. Comparing with (2.29), (3.37) is also an LMS (least-mean-square) rule.
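The error-correcting rule (3.37) can be sketched as a small training loop; the learning rate and epoch count below are illustrative choices:

```python
import numpy as np

def train_error_correction(keys, patterns, mu=0.3, epochs=200):
    """Error-correcting associative memory, eq. (3.37):
    M(k+1) = M(k) + mu * (y_k - M(k) x_k) x_k^T, starting from M(0) = 0."""
    n_out, n_in = patterns[0].size, keys[0].size
    M = np.zeros((n_out, n_in))
    for _ in range(epochs):
        for x, y in zip(keys, patterns):
            e = y - M @ x              # error vector, eq. (3.33)
            M += mu * np.outer(e, x)   # LMS-style correction, eq. (3.37)
    return M

# Linearly independent (but not orthogonal) keys are still learned exactly,
# unlike with the plain outer product rule (3.6):
keys = [np.array([1.0, 1.0]), np.array([1.0, -0.5])]
pats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
M = train_error_correction(keys, pats)
print(np.round(M @ keys[0], 3))   # → approximately [1. 0.]
```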
Backpropagation Learning Algorithm
The backpropagation (BP) algorithm is also known as the generalized delta rule.
Training MLPs with the backpropagation algorithm results in a nonlinear mapping: the MLP can have its synaptic weights adjusted by the BP algorithm to develop a specific nonlinear mapping.
The weights, fixed after the training process, can provide an association task for classification, pattern recognition, diagnosis, etc.
During the training phase of the MLP, the synaptic weights are adjusted to minimize the disparity between the actual and desired outputs of the MLP, averaged over all input patterns.
Backpropagation Learning Algorithm (cont.)
Basic BP algorithm for the feedforward MLP: a 3-layer network with 1 output layer and 2 hidden layers.
Backpropagation Learning Algorithm (cont.)
The standard BP algorithm for training the MLP NN is based on the steepest descent gradient approach applied to the minimization of an energy function representing the instantaneous error:

E_q = \frac{1}{2}(d_q - x_{out}^{(3)})^T (d_q - x_{out}^{(3)}) = \frac{1}{2}\sum_{h=1}^{n_3}(d_{qh} - x_{out,h}^{(3)})^2   (3.40)

where d_q denotes the desired output for the q-th input and x_{out}^{(3)} = y_q is the actual output of the MLP.
Backpropagation Learning Algorithm (cont.)
Using the steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by

\Delta w_{ji}^{(s)} = -\eta^{(s)}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}},  s = 1, 2, 3   (3.41)
Backpropagation Learning Algorithm (cont.)
The weights in the output layer can be updated according to

\Delta w_{ji}^{(3)} = -\eta^{(3)}\,\frac{\partial E_q}{\partial w_{ji}^{(3)}}   (3.42)

Using the chain rule for the partial derivatives, (3.42) can be rewritten as

\Delta w_{ji}^{(3)} = -\eta^{(3)}\,\frac{\partial E_q}{\partial v_j^{(3)}}\,\frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}}   (3.43)
Backpropagation Learning Algorithm (cont.)
Each factor of (3.43) is computed separately:

\frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}} = \frac{\partial}{\partial w_{ji}^{(3)}} \sum_{h=1}^{n_2} w_{jh}^{(3)}\, x_{out,h}^{(2)} = x_{out,i}^{(2)}   (3.44)

\frac{\partial E_q}{\partial v_j^{(3)}} = \frac{\partial}{\partial v_j^{(3)}}\,\frac{1}{2}\sum_{h=1}^{n_3} [d_{qh} - f(v_h^{(3)})]^2 = -[d_{qj} - f(v_j^{(3)})]\, g(v_j^{(3)})   (3.45)

or

-\frac{\partial E_q}{\partial v_j^{(3)}} = (d_{qj} - x_{out,j}^{(3)})\, g(v_j^{(3)}) = \delta_j^{(3)}   (3.46)

where \delta_j^{(3)} is the local error (delta) and g = f' is the derivative of the activation function, with E_q taken from (3.40).
Backpropagation Learning Algorithm (cont.)
Combining (3.43), (3.44), and (3.46), the learning rule for the weights in the output layer of the network is

\Delta w_{ji}^{(3)} = \eta^{(3)}\,\delta_j^{(3)}\, x_{out,i}^{(2)}   (3.47)

or

w_{ji}^{(3)}(k+1) = w_{ji}^{(3)}(k) + \eta^{(3)}\,\delta_j^{(3)}\, x_{out,i}^{(2)}   (3.48)

In the hidden layer, applying the steepest descent gradient approach,

\Delta w_{ji}^{(2)} = -\eta^{(2)}\,\frac{\partial E_q}{\partial w_{ji}^{(2)}} = -\eta^{(2)}\,\frac{\partial E_q}{\partial v_j^{(2)}}\,\frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}}   (3.49)
Backpropagation Learning Algorithm (cont.)
The two partial derivatives on the right side of (3.49) can be expressed as

\frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}} = \frac{\partial}{\partial w_{ji}^{(2)}} \sum_{h=1}^{n_1} w_{jh}^{(2)}\, x_{out,h}^{(1)} = x_{out,i}^{(1)}   (3.50)

Writing the energy function in terms of the hidden-layer outputs,

E_q = \frac{1}{2}\sum_{h=1}^{n_3}\Big[d_{qh} - f\Big(\sum_{p=1}^{n_2} w_{hp}^{(3)}\, x_{out,p}^{(2)}\Big)\Big]^2   (3.51)

\frac{\partial E_q}{\partial v_j^{(2)}} = -\sum_{h=1}^{n_3} (d_{qh} - x_{out,h}^{(3)})\, g(v_h^{(3)})\, w_{hj}^{(3)}\, g(v_j^{(2)}) = -g(v_j^{(2)}) \sum_{h=1}^{n_3} \delta_h^{(3)}\, w_{hj}^{(3)}   (3.52)
Backpropagation Learning Algorithm (cont.)
Combining equations (3.49), (3.50), and (3.52) yields

\Delta w_{ji}^{(2)} = \eta^{(2)}\,\delta_j^{(2)}\, x_{out,i}^{(1)}   (3.53)

or

w_{ji}^{(2)}(k+1) = w_{ji}^{(2)}(k) + \eta^{(2)}\,\delta_j^{(2)}\, x_{out,i}^{(1)}   (3.54)

where \delta_j^{(2)} = g(v_j^{(2)}) \sum_h \delta_h^{(3)} w_{hj}^{(3)}.
Backpropagation Learning Algorithm (cont.)
Generalized weight update form:

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)}\,\delta_j^{(s)}\, x_{out,i}^{(s-1)}   (3.55)

where, for the output layer,

\delta_j^{(s)} = (d_{qj} - x_{out,j}^{(s)})\, g(v_j^{(s)})   (3.56)

and, for the hidden layers,

\delta_j^{(s)} = g(v_j^{(s)}) \sum_{h=1}^{n_{s+1}} \delta_h^{(s+1)}\, w_{hj}^{(s+1)}   (3.57)
Backpropagation Learning Algorithm (cont.) Standard BP algorithm
Step 1: Initialize the network synaptic weights to small random values.
Step 2: From the set of training input/output pair, present an input pattern and calculate the network response.
Step 3: The desired network response is compared with the actual output of the network, and by using (3.56) and (3.57) all the local errors can be computed.
Step 4: The weights of the network are updated according to (3.55).
Step 5: Repeat steps 2 through 4 until the network reaches a predetermined level of accuracy in producing the adequate response for all the training patterns.
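The five steps can be sketched for a one-hidden-layer network (the layer sizes, data, and learning rate below are illustrative, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh_deriv(a):
    return 1.0 - a**2          # g(v) expressed via the output a = tanh(v)

# Step 1: small random weights
W1 = rng.uniform(-0.5, 0.5, (8, 1)); b1 = np.zeros(8)
W2 = rng.uniform(-0.5, 0.5, (1, 8)); b2 = np.zeros(1)
eta = 0.05

# Train y = sin(x) on [0, 3]
X = np.linspace(0, 3, 40); D = np.sin(X)
for epoch in range(3000):
    for x, d in zip(X, D):
        # Step 2: forward pass
        a1 = np.tanh(W1 @ [x] + b1)
        y = W2 @ a1 + b2                      # linear output neuron
        # Step 3: local errors, eqs. (3.56)-(3.57)
        delta2 = d - y                        # output delta (g = 1 for a linear unit)
        delta1 = tanh_deriv(a1) * (W2.T @ delta2)
        # Step 4: weight updates, eq. (3.55)
        W2 += eta * np.outer(delta2, a1); b2 += eta * delta2
        W1 += eta * np.outer(delta1, [x]); b1 += eta * delta1

pred = np.array([(W2 @ np.tanh(W1 @ [x] + b1) + b2)[0] for x in X])
print(round(float(np.mean((pred - D)**2)), 4))   # small MSE after training
```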
Backpropagation Learning Algorithm (cont.)
Some practical issues in using standard BP — initialization of synaptic weights:
The weights are initially set to small random values; if they are set too large, the neurons are likely to saturate.
A heuristic (Ham and Kostanic, 2001): set the weights to uniformly distributed random numbers in the interval from -0.5/fan-in to 0.5/fan-in.
Backpropagation Learning Algorithm (cont.)
Nguyen and Widrow's initialization algorithm (suited to architectures with a single hidden layer). Define:
- n0: number of components in the input layer
- n1: number of neurons in the hidden layer
- \beta: scaling factor
Step 1: Compute the scaling factor according to

\beta = 0.7\, n_1^{1/n_0}   (3.58)

Step 2: Initialize the weights w_{ij} of a layer as random numbers between -0.5 and 0.5.
Step 3: Reinitialize the weights according to

w_{ij} \leftarrow \frac{\beta\, w_{ij}}{\|w_i\|_2}   (3.59)

Step 4: For the i-th neuron in the hidden layer, set the bias to be a random number between -\beta and \beta.
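The four steps can be sketched as follows (function name and interface are illustrative):

```python
import numpy as np

def nguyen_widrow_init(n0, n1, rng=np.random.default_rng()):
    """Nguyen-Widrow initialization for a single hidden layer, eqs. (3.58)-(3.59)."""
    beta = 0.7 * n1 ** (1.0 / n0)                  # step 1: scaling factor, eq. (3.58)
    W = rng.uniform(-0.5, 0.5, (n1, n0))           # step 2: random weights
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = beta * W / norms                           # step 3: rescale each neuron, eq. (3.59)
    b = rng.uniform(-beta, beta, n1)               # step 4: biases in [-beta, beta]
    return W, b

W, b = nguyen_widrow_init(n0=2, n1=10)
print(np.allclose(np.linalg.norm(W, axis=1), 0.7 * 10 ** 0.5))   # → True (each row has norm beta)
```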
Backpropagation Learning Algorithm (cont.)
Network configuration and the ability of the network to generalize:
The configuration of the MLP NN is determined by the number of hidden layers, the number of neurons in each hidden layer, and the activation function used by the neurons.
Network performance depends less on the form of the activation function than on the number of hidden layers and the number of neurons per hidden layer.
The number of hidden-layer neurons is usually determined by trial and error; MLP NNs are commonly designed with two hidden layers.
In general, more hidden-layer neurons can ensure good network performance, but an "over-designed" architecture causes overfitting and the network loses its ability to generalize.
Backpropagation Learning Algorithm (cont.)
Example 3.2: A network with one hidden layer of 50 neurons is trained to approximate the nonlinear function

y = e^{-x} \sin(3x)

(a) Sampling every 0.2 on [0,4], giving 21 points.
(b) Sampling every 0.01 on [0,4], giving 400 points (which causes overfitting).
Backpropagation Learning Algorithm (cont.)
Independent validation:
Assessing the final performance quality of the network with the training data leads to overfitting.
This problem can be avoided with independent validation: split the available data into a training set and a testing set. First randomize the data, then divide it into two parts: the training set, used to update the network's weights, and the testing set, used to evaluate the training performance.
Backpropagation Learning Algorithm (cont.)
Speed of convergence:
The convergence properties depend on the magnitude of the learning rate. To guarantee that the network converges and to avoid oscillation during training, the learning rate must be set to a fairly small value.
If training starts far from the global minimum, many neurons can saturate, making the gradients very small; the search may even get stuck in a local minimum of the error surface, slowing convergence.
Fast algorithms fall into two broad classes:
- Various heuristic improvements to the standard BP algorithm
- Use of standard numerical optimization techniques
Backpropagation Learning Algorithm (cont.)
Backpropagation learning algorithm with momentum updating:
The weights are updated in a direction that is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of the training:

\Delta w_{ji}^{(s)}(k+1) = \eta^{(s)}\,\delta_j^{(s)}(k)\, x_{out,i}^{(s-1)}(k) + \alpha\,\Delta w_{ji}^{(s)}(k)   (3.61)

or

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)}\,\delta_j^{(s)}(k)\, x_{out,i}^{(s-1)}(k) + \alpha\,[w_{ji}^{(s)}(k) - w_{ji}^{(s)}(k-1)]   (3.62)

where the last term is the momentum term.
Backpropagation Learning Algorithm (cont.)
This type of learning improves convergence If the training patterns contain some element of
uncertainly, then updating with momentum provides a sort of low-pass filtering by preventing rapid changes in the direction of the weight updates.
Render the training relatively immune to the presence of outliers or erroneous training pairs.
Increase the rate of weight change, and the speed of convergence is increased.
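A single momentum update per (3.61) can be sketched as follows (the array shapes and values are illustrative):

```python
import numpy as np

def momentum_step(W, dW_prev, delta, x_prev_layer, eta=0.1, alpha=0.9):
    """One weight update with momentum, eq. (3.61):
    dW(k+1) = eta * delta * x^T + alpha * dW(k)."""
    dW = eta * np.outer(delta, x_prev_layer) + alpha * dW_prev
    return W + dW, dW

W = np.zeros((2, 3)); dW = np.zeros_like(W)
delta = np.array([0.5, -0.2]); x = np.array([1.0, 0.0, -1.0])
W, dW = momentum_step(W, dW, delta, x)
print(W[0, 0])   # → 0.05 (eta * delta[0] * x[0] on the first step)
```

On the first step the momentum term is zero, so the update equals the plain gradient step; subsequent calls reuse `dW` so the previous direction is carried forward.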
Backpropagation Learning Algorithm (cont.)
The foregoing points can be expressed with the following update equation:

\Delta w_{ji}^{(s)}(k+1) = -\eta^{(s)} \sum_{t=0}^{k} \alpha^{k-t}\,\frac{\partial E_q(t)}{\partial w_{ji}^{(s)}(t)}   (3.63)

If the network operates in a flat region of the error surface, the gradient does not change, so (3.63) can be rewritten as

\Delta w_{ji}^{(s)}(k+1) = -\eta^{(s)}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}} \sum_{t=0}^{k} \alpha^{k-t} \approx -\frac{\eta^{(s)}}{1-\alpha}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}}   (3.64)

Since the forgetting factor \alpha is always less than 1, the effective learning rate can be defined as

\eta_{eff}^{(s)} = \frac{\eta^{(s)}}{1-\alpha}   (3.65)
Backpropagation Learning Algorithm (cont.)
Batch updating:
Standard BP updates the weights after every input/output training pair. Batch updating instead accumulates the corrections over many training patterns before updating the weights (it can be viewed as averaging the corrections of the individual I/O pairs and then applying the average).
Batch updating has the following advantages:
- It gives a much better estimate of the error surface.
- It provides some inherent low-pass filtering of the training patterns.
- It is suitable for more sophisticated optimization procedures.
Backpropagation Learning Algorithm (cont.)
Search-then-converge method — a heuristic strategy for speeding up BP:
- Search phase: the network is relatively far from the global minimum; the learning rate is kept sufficiently large and relatively constant.
- Converge phase: the network is approaching the global minimum; the learning rate is decreased at each iteration.
Backpropagation Learning Algorithm (cont.)
Since in practice the distance from the global minimum cannot be known, the learning rate can be scheduled by either of the following:

\eta(k) = \eta_0\,\frac{1}{1 + k/k_0}   (3.66)

\eta(k) = \eta_0\,\frac{1 + \frac{c}{\eta_0}\frac{k}{k_0}}{1 + \frac{c}{\eta_0}\frac{k}{k_0} + k_0\left(\frac{k}{k_0}\right)^2}   (3.67)

Typically 1 < c/\eta_0 < 100 and 100 < k_0 < 500. For k << k_0 the learning rate stays approximately \eta_0 (search phase); for k >> k_0 the learning rate decreases in proportion to 1/k to 1/k^2 (converge phase).
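The simpler schedule (3.66) can be sketched as follows (the parameter values are illustrative):

```python
def search_then_converge(k, eta0=0.5, k0=200.0):
    """Search-then-converge schedule, eq. (3.66): eta(k) = eta0 / (1 + k/k0)."""
    return eta0 / (1.0 + k / k0)

# Early iterations keep the rate near eta0; late iterations decay like 1/k
print(round(search_then_converge(0), 3))       # → 0.5
print(round(search_then_converge(10), 3))      # → 0.476
print(round(search_then_converge(20000), 4))   # → 0.005
```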
Backpropagation Learning Algorithm (cont.) Batch Updating with Variable Learning Rate
A simple heuristic strategy to increase the convergence speed of BP with batch update: increase the magnitude of the learning rate if the learning in the previous step has decreased the total error function; if the error function has increased, decrease the learning rate.
The algorithm can be summarized as follows:
- If the error function over the entire training set has decreased, increase the learning rate by multiplying it by a number \rho > 1 (typically \rho = 1.05).
- If the error function has increased by more than some set percentage \xi, decrease the learning rate by multiplying it by a number \sigma < 1 (typically \sigma = 0.7).
- If the error function has increased by less than the percentage \xi, the learning rate remains unchanged.
Applying the variable learning rate to batch updating can significantly speed up convergence, but the algorithm can easily be trapped in a local minimum; a minimum learning rate \eta_{min} can be imposed.
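The three rules can be sketched as a small helper; the 1.05 and 0.7 factors come from the text, while the tolerance and floor values are illustrative assumptions:

```python
def adapt_learning_rate(eta, err, err_prev, up=1.05, down=0.7,
                        tol=1.04, eta_min=1e-4):
    """Variable learning rate for batch BP. `tol` (a 4% increase tolerance)
    and `eta_min` are illustrative choices, not values from the text."""
    if err < err_prev:
        eta *= up                 # error decreased: speed up
    elif err > tol * err_prev:
        eta *= down               # error grew past the tolerance: slow down
    return max(eta, eta_min)      # floor against a vanishing step size

eta = 0.1
eta = adapt_learning_rate(eta, err=0.9, err_prev=1.0)
print(round(eta, 4))   # → 0.105
eta = adapt_learning_rate(eta, err=1.2, err_prev=0.9)
print(round(eta, 4))   # → 0.0735
```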
Backpropagation Learning Algorithm (cont.)
Vector-matrix form of the BP algorithm. From Fig. 3.4, the energy function is defined as

E_q = \frac{1}{2}(d_q - x_{out}^{(3)})^T (d_q - x_{out}^{(3)})   (3.68)

Using the steepest descent approach, the synaptic weight update equation can be written as

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) - \eta^{(s)}\,\frac{\partial E_q}{\partial w_{ij}^{(s)}}   (3.69)
Backpropagation Learning Algorithm (cont.)
Applying the chain rule to (3.69),

\frac{\partial E_q}{\partial w_{ij}^{(s)}} = \frac{\partial E_q}{\partial v_i^{(s)}}\,\frac{\partial v_i^{(s)}}{\partial w_{ij}^{(s)}}   (3.70)

The second factor of (3.70) can be rewritten as

\frac{\partial v_i^{(s)}}{\partial w_{ij}^{(s)}} = \frac{\partial}{\partial w_{ij}^{(s)}} \sum_{h=1}^{n_{s-1}} w_{ih}^{(s)}\, x_{out,h}^{(s-1)} = x_{out,j}^{(s-1)}   (3.71)

The first factor of (3.70) is commonly called the sensitivity:

\delta_i^{(s)} = \frac{\partial E_q}{\partial v_i^{(s)}}   (3.72)
Backpropagation Learning Algorithm (cont.)
Using (3.70) and (3.71), (3.69) can be rewritten as

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) - \eta^{(s)}\,\delta_i^{(s)}\, x_{out,j}^{(s-1)}   (3.73)

or, in vector-matrix form,

W^{(s)}(k+1) = W^{(s)}(k) - \eta^{(s)}\, D^{(s)}\,(x_{out}^{(s-1)})^T   (3.74)

where D^{(s)} represents the sensitivity vector of the s-th layer:

D^{(s)} = \left[\frac{\partial E_q}{\partial v_1^{(s)}}, \frac{\partial E_q}{\partial v_2^{(s)}}, \ldots, \frac{\partial E_q}{\partial v_{n_s}^{(s)}}\right]^T   (3.75)
Backpropagation Learning Algorithm (cont.)
Calculation of the sensitivities. For the output layer,

\frac{\partial E_q}{\partial v_i^{(3)}} = \frac{\partial}{\partial v_i^{(3)}}\,\frac{1}{2}\sum_{h=1}^{n_3}[d_{qh} - f(v_h^{(3)})]^2 = -(d_{qi} - x_{out,i}^{(3)})\, g(v_i^{(3)})   (3.76)

where g(x) = f'(x) is the derivative of the activation function.
Backpropagation Learning Algorithm (cont.)
Substituting (3.76) into (3.75), the sensitivity vector of the output layer can be expressed as

D^{(3)} = -G'(v^{(3)})\,(d_q - x_{out}^{(3)})   (3.77)

Using the definition (3.75) and the chain rule, the sensitivity vector of the second layer can be expressed as a function of the sensitivity vector of the output layer:

D^{(2)} = \left(\frac{\partial v^{(3)}}{\partial v^{(2)}}\right)^T D^{(3)}   (3.78)
Backpropagation Learning Algorithm (cont.)
The sensitivity in (3.78) involves the Jacobian matrix

\frac{\partial v^{(3)}}{\partial v^{(2)}} =
[ \partial v_1^{(3)}/\partial v_1^{(2)}  \cdots  \partial v_1^{(3)}/\partial v_{n_2}^{(2)}
  \vdots                                  \ddots  \vdots
  \partial v_{n_3}^{(3)}/\partial v_1^{(2)}  \cdots  \partial v_{n_3}^{(3)}/\partial v_{n_2}^{(2)} ]   (3.79)

Similarly, the sensitivity vector of the first layer can be expressed as

D^{(1)} = \left(\frac{\partial v^{(2)}}{\partial v^{(1)}}\right)^T D^{(2)}   (3.80)
Backpropagation Learning Algorithm (cont.)
The elements of the Jacobian matrix in (3.79) are

\frac{\partial v_i^{(3)}}{\partial v_j^{(2)}} = \frac{\partial}{\partial v_j^{(2)}} \sum_{h=1}^{n_2} w_{ih}^{(3)}\, x_{out,h}^{(2)} = w_{ij}^{(3)}\,\frac{\partial f(v_j^{(2)})}{\partial v_j^{(2)}} = w_{ij}^{(3)}\, g(v_j^{(2)})   (3.81)

Substituting (3.81) into (3.79), the Jacobian matrix can be rewritten as

\frac{\partial v^{(3)}}{\partial v^{(2)}} = W^{(3)}\, G'(v^{(2)})   (3.82)

where

G'(v^{(2)}) = \operatorname{diag}[\,g(v_1^{(2)}),\, g(v_2^{(2)}),\, \ldots,\, g(v_{n_2}^{(2)})\,]
Backpropagation Learning Algorithm (cont.)
Finally, combining (3.82) and (3.78) gives

D^{(2)} = G'(v^{(2)})\,(W^{(3)})^T D^{(3)}   (3.83)

and similarly the sensitivity of the first layer can be expressed as

D^{(1)} = G'(v^{(1)})\,(W^{(2)})^T D^{(2)}   (3.84)
Backpropagation Learning Algorithm (cont.)
Vector-matrix form of the BP algorithm:
Step 1: Present an input pattern and calculate the outputs of the network and all the internal layers.
Step 2: For each of the layers, calculate the sensitivity vector according to

D^{(s)} = -G'(v^{(s)})\,(d_q - x_{out}^{(s)})   for the output layer
D^{(s)} = G'(v^{(s)})\,(W^{(s+1)})^T D^{(s+1)}   for all hidden layers

Step 3: Update the synaptic weights of the network according to

W^{(s)}(k+1) = W^{(s)}(k) - \eta^{(s)}\, D^{(s)}\,(x_{out}^{(s-1)})^T

Continue steps 1 through 3 until the network reaches the desired mapping accuracy.
Accelerated Learning BP Algorithm
Conjugate gradient BP for the feedforward multilayer perceptron:
The conjugate gradient method is a well-known numerical technique for solving various optimization problems; it can be viewed as a compromise between steepest descent and Newton's method.
Assume that near the solution, the error function of all the weights in the network can be accurately approximated by a quadratic:

J(w) = \frac{1}{2P}\sum_{p=1}^{P}\sum_{h=1}^{n_s}(d_{ph} - y_{ph})^2 \approx \frac{1}{2}\, w^T Q\, w + b^T w   (3.85)

where P is the number of training patterns, n_s is the number of output-layer neurons, and d_{ph} is the desired output of the h-th neuron for the p-th training pattern.
Accelerated Learning BP Algorithm (cont.)
In (3.85), Q is the square matrix of second derivatives, the Hessian matrix.
The conjugate gradient algorithm attempts to find conjugate directions on the error surface and updates the weights along those directions.
Since computing the Hessian matrix Q is quite involved, most conjugate gradient algorithms try to find conjugate directions without explicitly computing the Hessian: a set of normal equations is constructed for each neuron, and the conjugate gradient method is then used to solve these normal equations iteratively.
Accelerated Learning BP Algorithm (cont.)
Normal equations for the linear combiner: the output of the i-th linear combiner in the s-th layer of the MLP NN can be expressed as

v_i^{(s)} = (w_i^{(s)})^T x_{out}^{(s-1)}   (3.86)

Assuming the desired output of each combiner is known, the goal of learning is to minimize the squared-error cost function

J_i^{(s)} = \frac{1}{2}\sum_{q=1}^{M}(d_{qi}^{(s)} - v_{qi}^{(s)})^2   (3.87)

where M is the number of training patterns.
Accelerated Learning BP Algorithm (cont.)
Substituting (3.86) into (3.87) gives

J_i^{(s)} = \frac{1}{2}\sum_{q=1}^{M}\big(d_{qi}^{(s)} - (w_i^{(s)})^T x_{out,q}^{(s-1)}\big)^2   (3.88)

Taking the partial derivative of J with respect to w and setting it to zero,

\frac{\partial J_i^{(s)}}{\partial w_i^{(s)}} = -\sum_{q=1}^{M}\big(d_{qi}^{(s)}\, x_{out,q}^{(s-1)} - x_{out,q}^{(s-1)} (x_{out,q}^{(s-1)})^T w_i^{(s)}\big) = 0   (3.89)

Define

C_i^{(s)} = \sum_{q=1}^{M} x_{out,q}^{(s-1)}\,(x_{out,q}^{(s-1)})^T   (3.90)

p_i^{(s)} = \sum_{q=1}^{M} d_{qi}^{(s)}\, x_{out,q}^{(s-1)}   (3.91)
Accelerated Learning BP Algorithm (cont.)
Equation (3.89) can then be rewritten as

C_i^{(s)}\, w_i^{(s)} = p_i^{(s)}   (3.92)

C_i^{(s)} can be viewed as an estimate of the covariance matrix of the input vectors to the s-th layer, and p_i^{(s)} as the cross-correlation vector between the layer's inputs and the desired output of the linear combiner.
Training the network can be viewed as a process that includes solving these normal equations, so that the difference between the desired and actual outputs goes to zero.
Backpropagation Learning Algorithm (cont.)
BP with adaptive slopes of the activation function:
For standard BP, the rate of the update is proportional to the derivative of the nonlinear activation function.
See Fig. 3.6: during training, the output of a neuron can enter the saturation region of the activation function, where the derivative is very small, making learning very slow. Remedies:
- Reduce the slope of the activation function to shrink the saturation region.
- Use an optimal activation-function slope that balances the speed of the network training and its mapping capability.
Backpropagation Learning Algorithm (cont.)
[Figure: (a) a typical MLP NN activation function; (b) the derivative of the activation function in (a).]
Backpropagation Learning Algorithm (cont.)
The same minimization performance criterion as in (3.40) is used:

E_q = \frac{1}{2}(d_q - x_{out}^{(s)})^T(d_q - x_{out}^{(s)}) = \frac{1}{2}\sum_{h=1}^{n_s}(d_{qh} - x_{out,h}^{(s)})^2   (3.113)

where s denotes the network layer, d the desired output, and x the actual output. The sigmoid activation function with adjustable slope is

f(\lambda, v) = \frac{1 - \exp(-\lambda v)}{1 + \exp(-\lambda v)}   (3.114)

where v is the input and \lambda the slope.
Backpropagation Learning Algorithm (cont.)
The slope update rule for the i-th neuron in layer s is

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) - \eta_\lambda^{(s)}\,\frac{\partial E_q}{\partial \lambda_i^{(s)}}   (3.115)

Using the chain rule, the derivative in (3.115) can be rewritten as

\frac{\partial E_q}{\partial \lambda_i^{(s)}} = \frac{\partial E_q}{\partial x_{out,i}^{(s)}}\,\frac{\partial x_{out,i}^{(s)}}{\partial \lambda_i^{(s)}} = \frac{\partial E_q}{\partial x_{out,i}^{(s)}}\,\frac{\partial f(\lambda_i^{(s)}, v_i^{(s)})}{\partial \lambda_i^{(s)}}   (3.116)
Backpropagation Learning Algorithm (cont.)
For the activation function (3.114),

\frac{\partial f(\lambda, v)}{\partial v} = \frac{\lambda}{2}\,[1 - f^2(\lambda, v)]   (3.117)

\frac{\partial f(\lambda, v)}{\partial \lambda} = \frac{v}{2}\,[1 - f^2(\lambda, v)]   (3.118)

Substituting (3.116)-(3.118) into (3.115), the dynamic slope update equation of the activation function becomes

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) + \eta_\lambda^{(s)}\,\delta_i^{(s)}\,\frac{v_i^{(s)}}{\lambda_i^{(s)}}   (3.119)
Backpropagation algorithm with adaptive activation function slopes
Step 1: Initialize the weights in the network according to the standard initialization procedure.
Step 2: From the set of training input/output pairs, present an input pattern and calculate the network response.
Step 3: Compare the desired network response with the actual output of the network, and compute the local errors using

\delta_i^{(s)} = (d_{qi} - x_{out,i}^{(s)})\, g(v_i^{(s)})   for the output layer
\delta_i^{(s)} = g(v_i^{(s)}) \sum_{h=1}^{n_{s+1}} \delta_h^{(s+1)}\, w_{hi}^{(s+1)}   for hidden layers
Backpropagation algorithm with adaptive activation function slopes (cont.)
Step 4: Update the weights of the network according to

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) + \eta^{(s)}\,\delta_i^{(s)}\, x_{out,j}^{(s-1)}

Step 5: Update the slopes of the activation functions according to

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) + \eta_\lambda^{(s)}\,\delta_i^{(s)}\,\frac{v_i^{(s)}}{\lambda_i^{(s)}} + \alpha\,[\lambda_i^{(s)}(k) - \lambda_i^{(s)}(k-1)]

where the last term is a momentum term.
Step 6: Stop if the network has converged; else go back to Step 2.
Levenberg-Marquardt Algorithm
The Levenberg-Marquardt BP (LMBP) algorithm represents a simplified version of Newton's method applied to the problem of training MLP NNs.
Summary of Newton's optimization algorithm — finding a vector w that minimizes a given energy function E(w). According to Newton's method, an iterative procedure accomplishes this minimization with the following steps:
Step 1: Initialize the components of the vector w to some random values.
Step 2: Update the vector w according to

w(k+1) = w(k) - H_k^{-1} g_k   (3.120)
Levenberg-Marquardt Algorithm (cont.)
The matrix H_k^{-1} represents the inverse of the Hessian matrix, given by

H =
[ \partial^2 E/\partial w_1^2              \partial^2 E/\partial w_1 \partial w_2   \cdots  \partial^2 E/\partial w_1 \partial w_N
  \partial^2 E/\partial w_2 \partial w_1   \partial^2 E/\partial w_2^2              \cdots  \partial^2 E/\partial w_2 \partial w_N
  \vdots                                   \vdots                                   \ddots  \vdots
  \partial^2 E/\partial w_N \partial w_1   \partial^2 E/\partial w_N \partial w_2   \cdots  \partial^2 E/\partial w_N^2 ]   (3.121)

The vector g_k represents the gradient of the energy function, computed as

g = \left[\frac{\partial E(w)}{\partial w_1}, \frac{\partial E(w)}{\partial w_2}, \ldots, \frac{\partial E(w)}{\partial w_N}\right]^T   (3.122)
Levenberg-Marquardt Algorithm (cont.)
However, (3.120) has a critical drawback: it requires inverting the Hessian matrix. The LMBP algorithm therefore provides a practical approximation.
For the MLP of Fig. 3.4, training can be viewed as the search for the weights that minimize the error between the desired and actual outputs over all patterns. With a finite number of patterns, the energy function can be written as

E(w) = \frac{1}{2}\sum_{q=1}^{Q}(d_q - x_{out,q}^{(3)})^T(d_q - x_{out,q}^{(3)}) = \frac{1}{2}\sum_{q=1}^{Q}\sum_{h=1}^{n_3}(d_{qh} - x_{out,qh}^{(3)})^2   (3.123)

where Q is the total number of training patterns, w is the vector of all weights, d_q is the desired output, and x_{out,q}^{(3)} is the actual output of the network for the q-th training pattern.
Levenberg-Marquardt Algorithm (cont.)
According to Newton's method, the optimal weights that minimize the energy function are obtained from

w(k+1) = w(k) - H_k^{-1} g_k   (3.124)

where

H_k = \nabla^2 E(w)\big|_{w=w(k)}   (3.125)
g_k = \nabla E(w)\big|_{w=w(k)}   (3.126)

Letting P = n_3 Q, (3.123) can be written as

E(w) = \frac{1}{2}\sum_{p=1}^{P}(d_p - x_{out,p}^{(3)})^2 = \frac{1}{2}\sum_{p=1}^{P}e_p^2   (3.127)

where

e_p = d_p - x_{out,p}^{(3)}   (3.128)
Levenberg-Marquardt Algorithm (cont.)
The gradient (3.126) can be obtained as

g = \nabla E(w) = \left[\sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_1},\ \sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_2},\ \ldots,\ \sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_N}\right]^T = J^T e   (3.129)

where J is the Jacobian matrix

J =
[ \partial e_1/\partial w_1   \partial e_1/\partial w_2   \cdots  \partial e_1/\partial w_N
  \partial e_2/\partial w_1   \partial e_2/\partial w_2   \cdots  \partial e_2/\partial w_N
  \vdots                      \vdots                      \ddots  \vdots
  \partial e_P/\partial w_1   \partial e_P/\partial w_2   \cdots  \partial e_P/\partial w_N ]   (3.130)
Levenberg-Marquardt Algorithm (cont.)
The (k, j)-th element of the Hessian matrix can be expressed as

\frac{\partial^2 E(w)}{\partial w_k \partial w_j} = \sum_{p=1}^{P}\left(\frac{\partial e_p}{\partial w_k}\frac{\partial e_p}{\partial w_j} + e_p\,\frac{\partial^2 e_p}{\partial w_k \partial w_j}\right)   (3.131)

Using the Jacobian form (3.130), the Hessian matrix can be expressed as

\nabla^2 E(w) = J^T J + S   (3.132)

where S is a matrix of second derivatives:

S = \sum_{p=1}^{P} e_p\,\nabla^2 e_p   (3.133)

Near a minimum of the energy function, the elements of S become very small, so the Hessian matrix can be approximated as

H \approx J^T J   (3.134)
Levenberg-Marquardt Algorithm (cont.)
Substituting (3.129) and (3.134) into Newton's method (3.124) gives

w(k+1) = w(k) - (J_k^T J_k)^{-1} J_k^T e_k   (3.135)

Equation (3.135) can fail when inverting H = J^T J, so (3.134) is modified slightly:

H = J^T J + \mu I   (3.136)

where \mu is a small value and I is the identity matrix. Substituting (3.136) into (3.135),

w(k+1) = w(k) - (J_k^T J_k + \mu I)^{-1} J_k^T e_k   (3.137)
Levenberg-Marquardt Algorithm (cont.)
Equation (3.137) is in fact a transition between steepest descent and Newton's method:
- When \mu is very small, (3.137) approaches Newton's method (3.135).
- When \mu grows large, the second term inside the parentheses dominates, and (3.137) can be rewritten as

w(k+1) \approx w(k) - \frac{1}{\mu} J_k^T e_k   (3.138)

Defining \eta_k = 1/\mu_k and using (3.129), (3.138) can be rewritten in the form of steepest descent:

w(k+1) = w(k) - \eta_k\, g_k   (3.139)
Levenberg-Marquardt Algorithm (cont.)
The main difficulty in implementing the LMBP algorithm is the computation of the Jacobian matrix J(w), whose elements are

J_{i,j} = \frac{\partial e_i}{\partial w_j}   (3.140)

The derivative in (3.140) can be approximated by the finite difference

J_{i,j} \approx \frac{\Delta e_i}{\Delta w_j}   (3.141)

where \Delta e_i denotes the change in the output error caused by a small change \Delta w_j of the weight.
Levenberg-Marquardt Algorithm (cont.) Levenberg-Marquardt BP Algorithm
Step 1: Initialize the network weights to small random values. Set the learning rate parameter.
Step 2: Present an input pattern, and calculate the output of the network.
Step 3: Use (3.141) to calculate the elements of the Jacobian matrix associated with the input/output pairs.
Step 4: When the last input/output pair is presented, use (3.137) to perform the update of the weights.
Step 5: Stop if the network has converged; else go back to Step 2.
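A single LM update per (3.137), with the finite-difference Jacobian of (3.141), can be sketched as follows; `residuals`, `mu`, and `dw` are illustrative names and values:

```python
import numpy as np

def lm_step(w, residuals, mu=1e-2, dw=1e-6):
    """One Levenberg-Marquardt update, eq. (3.137), with the Jacobian
    approximated by finite differences as in (3.141).
    `residuals(w)` must return the error vector e(w)."""
    e = residuals(w)
    J = np.zeros((e.size, w.size))
    for j in range(w.size):                 # eq. (3.141): J_ij ≈ Δe_i / Δw_j
        wp = w.copy(); wp[j] += dw
        J[:, j] = (residuals(wp) - e) / dw
    A = J.T @ J + mu * np.eye(w.size)       # eq. (3.136)
    return w - np.linalg.solve(A, J.T @ e)  # eq. (3.137)

# Fit y = a*x + b to noiseless data; a few steps nearly solve the problem
x = np.linspace(0, 1, 10); y = 2.0 * x + 1.0
res = lambda w: (w[0] * x + w[1]) - y
w = np.array([0.0, 0.0])
for _ in range(5):
    w = lm_step(w, res)
print(np.round(w, 3))   # → close to [2. 1.]
```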
Levenberg-Marquardt Algorithm (cont.) Additional remarks:
The weight update approach presented here is a batch version of the LMBP algorithm.
The learning rate parameter k can be modified dynamically during the learning process.
The Jacobian matrix used in the update equation need not be computed for the entire set of input/output training pairs.
Counterpropagation
Counterpropagation networks were proposed by Hecht-Nielsen (1987). A counterpropagation network functions as a self-programming, near-optimal lookup table, providing a bidirectional mapping between the input and output training patterns.
Counterpropagation networks converge faster than MLPs, but they require more neurons than MLPs.
There are two counterpropagation architectures:
- Forward-only counterpropagation networks
- Full counterpropagation networks
Counterpropagation (cont.)
A forward-only counterpropagation network consists of an input layer, a hidden layer, and an output layer.
The two sets of weights are adjusted by different learning algorithms:
- Kohonen's self-organizing learning rule
- Grossberg's learning rule
Counterpropagation (cont.)
During training, the desired patterns must be provided; the Kohonen and Grossberg layer weights are trained separately.
First, compute the distance between the input vector and the weights connecting the input layer to each hidden neuron:

z_j = dist(x, w_j) = \|x - w_j\|_2^2 = \sum_{k=1}^{n}(x_k - w_{jk})^2   (3.142)

The output of the hidden layer is then determined by the condition

z_i = 1 if i is the smallest integer for which z_i \le z_j for all j; otherwise z_i = 0   (3.143)

(the neuron whose weights are closest to the input pattern outputs 1; all the others output 0).
Counterpropagation (cont.)
Next, the weights are updated according to Kohonen's self-organizing learning rule (only the winner neuron is updated, the other neurons' weights are left unchanged, and the learning rate is decreased over time):

w_j(k+1) = [1 - \alpha(k)]\,w_j(k) + \alpha(k)\,x   (3.144)

The output-layer weights are updated according to Grossberg's learning rule (only the weights connected to the winning hidden neuron are updated):

u_{ji}(k+1) = u_{ji}(k) + \beta(k)\,[y_j - u_{ji}(k)]\,z_i   (3.145)
Counterpropagation (cont.)
Algorithm for training counterpropagation networks:
Step 1: Determine the number of clusters N (the number of hidden-layer neurons), and assign random initial weight values.
Step 2: Present an input pattern x and its corresponding output y.
Step 3: Compute the distances between the input pattern and the input weights according to (3.142).
Step 4: Compute the output of the hidden layer according to (3.143).
Step 5: Update the Kohonen-layer weights according to (3.144).
Step 6: Update the Grossberg-layer weights according to (3.145).
Step 7: Stop if the network has converged; otherwise go back to Step 2.
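The training loop can be sketched as follows; the hyperparameter values and function names are illustrative choices, and the z_i factor of (3.145) is implemented by updating only the winner's column:

```python
import numpy as np

def train_cpn(X, Y, n_hidden=10, epochs=60, alpha0=0.95, beta=0.1, seed=0):
    """Forward-only counterpropagation sketch, eqs. (3.142)-(3.145)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(X.min(), X.max(), (n_hidden, X.shape[1]))  # Kohonen layer
    U = rng.uniform(-0.5, 0.5, (Y.shape[1], n_hidden))         # Grossberg layer
    for k in range(epochs):
        alpha = alpha0 * np.exp(-k / 10.0)       # decaying learning rate
        for x, y in zip(X, Y):
            i = int(np.argmin(((x - W) ** 2).sum(axis=1)))  # winner, eqs. (3.142)-(3.143)
            W[i] = (1 - alpha) * W[i] + alpha * x           # Kohonen update, eq. (3.144)
            U[:, i] += beta * (y - U[:, i])                 # Grossberg update, eq. (3.145)
    return W, U

def predict(W, U, x):
    i = int(np.argmin(((x - W) ** 2).sum(axis=1)))
    return U[:, i]

# Approximate y = 1/(x+1) on [0, 4] as in Example 3.3
X = np.linspace(0, 4, 41).reshape(-1, 1)
Y = 1.0 / (X + 1.0)
W, U = train_cpn(X, Y)
print(float(predict(W, U, np.array([1.0]))[0]))   # roughly 1/(1+1) = 0.5
```

The output behaves like a quantized lookup table: accuracy is limited by the number of hidden (cluster) neurons, which is why counterpropagation needs more neurons than an MLP for the same mapping.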
Counterpropagation (cont.)
Example 3.3: Designing a neural network to approximate the function y = 1/(x+1) in the interval [0,4].
A forward-only counterpropagation network with 20 neurons in the hidden layer is used. The learning rate is set to 0.95 and decreased according to

\alpha(k) = \alpha(0)\, e^{-k/10}

The network is trained for 50 epochs.
Counterpropagation (cont.)
Full counterpropagation:
Forward-only counterpropagation networks provide training and mapping in one direction only (given an input, the corresponding output is obtained).
Full counterpropagation provides bidirectional training and mapping (given an input, the corresponding output can be obtained, and vice versa).
Full counterpropagation networks have four sets of weights:
- The weights between the input layers (x and y) and the cluster layer are trained with Kohonen's rule.
- The weights between the cluster layer and the two output layers are trained with Grossberg's rule.
Algorithm for training full counterpropagation networks
Step 1: Determine the number of clusters N (the number of hidden-layer neurons) and assign random initial weights: W and V are initialized between the minimum and maximum of the x inputs; U and T between the minimum and maximum of the y inputs.
Step 2: Compute the distance of the pair (x, y) to each cluster unit:

z_j = \sum_{k=1}^{n}(x_k - w_{ik})^2 + \sum_{k=1}^{m}(y_k - u_{ik})^2

Step 3: Compute the output of the hidden layer:

z_i = 1 if i is the smallest integer for which z_i \le z_j for all j; otherwise z_i = 0

Step 4: Update the Kohonen-layer weights:

w_i(k+1) = [1 - \alpha_x(k)]\,w_i(k) + \alpha_x(k)\,x
u_i(k+1) = [1 - \alpha_y(k)]\,u_i(k) + \alpha_y(k)\,y

Step 5: Update the Grossberg-layer weights:

v_{ji}(k+1) = v_{ji}(k) + \beta_x\,[x_j - v_{ji}(k)]\,z_i
t_{ji}(k+1) = t_{ji}(k) + \beta_y\,[y_j - t_{ji}(k)]\,z_i

Step 6: Stop if the network has converged; otherwise go back to Step 2.
Radial Basis Function Neural Networks
RBF networks model the locally tuned response of cortical neurons and possess good mapping capability; they are trained to perform a nonlinear mapping between input and output vector spaces.
The RBFNN constructs the network by function approximation. An RBF NN consists of three layers of neurons: an input layer, a hidden layer (also called the nonlinear processing layer), and an output layer.
Radial Basis Function Neural Networks (cont.)
The output of the RBF NN is defined as

y_i = f_i(x) = \sum_{k=1}^{N} w_{ik}\,\varphi(\|x - c_k\|_2),  i = 1, 2, \ldots, m   (3.146)

where N is the number of hidden-layer neurons, x is the input vector, the c_k are the RBF centers in the input vector space, and the w_{ik} are the output-layer weights.
The basis function \varphi(\cdot) can take many forms:
- linear function: \varphi(x) = x
- cubic approximation: \varphi(x) = x^3
- thin-plate-spline function: \varphi(x) = x^2 \ln x
- Gaussian function (the most commonly used): \varphi(x) = \exp(-x^2/\sigma^2)
- multiquadratic function: \varphi(x) = \sqrt{x^2 + \sigma^2}
- inverse multiquadratic function: \varphi(x) = 1/\sqrt{x^2 + \sigma^2}
Radial Basis Function Neural Networks (cont.)
In the activation functions above, \sigma controls the width of the RBF and is called the spread parameter; the center points c_k are defined points that perform an appropriate sampling of the input vector space.
Each neuron in the hidden layer computes the Euclidean distance between the input and its corresponding center, passes this distance through the basis function, and takes the result as the hidden-layer output.
Finally, these outputs are multiplied by the corresponding weights and summed to give the output y_i. Since the output-layer neurons do not interact, y_1 ~ y_m can be viewed as multiple single-output networks sharing the same hidden layer.
Radial Basis Function Neural Networks (cont.)
Training the RBF NN with Fixed Centers
Two parameters govern the mapping properties of the RBF NN:
- the weights between the hidden and output layers, w_{ik}
- the centers of the radial basis functions, c_k
Broomhead suggested randomly selecting a set of input patterns as the centers. Enough centers should be chosen according to the PDF of the training patterns, but it is difficult to quantify how many centers are appropriate. One can therefore begin with a fairly large number of centers and then use a systematic method to remove those centers whose removal does not degrade the mapping ability of the network.
Once the N centers have been selected, the network output can be expressed as

$$\hat{y}(q)=\sum_{k=1}^{N}w_{k}\,K\!\left(\lVert x(q)-c_k\rVert\right),\qquad q=1,2,\ldots,Q \tag{3.147}$$

where Q is the total number of input patterns.
Radial Basis Function Neural Networks (cont.)
Rewriting (3.147) in vector form,

$$\begin{bmatrix}\hat{y}(1)\\\hat{y}(2)\\\vdots\\\hat{y}(Q)\end{bmatrix}=\begin{bmatrix}K(\lVert x(1)-c_1\rVert)&K(\lVert x(1)-c_2\rVert)&\cdots&K(\lVert x(1)-c_N\rVert)\\K(\lVert x(2)-c_1\rVert)&K(\lVert x(2)-c_2\rVert)&\cdots&K(\lVert x(2)-c_N\rVert)\\\vdots&\vdots&&\vdots\\K(\lVert x(Q)-c_1\rVert)&K(\lVert x(Q)-c_2\rVert)&\cdots&K(\lVert x(Q)-c_N\rVert)\end{bmatrix}\begin{bmatrix}w_1\\w_2\\\vdots\\w_N\end{bmatrix} \tag{3.148}$$

or

$$\hat{y}=\Phi w \tag{3.149}$$

where y-hat is the actual network output, w holds the output-layer weights, and entry (q, n) of Phi is the kernel of the Euclidean distance between the q-th input and the n-th center.
Radial Basis Function Neural Networks (cont.)
Because the centers c_k are fixed, the mapping formed in the hidden layer is also fixed, so training the network reduces to adjusting the output weights. The MSE between the actual and desired outputs can then be written as

$$J(w)=\frac{1}{2}\sum_{q=1}^{Q}\left[y_d(q)-\hat{y}(q)\right]^2=\frac{1}{2}\left(y_d-\hat{y}\right)^{T}\left(y_d-\hat{y}\right) \tag{3.150}$$

Substituting (3.149) into (3.150),

$$J(w)=\frac{1}{2}\left(y_d^{T}y_d-2\,y_d^{T}\Phi w+w^{T}\Phi^{T}\Phi w\right) \tag{3.151}$$
Radial Basis Function Neural Networks (cont.)
Minimizing (3.151) to solve for w:

$$\frac{\partial J(w)}{\partial w}=0 \tag{3.152}$$

$$\Phi^{T}y_d-\Phi^{T}\Phi w=0 \tag{3.153}$$

$$w=\left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}y_d=\Phi^{+}y_d \tag{3.154}$$

where Phi^+ is the pseudoinverse of Phi. If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired network outputs is small.
Radial Basis Function Neural Networks (cont.)
From (3.154) it can be seen that, with the network centers fixed, training yields a closed-form solution, which means the RBF NN trains much faster than BP and counterpropagation networks.
If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired outputs is small; in fact, if (3.154) is used, the error in (3.151) becomes zero.
Radial Basis Function Neural Networks (cont.) Example 3.4
Train an RBF NN to approximate the nonlinear function

$$y=e^{-x}\sin(3x)$$

over the interval [0, 4], with 21 neurons in the hidden layer.
Fig. (a): training: sampling interval 0.2, so the hidden layer has 21 neurons (the centers are these 21 sample points; Gaussian RBF, sigma = 0.2).
Fig. (b): testing: sampling interval 0.01, i.e. 401 sample points.
There is slight overfitting, but the result is already much better than BP.
Radial Basis Function Neural Networks (cont.)
Setting the spread parameter
The spread parameter sigma is usually obtained with the following heuristic:

$$\sigma=\frac{d_{\max}}{\sqrt{2K}} \tag{3.155}$$

where d_max is the largest Euclidean distance between the selected centers and K is the number of centers. Using (3.155), the radial basis function of a hidden-layer neuron can be written as

$$K\!\left(\lVert x-c_k\rVert\right)=\exp\!\left(-\frac{K}{d_{\max}^{2}}\,\lVert x-c_k\rVert^{2}\right) \tag{3.156}$$
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with fixed centers
Step 1: Choose the centers for the RBF functions.
Step 2: Calculate the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network.
Step 4: Calculate the output of the neural network according to (3.149).
Step 5: Solve for the network weights using (3.154).
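The five steps can be sketched as follows, assuming a single output, Gaussian kernels as in (3.146), the spread heuristic (3.155), and the pseudoinverse solution (3.154). The function names are hypothetical; the usage borrows the target function of Example 3.4.

```python
import numpy as np

def train_rbf_fixed_centers(X, yd, centers):
    """Steps 1-5: given fixed centers, set sigma by (3.155), solve w by (3.154)."""
    K = len(centers)
    # Step 2: sigma = d_max / sqrt(2K), d_max = largest distance between centers
    dmax = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = dmax / np.sqrt(2 * K)
    # Step 4: the Q x N matrix Phi of (3.148), Gaussian kernel
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.exp(-(r ** 2) / sigma ** 2)
    # Step 5: w = pseudoinverse(Phi) @ yd  -- Eq. (3.154)
    w = np.linalg.pinv(Phi) @ yd
    return w, sigma

def rbf_predict(X, centers, w, sigma):
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(r ** 2) / sigma ** 2) @ w

# Example 3.4 setup: y = exp(-x) sin(3x), sampled at interval 0.2 on [0, 4]
X = np.linspace(0.0, 4.0, 21).reshape(-1, 1)
yd = np.exp(-X[:, 0]) * np.sin(3 * X[:, 0])
w, sigma = train_rbf_fixed_centers(X, yd, centers=X)
```

With the centers equal to all 21 training samples (N = Q), the training error is essentially zero, matching the remark after (3.154); note that Example 3.4 in the slides fixes sigma = 0.2 directly rather than using the heuristic.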
Radial Basis Function Neural Networks (cont.)
Training the RBF NN using the Stochastic Gradient Approach
With fixed centers, only the output-layer weights of the RBF NN can be adjusted, so training is simple; but finding an appropriate number of centers from the input vectors is not easy. A large number of centers must therefore be selected from the input data, which keeps the network architecture quite complex even for fairly simple problems.
The stochastic gradient approach to the RBF NN allows all three kinds of network parameters to be adjusted (the weights, the center locations, and the widths of the RBFs).
Define the instantaneous error cost function

$$J(n)=\frac{1}{2}e^{2}(n)=\frac{1}{2}\left[y_d(n)-\sum_{k=1}^{N}w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\right]^{2} \tag{3.157}$$

If the Gaussian is chosen for the RBF, (3.157) becomes

$$J(n)=\frac{1}{2}\left[y_d(n)-\sum_{k=1}^{N}w_k(n)\exp\!\left(-\frac{\lVert x(n)-c_k(n)\rVert^{2}}{\sigma_k^{2}(n)}\right)\right]^{2} \tag{3.158}$$

The update equations for the network parameters are:

$$w(n+1)=w(n)-\mu_w\frac{\partial J(n)}{\partial w(n)} \tag{3.159}$$

$$c_k(n+1)=c_k(n)-\mu_c\frac{\partial J(n)}{\partial c_k(n)} \tag{3.160}$$

$$\sigma_k(n+1)=\sigma_k(n)-\mu_\sigma\frac{\partial J(n)}{\partial \sigma_k(n)} \tag{3.161}$$

Using steepest descent, J(n) is partially differentiated with respect to w, c, and sigma.
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with the stochastic gradient-based method
Step 1: Choose the centers for the RBF functions from the input vectors randomly.
Step 2: Calculate the initial value of the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network to some small random values.
Step 4: Present an input vector, and compute the network output according to

$$y(n)=\sum_{k=1}^{N}w_k\,K\!\left(\lVert x(n)-c_k\rVert\right)$$
Radial Basis Function Neural Networks (cont.)
Step 5: Update the network parameters:

$$w(n+1)=w(n)+\mu_w\,e(n)\,\varphi(n)$$

$$c_k(n+1)=c_k(n)+\mu_c\,e(n)\,w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\frac{2\left[x(n)-c_k(n)\right]}{\sigma_k^{2}(n)}$$

$$\sigma_k(n+1)=\sigma_k(n)+\mu_\sigma\,e(n)\,w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\frac{2\,\lVert x(n)-c_k(n)\rVert^{2}}{\sigma_k^{3}(n)}$$

where

$$\varphi(n)=\left[K\!\left(\lVert x(n)-c_1\rVert\right),K\!\left(\lVert x(n)-c_2\rVert\right),\ldots,K\!\left(\lVert x(n)-c_N\rVert\right)\right]^{T}$$

$$e(n)=y_d(n)-y(n)$$

(Note: this differs from the textbook.)
Step 6: Stop if the network has converged; else, go back to Step 4.
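One iteration of Steps 4 and 5 can be sketched as follows, following the update formulas as printed on this slide (which the slide itself notes differ from the textbook). The function name, learning rates, and the per-center sigma vector are illustrative assumptions.

```python
import numpy as np

def sgd_rbf_step(x, yd, w, C, sigma, mu_w=0.05, mu_c=0.01, mu_s=0.01):
    """One stochastic-gradient update of w, the centers c_k, and sigma_k."""
    diff = x - C                                # x(n) - c_k(n), shape (N, dim)
    r2 = (diff ** 2).sum(axis=1)                # ||x(n) - c_k(n)||^2
    phi = np.exp(-r2 / sigma ** 2)              # Gaussian responses phi(n)
    e = yd - w @ phi                            # e(n) = y_d(n) - y(n)
    # All three updates use the old parameter values
    w_new = w + mu_w * e * phi
    C_new = C + mu_c * e * (w * phi * 2.0 / sigma ** 2)[:, None] * diff
    s_new = sigma + mu_s * e * w * phi * 2.0 * r2 / sigma ** 3
    return w_new, C_new, s_new, e
```

Repeatedly presenting training patterns drives e(n) toward zero; with the cost being non-convex in the centers and spreads, the loop can settle in a local minimum, as noted on the next slide.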
Radial Basis Function Neural Networks (cont.)
Properties of the stochastic gradient-based method:
- With respect to the output-layer weights, the instantaneous error cost function is a convex surface; with respect to the centers and spread parameters of the RBF NN it is not necessarily convex, so the training algorithm tends to get stuck in local minima.
- In general, the learning-rate parameters mu_w, mu_c, and mu_sigma are set to different values.
- The learning rules are still less complex than BP, because the RBF NN needs to adjust only the output-layer weights.
Radial Basis Function Neural Networks (cont.) Orthogonal Least Squares
The main challenge in designing an RBF NN lies in the selection of the centers, which is usually done randomly or with the stochastic gradient method, and always tends to produce quite large networks.
OLS (Orthogonal Least Squares) provides a systematic way of selecting centers that can greatly reduce the size of the RBF NN:
- Treat all training patterns as potential centers.
- Select a new center from the potential centers one at a time (the one that most reduces the network output error becomes the new center).
- Repeat the previous step, each time picking from the input patterns the one with the smallest output error as the newly added center, until the OLS stopping criterion for center selection is satisfied.
Gram-Schmidt orthogonalization is the process of decomposing a matrix M into the product of two matrices (it transforms the set of vectors in M into a set of orthogonal basis vectors W such that W spans the same space as M):

$$M=WA \tag{3.162}$$
Radial Basis Function Neural Networks (cont.)
where A is an upper triangular matrix

$$A=\begin{bmatrix}1&\alpha_{12}&\cdots&\alpha_{1m}\\0&1&\cdots&\alpha_{2m}\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{bmatrix} \tag{3.163}$$

and W consists of m mutually orthogonal vectors:

$$W^{T}W=\operatorname{diag}(h_1,h_2,\ldots,h_m) \tag{3.164}$$

Orthogonal decomposition theorem
Any vector can be decomposed uniquely with respect to a subspace into two mutually orthogonal parts: one part is parallel to the subspace Y, and the other is perpendicular to it,

$$m=m_{Y}+e,\qquad e\perp Y \tag{3.165}$$

where m_Y is the projection of m onto the space Y.
Radial Basis Function Neural Networks (cont.)
Gram-Schmidt orthogonalization algorithm
Step 1: Set the first basis vector equal to one of the columns of matrix M:

$$w_1=m_1$$

Step k: Extract the k-th basis vector so that it is orthogonal to the previous k-1 vectors:

$$\alpha_{ik}=\frac{w_i^{T}m_k}{w_i^{T}w_i},\qquad 1\le i\le k-1$$

$$w_k=m_k-\sum_{i=1}^{k-1}\alpha_{ik}\,w_i$$

Repeat step k until k = m.
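The two steps can be sketched as classical Gram-Schmidt; `gram_schmidt` is an illustrative name, and the columns of `M` are assumed linearly independent.

```python
import numpy as np

def gram_schmidt(M):
    """Factor M = W A with mutually orthogonal columns in W (Eq. 3.162)."""
    m = M.shape[1]
    W = np.zeros_like(M, dtype=float)
    A = np.eye(m)                        # upper triangular, unit diagonal
    W[:, 0] = M[:, 0]                    # Step 1: w_1 = m_1
    for k in range(1, m):                # Step k: orthogonalize m_k
        wk = M[:, k].astype(float)
        for i in range(k):
            A[i, k] = W[:, i] @ M[:, k] / (W[:, i] @ W[:, i])
            wk -= A[i, k] * W[:, i]
        W[:, k] = wk
    return W, A
```

The returned A matches (3.163) and the columns of W satisfy (3.164).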
Orthogonal decomposition theorem
[Figure: geometric illustration of orthogonalizing m_1, m_2, m_3 into w_1, w_2, w_3]

$$w_1=m_1$$

$$w_2=m_2-\frac{w_1^{T}m_2}{w_1^{T}w_1}\,w_1$$

$$w_3=m_3-\frac{w_1^{T}m_3}{w_1^{T}w_1}\,w_1-\frac{w_2^{T}m_3}{w_2^{T}w_2}\,w_2$$
Radial Basis Function Neural Networks (cont.)
Orthogonal Least Squares Regression
The RBF NN can be viewed as a regression model. Once appropriate centers have been selected and the RBF NN constructed, substituting all training patterns gives

$$\begin{bmatrix}y_d(1)\\y_d(2)\\\vdots\\y_d(Q)\end{bmatrix}=\begin{bmatrix}K(\lVert x_1-c_1\rVert,\sigma_1)&K(\lVert x_1-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_1-c_N\rVert,\sigma_N)\\K(\lVert x_2-c_1\rVert,\sigma_1)&K(\lVert x_2-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_2-c_N\rVert,\sigma_N)\\\vdots&\vdots&&\vdots\\K(\lVert x_Q-c_1\rVert,\sigma_1)&K(\lVert x_Q-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_Q-c_N\rVert,\sigma_N)\end{bmatrix}\begin{bmatrix}w_1\\w_2\\\vdots\\w_N\end{bmatrix}+\begin{bmatrix}e_1\\e_2\\\vdots\\e_Q\end{bmatrix} \tag{3.166}$$

or

$$y_d=\Phi w+e \tag{3.167}$$

where e holds the errors between the desired and actual network outputs.
There are Q potential centers in total. If all Q centers are used, an error-free mapping is obtained, but the network size becomes too large; OLS is therefore used to select N centers, N < Q.
Radial Basis Function Neural Networks (cont.)
At each stage of the OLS regression, the input that produces the largest increase in the variance of the desired output is chosen as the new center.
Suppose N < Q centers are chosen. The least-squares solution yielding the weights w_1, w_2, ..., w_N is

$$w=\Phi^{+}y_d \tag{3.168}$$

The actual output of the network is

$$\hat{y}=\Phi w \tag{3.169}$$

By using the Gram-Schmidt orthogonalization, the regression matrix can be decomposed as

$$\Phi=BA=\begin{bmatrix}b_1,b_2,\ldots,b_N\end{bmatrix}\begin{bmatrix}1&a_{12}&a_{13}&\cdots&a_{1N}\\0&1&a_{23}&\cdots&a_{2N}\\\vdots&\vdots&\ddots&&\vdots\\0&0&\cdots&0&1\end{bmatrix} \tag{3.170}$$
Radial Basis Function Neural Networks (cont.)
In (3.170), A is an upper triangular matrix with ones on the diagonal, and B consists of mutually orthogonal vectors:

$$B^{T}B=H=\operatorname{diag}(h_1,h_2,\ldots,h_N) \tag{3.171}$$

where H is a diagonal matrix whose elements h_i are

$$h_i=b_i^{T}b_i=\sum_{k=1}^{Q}b_{ki}^{2} \tag{3.172}$$

Substituting (3.170) into (3.167) gives

$$y_d=\Phi w+e=BAw+e=Bg+e \tag{3.173}$$
Radial Basis Function Neural Networks (cont.)
A least-squares solution for the coordinate vector g is given as

$$g=\left(B^{T}B\right)^{-1}B^{T}y_d=H^{-1}B^{T}y_d \tag{3.174}$$

The i-th coordinate of the vector g is given by

$$g_i=\frac{b_i^{T}y_d}{b_i^{T}b_i} \tag{3.175}$$

The sum of squares of the desired output y_d is

$$y_d^{T}y_d=g^{T}B^{T}Bg+e^{T}e=g^{T}Hg+e^{T}e=\sum_{i=1}^{N}g_i^{2}h_i+e^{T}e \tag{3.176}$$
Radial Basis Function Neural Networks (cont.)
Define the error reduction ratio due to the inclusion of the p-th RBF center as

$$[\mathrm{err}]_p=\frac{g_p^{2}h_p}{y_d^{T}y_d} \tag{3.177}$$

The center-selection procedure works through the training-pattern input points, each time picking the point with the largest error reduction ratio [err]_{p,max} as the newly added center and accumulating [err]_{p,max}, until the preset accuracy is satisfied.
Radial Basis Function Neural Networks (cont.)
Orthogonal least-squares algorithm for training an RBF network
Step 1: k = 1. For 1 <= i <= Q, set

$$b_1^{(i)}=\varphi_i$$

Compute the error reduction ratio of the i-th center as

$$[\mathrm{err}]_1^{(i)}=\frac{\left(b_1^{(i)T}y_d\right)^{2}}{\left(b_1^{(i)T}b_1^{(i)}\right)\left(y_d^{T}y_d\right)}$$

Find

$$[\mathrm{err}]_1^{(i_1)}=\max\left\{[\mathrm{err}]_1^{(i)},\ 1\le i\le Q\right\}$$

and set

$$b_1=b_1^{(i_1)}$$

and the center c_1 = c_{i_1}.
Step k: For 1 <= i <= Q, i != i_1, ..., i != i_{k-1}, compute

$$a_{jk}^{(i)}=\frac{b_j^{T}\varphi_i}{b_j^{T}b_j},\qquad 1\le j\le k-1$$

Set

$$b_k^{(i)}=\varphi_i-\sum_{j=1}^{k-1}a_{jk}^{(i)}\,b_j$$

Compute

$$[\mathrm{err}]_k^{(i)}=\frac{\left(b_k^{(i)T}y_d\right)^{2}}{\left(b_k^{(i)T}b_k^{(i)}\right)\left(y_d^{T}y_d\right)}$$

Find

$$[\mathrm{err}]_k^{(i_k)}=\max\left\{[\mathrm{err}]_k^{(i)},\ 1\le i\le Q,\ i\ne i_1,\ldots,i\ne i_{k-1}\right\}$$

and select

$$b_k=b_k^{(i_k)}$$

and the center c_k = c_{i_k}.
Radial Basis Function Neural Networks (cont.)
Step k+1: Repeat step k. The regression is stopped at step N1 when

$$1-\sum_{j=1}^{N_1}[\mathrm{err}]_j<\rho$$

where 0 < rho < 1 is a selected tolerance value.
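The selection loop (Step 1 through step k+1) can be sketched as follows. `ols_select_centers` is an assumed name; for clarity the orthogonalized candidate vectors are recomputed per candidate rather than cached, and `Phi` is the full Q x Q regression matrix of (3.166).

```python
import numpy as np

def ols_select_centers(Phi, yd, rho):
    """Forward OLS selection: repeatedly pick the column of Phi with the
    largest error reduction ratio (3.177) until 1 - sum([err]) < rho."""
    Q = Phi.shape[1]
    yy = yd @ yd
    selected, basis, total_err = [], [], 0.0
    while True:
        best_err, best_i, best_b = -1.0, None, None
        for i in range(Q):
            if i in selected:
                continue
            b = Phi[:, i].astype(float)
            for u in basis:                       # orthogonalize vs chosen b_j
                b = b - (u @ Phi[:, i]) / (u @ u) * u
            bb = b @ b
            if bb < 1e-12:                        # numerically dependent column
                continue
            err = (b @ yd) ** 2 / (bb * yy)       # [err]_i, Eq. (3.177)
            if err > best_err:
                best_err, best_i, best_b = err, i, b
        if best_i is None:
            break
        selected.append(best_i)
        basis.append(best_b)
        total_err += best_err
        if 1.0 - total_err < rho:                 # stopping rule of step k+1
            break
    return selected
```

The returned indices identify the N < Q training patterns kept as centers; the corresponding weights then follow from the triangular system g = Aw as in (3.173)-(3.175).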
Radial Basis Function Neural Networks (cont.)
Each newly added vector maximizes the increment of the desired network output energy. Every vector included in the regression matrix corresponds to one center among the output data.
The addition of vectors stops once the regression has captured enough of the energy of the desired network output. In the end, OLS produces a relatively small network.
The parameter rho has a major influence on the balance between network accuracy and complexity. If it is too large, the network accuracy is high, but there are too many centers, causing overfitting; if it is too small, the network accuracy drops, but the network is more compact.
In general, rho is set to

$$\rho=1-\frac{\sigma_n^{2}}{\sigma_d^{2}}$$
Radial Basis Function Neural Networks (cont.)
Example 3.5: Design an RBF NN that approximates the mapping z = f(x, y) = cos(3x) sin(2y), -1 <= x <= 1, -1 <= y <= 1.
121 centers; spread parameter = 0.3. The goal is to train the network to map these 121 centers.
After OLS, 28 centers remain, a reduction of 77%.
Radial Basis Function Neural Networks (cont.)