Chapter 3 Mapping Networks
National Yunlin University of Science and Technology, Graduate Institute of Computer Science and Information Engineering — Chuan-Yu Chang, Ph.D.


Page 1: Chapter 3 Mapping Networks

Chapter 3: Mapping Networks

National Yunlin University of Science and Technology, Graduate Institute of Computer Science and Information Engineering
Chuan-Yu Chang, Ph.D.
Office: ES 709
TEL: 05-5342601 ext. 4337
E-mail: [email protected]

Page 2: Chapter 3 Mapping Networks

Introduction

Mapping networks:
- Associative memory networks
- Feed-forward multilayer perceptrons
- Counter-propagation networks
- Radial basis function networks

Figure: a mapping network ANN(·) maps an input vector x to an output vector y.

Page 3: Chapter 3 Mapping Networks

Associative Memory Networks

Memory capability: for an information system it is important both to remember information and to reduce the amount of information that must be stored.
The stored information must be placed appropriately in the network's memory; that is, given a key input (stimulus), the associative memory retrieves the memorized pattern and presents it at the output in a suitable form.

Page 4: Chapter 3 Mapping Networks

Associative Memory Networks (cont.)

In neurobiological systems, the notion of memory is tied to changes in the nervous system that result from its interaction with the environment and the organism; if no change occurs, no memory exists.
For memory to be useful it must be accessible to the nervous system, so that learning can take place and information can later be retrieved.
A pattern is stored in memory through a learning process.

Page 5: Chapter 3 Mapping Networks

Associative Memory Networks (cont.)

Depending on the retention time, there are two forms of memory:
- Long-term memory
- Short-term memory

There are two basic types of associative memory:
- Autoassociative memory: a key input vector is associated with itself; the input and output spaces have the same dimension.
- Heteroassociative memory: key input vectors are associated with arbitrary memorized vectors; the output space dimension may differ from the input space dimension.

Page 6: Chapter 3 Mapping Networks

General Linear Distributed Associative Memory

During learning, a general linear distributed associative memory is presented with a key input pattern (vector); the memory then transforms this vector into a stored (memorized) pattern.

Page 7: Chapter 3 Mapping Networks

General Linear Distributed Associative Memory (cont.)

Input (key input pattern):
$x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T$   (3.1)
Output (memorized pattern):
$y_k = [y_{k1}, y_{k2}, \ldots, y_{kn}]^T$   (3.2)
For an n-dimensional neural structure, h patterns can be associated, with h <= n (in practice h < n).

Page 8: Chapter 3 Mapping Networks

General Linear Distributed Associative Memory (cont.)

The linear mapping between the key vector x_k and the memorized vector y_k can be written as
$y_k = W(k)\, x_k$   (3.3)
where W(k) is a weight matrix. From these weight matrices the memory matrix M can be constructed; it describes the sum of the weight matrices for every input/output pair:
$M = \sum_{k=1}^{h} W(k)$   (3.4)
or, recursively,
$M_k = M_{k-1} + W(k), \quad k = 1, 2, \ldots, h, \qquad M_0 = 0$   (3.5)
The memory matrix can be thought of as representing the collective experience of the network.

Page 9: Chapter 3 Mapping Networks

Correlation Matrix Memory

Estimation of the memory matrix by the outer product rule:
$\hat M = \sum_{k=1}^{h} y_k\, x_k^T$   (3.6)
or, recursively,
$\hat M_k = \hat M_{k-1} + y_k\, x_k^T, \quad k = 1, 2, \ldots, h$   (3.7)
Each outer product matrix $y_k x_k^T$ is an estimate of the weight matrix W(k) that maps the key pattern x_k onto the memorized pattern y_k.
An associative memory designed with the outer product rule is called a correlation matrix memory.
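A minimal sketch of the outer-product construction (3.6) and the recall step (3.8), assuming orthonormal key vectors; the data below are made up for illustration:

```python
import numpy as np

# Correlation matrix memory: build M_hat = sum_k y_k x_k^T and recall with a key.
keys = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0])]          # unit-length, orthogonal key vectors
memories = [np.array([0.2, 0.8]),
            np.array([0.9, 0.1])]           # memorized patterns

M_hat = sum(np.outer(y, x) for x, y in zip(keys, memories))   # Eq. (3.6)

y_recalled = M_hat @ keys[0]                # Eq. (3.8)
print(y_recalled)                           # equals memories[0] because the keys are orthonormal
```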

Page 10: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

When a key pattern is presented, the memory matrix should recall the corresponding memorized pattern. Assume the estimated memory matrix has been learned from the h associations via (3.6); presenting the q-th key input x_q, the memorized pattern y_q is estimated as
$y = \hat M\, x_q = \sum_{k=1}^{h} y_k\, x_k^T x_q = \sum_{k=1}^{h} \big(x_k^T x_q\big)\, y_k$   (3.8)
Substituting (3.6) into (3.8), this can be rewritten as
$y = \big(x_q^T x_q\big)\, y_q + \sum_{k=1,\,k \ne q}^{h} \big(x_k^T x_q\big)\, y_k$   (3.9)
Assume each key input vector has normalized unit length:
$x_k^T x_k = 1, \quad k = 1, 2, \ldots, h$   (3.10)

Page 11: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

This normalized unit length can be obtained by dividing each key vector by its Euclidean norm:
$x_k \leftarrow \dfrac{x_k}{\|x_k\|_2}$   (3.11)
With (3.10), (3.9) can be rewritten as
$y = y_q + \sum_{k=1,\,k \ne q}^{h} \big(x_k^T x_q\big)\, y_k = y_q + z_q$   (3.12)
where y_q is the desired response and
$z_q = \sum_{k=1,\,k \ne q}^{h} \big(x_k^T x_q\big)\, y_k$   (3.13)
is the noise, or crosstalk, term.

Page 12: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

From (3.13), when the associated key pattern is presented: if y = y_q (i.e., z_q = 0), the input x_q recalls its corresponding output y_q exactly; if z_q is nonzero, noise or crosstalk is present.
If the key inputs are mutually orthogonal, the crosstalk term is zero and a perfect memorized pattern is obtained.
If the key inputs are only linearly independent, Gram-Schmidt orthogonalization can be applied before computing the memory matrix to produce orthogonal key vectors:
$g_k = x_k - \sum_{i=1}^{k-1} \dfrac{g_i^T x_k}{g_i^T g_i}\, g_i, \qquad g_1 = x_1$   (3.14)
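A short sketch of the Gram-Schmidt orthogonalization of (3.14), applied to linearly independent keys before building the memory matrix; the example keys are made up:

```python
import numpy as np

def gram_schmidt(keys):
    """Return orthogonal, unit-length versions of the key vectors (Eq. 3.14)."""
    basis = []
    for x in keys:
        g = x.astype(float).copy()
        for gi in basis:
            g -= (gi @ x) / (gi @ gi) * gi      # remove projection onto earlier vectors
        basis.append(g)
    return [g / np.linalg.norm(g) for g in basis]  # unit length, as in Eq. (3.10)

keys = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
for g in gram_schmidt(keys):
    print(g)
```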

Page 13: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The storage capacity of the associative memory is at most n, the dimension of the key vectors; the storage limit depends on the rank of the memory matrix $\hat M$.

Page 14: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

Example 3.1 — Autoassociative memory
The memory is trained with three key vectors x_1, x_2, x_3, each of unit length (3.15).

Page 15: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The angles between these vectors are computed from
$\theta_{ij} = \cos^{-1}\!\left(\dfrac{x_i^T x_j}{\|x_i\|_2\,\|x_j\|_2}\right)$   (3.16)-(3.18)
These three vectors are far from being mutually orthogonal.

Page 16: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The memory matrix of the autoassociative network is
$\hat M = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T$   (3.19)
Using (3.19), the key patterns are estimated as
$\hat x_1 = \hat M x_1, \quad \hat x_2 = \hat M x_2, \quad \hat x_3 = \hat M x_3$   (3.20)

Page 17: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The Euclidean distances of each response vector from each of the key vectors, $\|\hat x_i - x_j\|_2$, are computed in (3.21)-(3.23). For each response $\hat x_i$, the distance to its own key vector $x_i$ is the smallest, so the correct pattern is recalled, although with appreciable error because the key vectors are far from orthogonal.

Page 18: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

A second set of three key vectors of unit length is now used (3.24). The angles between these vectors, computed as in (3.16)-(3.18), are all close to 90 degrees (3.25)-(3.27); the vectors are nearly mutually orthogonal.

Page 19: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The memory matrix is again
$\hat M = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T$   (3.28)
and the estimates of the key patterns are
$\hat x_1 = \hat M x_1, \quad \hat x_2 = \hat M x_2, \quad \hat x_3 = \hat M x_3$   (3.29)

Page 20: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The Euclidean distances of the response vectors from the key vectors are computed in (3.30)-(3.32). Compared with (3.21)-(3.23), the recall errors are much lower, because this set of key vectors is nearly orthogonal.

Page 21: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

An error correction approach for correlation matrix memories:
- The drawback of the associative memory is the relatively large number of errors that can occur during recall.
- The simplest correlation matrix memory has no provision for correcting errors, because it lacks feedback from the output to the input.

Page 22: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The objective of the error correction approach is to have the associative memory reconstruct the memorized pattern in an optimal sense.
The error vector is defined as
$e_k = y_k - \hat M(k)\, x_k$   (3.33)
where y_k is the desired pattern associated with the key input pattern x_k. The discrete-time learning rule based on steepest descent is
$\hat M(k+1) = \hat M(k) - \eta\, \nabla_{\hat M}\, E(\hat M)$   (3.34)

Page 23: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

where the energy function is defined as
$E(\hat M) = \dfrac{1}{2}\,\|e_k\|_2^2 = \dfrac{1}{2}\,\big\|y_k - \hat M(k)\, x_k\big\|_2^2$   (3.35)
Computing the gradient of (3.35) gives
$\nabla_{\hat M}\, E = -\big[y_k - \hat M(k)\, x_k\big]\, x_k^T$   (3.36)
Substituting (3.36) into (3.34),
$\hat M(k+1) = \hat M(k) + \eta\,\big[y_k - \hat M(k)\, x_k\big]\, x_k^T$   (3.37)

Page 24: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

The bracketed term in the second part of (3.37) is the error vector of (3.33), so the rule has an error-correcting effect.
Expanding (3.37) gives
$\hat M(k+1) = \hat M(k) + \eta\, y_k\, x_k^T - \eta\, \hat M(k)\, x_k\, x_k^T$   (3.38)
Compared with (3.7), an additional term appears. The error-correction-based supervised learning of (3.37) is obtained by repeatedly training on each of the h different association patterns:
$(x_k, y_k), \quad k = 1, 2, \ldots, h$   (3.39)

Page 25: Chapter 3 Mapping Networks

Correlation Matrix Memory (cont.)

Selecting the learning rate parameter:
- Either a fixed learning rate or a learning rate adjusted over time can be used.
- For each association, the iterative adjustments to the memory matrix (3.37) continue until the error vector e_k of (3.33) becomes negligibly small.
- The initial memory matrix is M(0) = 0.
Comparing with (2.29) shows that (3.37) is also an LMS rule.
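A sketch of the error-correcting (LMS-style) update of (3.37), cycling over the h associations; the learning rate, epoch count, and data are illustrative assumptions:

```python
import numpy as np

def train_memory(keys, memories, lr=0.1, epochs=200):
    n_out, n_in = len(memories[0]), len(keys[0])
    M = np.zeros((n_out, n_in))             # M(0) = 0
    for _ in range(epochs):
        for x, y in zip(keys, memories):
            e = y - M @ x                   # error vector, Eq. (3.33)
            M += lr * np.outer(e, x)        # update, Eq. (3.37)
    return M

keys = [np.array([0.8, 0.6]), np.array([-0.6, 0.8])]
memories = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
M = train_memory(keys, memories)
print(M @ keys[0])    # approximately memories[0]
```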

Page 26: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm

Backpropagation (BP) algorithm — the generalized delta rule:
- Training MLPs with the backpropagation algorithm results in a nonlinear mapping.
- The MLP can have its synaptic weights adjusted by the BP algorithm to develop a specific nonlinear mapping.
- The weights, fixed after the training process, can then perform an association task for classification, pattern recognition, diagnosis, etc.
- During the training phase of the MLP, the synaptic weights are adjusted to minimize the disparity between the actual and desired outputs of the MLP, averaged over all input patterns.

Page 27: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Basic BP algorithm for the feedforward MLP: a three-layer network with one output layer and two hidden layers.

Page 28: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The standard BP algorithm for training the MLP NN is based on the steepest descent gradient approach applied to the minimization of an energy function representing the instantaneous error:
$E_q = \dfrac{1}{2}\,\big(d_q - x^{(3)}_{out}\big)^T\big(d_q - x^{(3)}_{out}\big) = \dfrac{1}{2}\sum_{h=1}^{n_3}\big(d_{qh} - x^{(3)}_{out,h}\big)^2$   (3.40)
where d_q is the desired output for the q-th input and $x^{(3)}_{out} = y_q$ is the actual output of the MLP.

Page 29: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Using the steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by
$\Delta w^{(s)}_{ji} = -\eta_s\, \dfrac{\partial E_q}{\partial w^{(s)}_{ji}}$   (3.41)
where s = 1, 2, 3 indexes the layer.

Page 30: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The weights in the output layer can be updated according to
$\Delta w^{(3)}_{ji} = -\eta_3\, \dfrac{\partial E_q}{\partial w^{(3)}_{ji}}$   (3.42)
Using the chain rule for the partial derivatives, (3.42) can be rewritten as
$\Delta w^{(3)}_{ji} = -\eta_3\, \dfrac{\partial E_q}{\partial v^{(3)}_{j}}\, \dfrac{\partial v^{(3)}_{j}}{\partial w^{(3)}_{ji}}$   (3.43)

Page 31: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Evaluating each factor in (3.43):
$\dfrac{\partial v^{(3)}_{j}}{\partial w^{(3)}_{ji}} = \dfrac{\partial}{\partial w^{(3)}_{ji}} \sum_{h=1}^{n_2} w^{(3)}_{jh}\, x^{(2)}_{out,h} = x^{(2)}_{out,i}$   (3.44)
$\dfrac{\partial E_q}{\partial v^{(3)}_{j}} = \dfrac{\partial}{\partial v^{(3)}_{j}}\, \dfrac{1}{2}\sum_{h=1}^{n_3}\big[d_{qh} - f(v^{(3)}_{h})\big]^2 = -\big[d_{qj} - f(v^{(3)}_{j})\big]\, g(v^{(3)}_{j})$   (3.45)
or
$\delta^{(3)}_{j} = -\dfrac{\partial E_q}{\partial v^{(3)}_{j}} = \big(d_{qj} - x^{(3)}_{out,j}\big)\, g(v^{(3)}_{j})$   (3.46)
which is the local error (delta) of the j-th output neuron; E_q here is given by (3.40) and g = f' is the derivative of the activation function.

Page 32: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Combining (3.43), (3.44), and (3.46), the learning rule for the weights in the output layer of the network is
$\Delta w^{(3)}_{ji} = \eta_3\, \delta^{(3)}_{j}\, x^{(2)}_{out,i}$   (3.47)
or
$w^{(3)}_{ji}(k+1) = w^{(3)}_{ji}(k) + \eta_3\, \delta^{(3)}_{j}\, x^{(2)}_{out,i}$   (3.48)
In the hidden layer, applying the steepest descent gradient approach,
$\Delta w^{(2)}_{ji} = -\eta_2\, \dfrac{\partial E_q}{\partial w^{(2)}_{ji}} = -\eta_2\, \dfrac{\partial E_q}{\partial v^{(2)}_{j}}\, \dfrac{\partial v^{(2)}_{j}}{\partial w^{(2)}_{ji}}$   (3.49)

Page 33: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The second partial derivative on the right-hand side of (3.49) is
$\dfrac{\partial v^{(2)}_{j}}{\partial w^{(2)}_{ji}} = \dfrac{\partial}{\partial w^{(2)}_{ji}} \sum_{h=1}^{n_1} w^{(2)}_{jh}\, x^{(1)}_{out,h} = x^{(1)}_{out,i}$   (3.50)
Writing the error in terms of the hidden-layer outputs,
$E_q = \dfrac{1}{2}\sum_{h=1}^{n_3}\Big[d_{qh} - f\Big(\sum_{p=1}^{n_2} w^{(3)}_{hp}\, x^{(2)}_{out,p}\Big)\Big]^2$   (3.51)
so that
$\dfrac{\partial E_q}{\partial v^{(2)}_{j}} = -\sum_{h=1}^{n_3}\big(d_{qh} - x^{(3)}_{out,h}\big)\, g(v^{(3)}_{h})\, w^{(3)}_{hj}\, g(v^{(2)}_{j}) = -\,g(v^{(2)}_{j})\sum_{h=1}^{n_3}\delta^{(3)}_{h}\, w^{(3)}_{hj}$   (3.52)

Page 34: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Combining equations (3.49), (3.50), and (3.52) yields
$\Delta w^{(2)}_{ji} = \eta_2\, \delta^{(2)}_{j}\, x^{(1)}_{out,i}$   (3.53)
or
$w^{(2)}_{ji}(k+1) = w^{(2)}_{ji}(k) + \eta_2\, \delta^{(2)}_{j}\, x^{(1)}_{out,i}$   (3.54)

Page 35: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Generalized weight-update form:
$w^{(s)}_{ji}(k+1) = w^{(s)}_{ji}(k) + \eta_s\, \delta^{(s)}_{j}\, x^{(s-1)}_{out,i}$   (3.55)
where, for the output layer,
$\delta^{(s)}_{j} = \big(d_{qj} - x^{(s)}_{out,j}\big)\, g(v^{(s)}_{j})$   (3.56)
and, for the hidden layers,
$\delta^{(s)}_{j} = g(v^{(s)}_{j}) \sum_{h=1}^{n_{s+1}} \delta^{(s+1)}_{h}\, w^{(s+1)}_{hj}$   (3.57)

Page 36: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Standard BP algorithm:
Step 1: Initialize the network synaptic weights to small random values.
Step 2: From the set of training input/output pairs, present an input pattern and calculate the network response.
Step 3: Compare the desired network response with the actual output of the network, and compute all the local errors using (3.56) and (3.57).
Step 4: Update the weights of the network according to (3.55).
Step 5: Repeat steps 2 through 4 until the network reaches a predetermined level of accuracy in producing the adequate response for all the training patterns.
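A minimal sketch of the standard BP loop (3.55)-(3.57) for a one-hidden-layer sigmoid MLP; the slides describe a two-hidden-layer network, but the same local-error recursion applies layer by layer. The bias handling, layer sizes, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_bp(X, D, n_hidden=8, lr=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])          # constant input acts as a bias
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, Xb.shape[1]))    # hidden-layer weights
    W2 = rng.uniform(-0.5, 0.5, (D.shape[1], n_hidden + 1)) # output-layer weights (+ bias unit)
    for _ in range(epochs):
        for x, d in zip(Xb, D):
            h = np.append(sigmoid(W1 @ x), 1.0)             # forward pass, hidden outputs + bias
            y = sigmoid(W2 @ h)
            delta_out = (d - y) * y * (1 - y)               # local error, output layer, Eq. (3.56)
            delta_hid = h * (1 - h) * (W2.T @ delta_out)    # local error, hidden layer, Eq. (3.57)
            W2 += lr * np.outer(delta_out, h)               # weight updates, Eq. (3.55)
            W1 += lr * np.outer(delta_hid[:-1], x)
    return W1, W2

# toy XOR-style data (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_bp(X, D)
```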

Page 37: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Some practical issues in using standard BP — initialization of synaptic weights:
- The weights are initially set to small random values; if they are set too large, the neurons are likely to saturate.
- Heuristic algorithm (Ham and Kostanic, 2001): set the weights to uniformly distributed random numbers in the interval from -0.5/fan-in to 0.5/fan-in.

Page 38: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Nguyen and Widrow's initialization algorithm (suitable for an architecture with a single hidden layer). Define:
- n0: number of components in the input layer
- n1: number of neurons in the hidden layer
- beta: scaling factor

Step 1: Compute the scaling factor according to
$\beta = 0.7\, n_1^{1/n_0}$   (3.58)
Step 2: Initialize the weights w_ij of a layer as random numbers between -0.5 and 0.5.
Step 3: Reinitialize the weights according to
$w_{ij} \leftarrow \dfrac{\beta\, w_{ij}}{\Big(\sum_{j} w_{ij}^2\Big)^{1/2}}$   (3.59)
Step 4: For the i-th neuron in the hidden layer, set the bias to a random number between $-\beta$ and $\beta$.
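A short sketch of the Nguyen-Widrow initialization (3.58)-(3.59) for a single hidden layer; sizes are illustrative:

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    beta = 0.7 * n_hidden ** (1.0 / n_in)                     # Eq. (3.58)
    W = rng.uniform(-0.5, 0.5, (n_hidden, n_in))              # Step 2
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)      # Eq. (3.59), row-wise norm
    b = rng.uniform(-beta, beta, n_hidden)                    # Step 4
    return W, b

W, b = nguyen_widrow_init(n_in=2, n_hidden=8)
```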

Page 39: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Network configuration and the ability of the network to generalize:
- The configuration of the MLP NN is determined by the number of hidden layers, the number of neurons in each hidden layer, and the activation function used by the neurons.
- Network performance is influenced less by the form of the activation function than by the number of hidden layers and the number of neurons in each hidden layer.
- The number of hidden-layer neurons is usually determined by trial and error. MLP NNs are generally designed with two hidden layers.
- In general, more hidden-layer neurons can ensure good network performance, but an "over-designed" architecture leads to "over-fitting" and a loss of the network's ability to generalize.

Page 40: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Example 3.2
The network has one hidden layer with 50 neurons and is trained to approximate the nonlinear function
$y = e^{-x}\sin(3x)$
(a) Over [0, 4], sampled every 0.2, giving 21 points.
(b) Over [0, 4], sampled every 0.01, giving 400 points (over-fitting is observed).

Page 41: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Independent validation:
- Using the training data to assess the final performance quality of the network leads to over-fitting.
- Independent validation avoids this problem: the available data are split into a training set and a testing set.
- The data are first randomized and then divided into two parts: the training set is used to update the network weights, and the testing set is used to assess the training performance.

Page 42: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Speed of convergence:
- The convergence properties depend on the magnitude of the learning rate. To guarantee convergence and avoid oscillation during training, the learning rate must be set to a fairly small value.
- If training starts far from the global minimum, many neurons saturate, the gradients become small, and the training can get stuck in local minima of the error surface, slowing convergence.
- Fast algorithms fall into two broad classes: those consisting of various heuristic improvements to the standard BP algorithm, and those involving standard numerical optimization techniques.

Page 43: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Backpropagation learning algorithm with momentum updating:
The weights are updated in a direction that is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of the training.
The weights are updated according to
$w^{(s)}_{ji}(k+1) = w^{(s)}_{ji}(k) + \eta_s\, \delta^{(s)}_{j}(k)\, x^{(s-1)}_{out,i}(k) + \alpha\, \Delta w^{(s)}_{ji}(k)$   (3.61)
or
$w^{(s)}_{ji}(k+1) = w^{(s)}_{ji}(k) + \eta_s\, \delta^{(s)}_{j}(k)\, x^{(s-1)}_{out,i}(k) + \alpha\, \eta_s\, \delta^{(s)}_{j}(k-1)\, x^{(s-1)}_{out,i}(k-1)$   (3.62)
where the last term is the momentum term.

Page 44: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

This type of learning improves convergence:
- If the training patterns contain some element of uncertainty, updating with momentum provides a sort of low-pass filtering by preventing rapid changes in the direction of the weight updates.
- It renders the training relatively immune to the presence of outliers or erroneous training pairs.
- It increases the rate of weight change, so the speed of convergence is increased.
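A minimal sketch of a momentum step in the spirit of (3.61), where the new weight change combines the current gradient step with the previous weight change; the learning rate, momentum factor, and gradient values are illustrative:

```python
import numpy as np

def momentum_step(W, grad, prev_dW, lr=0.1, alpha=0.9):
    dW = -lr * grad + alpha * prev_dW    # current gradient step plus momentum term
    return W + dW, dW

W = np.zeros((2, 2))
prev_dW = np.zeros_like(W)
grad = np.array([[0.2, -0.1], [0.05, 0.3]])   # hypothetical dE/dW
W, prev_dW = momentum_step(W, grad, prev_dW)
```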

Page 45: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The preceding points can be expressed by the update equation
$\Delta w^{(s)}_{ji}(k) = -\eta_s\, \dfrac{\partial E_q}{\partial w^{(s)}_{ji}} + \alpha\, \Delta w^{(s)}_{ji}(k-1)$   (3.63)
If the network is operating in a flat region of the error surface, the gradient does not change, so (3.63) can be rewritten as
$\Delta w^{(s)}_{ji} \approx -\eta_s\, \dfrac{\partial E_q}{\partial w^{(s)}_{ji}}\,\big(1 + \alpha + \alpha^2 + \cdots\big) = -\dfrac{\eta_s}{1-\alpha}\, \dfrac{\partial E_q}{\partial w^{(s)}_{ji}}$   (3.64)
Since the forgetting factor $\alpha$ is always less than 1, the effective learning rate can be defined as
$\eta_{\mathrm{eff}} = \dfrac{\eta_s}{1-\alpha}$   (3.65)

Page 46: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Batch updating:
- Standard BP updates the weights after each input/output training pair.
- Batch updating instead accumulates the corrections over many training patterns before the weights are updated (it can be viewed as averaging the corrections of the individual I/O pairs and then applying the averaged correction).
- Batch updating has the following advantages: it gives a much better estimate of the error surface, it provides some inherent low-pass filtering of the training patterns, and it is suitable for more sophisticated optimization procedures.

Page 47: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Search-then-converge method — a heuristic strategy for speeding up BP:
- Search phase: the network is relatively far from the global minimum; the learning rate is kept sufficiently large and relatively constant.
- Converge phase: the network is approaching the global minimum; the learning rate is decreased at each iteration.

Page 48: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Since in practice it is not possible to know how far the network is from the global minimum, the learning rate can be scheduled by either of the following:
$\eta(k) = \dfrac{\eta_0}{1 + k/k_0}$   (3.66)
$\eta(k) = \eta_0\, \dfrac{1 + \dfrac{c}{\eta_0}\,\dfrac{k}{k_0}}{1 + \dfrac{c}{\eta_0}\,\dfrac{k}{k_0} + k_0\,\dfrac{k^2}{k_0^2}}$   (3.67)
Typically 1 < c/η_0 < 100 and 100 < k_0 < 500.
- When k << k_0, the learning rate is approximately η_0 (search phase).
- When k >> k_0, the learning rate decreases at a rate between 1/k and 1/k^2 (converge phase).
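A one-function sketch of the simpler schedule (3.66); eta0 and k0 are illustrative values within the ranges quoted above:

```python
def eta_schedule(k, eta0=0.1, k0=200):
    """Search-then-converge learning rate of Eq. (3.66)."""
    return eta0 / (1.0 + k / k0)

for k in (0, 100, 1000, 10000):
    print(k, eta_schedule(k))   # stays near eta0 early, decays roughly like 1/k later
```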

Page 49: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Batch updating with variable learning rate — a simple heuristic strategy to increase the convergence speed of BP with batch update:
- Increase the magnitude of the learning rate if the learning in the previous step has decreased the total error function.
- If the error function has increased, decrease the learning rate.

Page 50: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The algorithm can be summarized as follows:
- If the error function over the entire training set has decreased, increase the learning rate by multiplying it by a factor greater than 1 (typically 1.05).
- If the error function has increased by more than some set percentage, decrease the learning rate by multiplying it by a factor less than 1 (typically 0.7).
- If the error function has increased by less than that percentage, leave the learning rate unchanged.
Applying the variable learning rate to batch updating can significantly speed up convergence, but the algorithm can easily be trapped in a local minimum; a minimum learning rate $\eta_{\min}$ can be imposed.
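A sketch of the variable-learning-rate rule above. The 1.05 and 0.7 factors are the typical values quoted on the slide; the tolerance percentage and the learning-rate floor are illustrative assumptions:

```python
def adapt_learning_rate(lr, err, prev_err, up=1.05, down=0.7,
                        tol=0.04, lr_min=1e-5):
    """Adjust the batch-update learning rate based on the change in total error."""
    if err < prev_err:                      # error decreased: speed up
        lr *= up
    elif err > prev_err * (1.0 + tol):      # error grew beyond the tolerance: slow down
        lr *= down
    return max(lr, lr_min)                  # keep a floor on the learning rate
```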

Page 51: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Vector-matrix form of the BP algorithm:
According to Fig. 3.4, the energy function is defined as
$E_q = \dfrac{1}{2}\,\big(d_q - x^{(3)}_{out,q}\big)^T\big(d_q - x^{(3)}_{out,q}\big)$   (3.68)
Using the steepest descent approach, the synaptic weight update equation can be written as
$w^{(s)}_{ij}(k+1) = w^{(s)}_{ij}(k) - \eta_s\, \dfrac{\partial E_q}{\partial w^{(s)}_{ij}(k)}$   (3.69)

Page 52: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Applying the chain rule to (3.68),
$\dfrac{\partial E_q}{\partial w^{(s)}_{ij}} = \dfrac{\partial E_q}{\partial v^{(s)}_{i}}\, \dfrac{\partial v^{(s)}_{i}}{\partial w^{(s)}_{ij}}$   (3.70)
The second factor of (3.70) can be rewritten as
$\dfrac{\partial v^{(s)}_{i}}{\partial w^{(s)}_{ij}} = \dfrac{\partial}{\partial w^{(s)}_{ij}} \sum_{h=1}^{n_{s-1}} w^{(s)}_{ih}\, x^{(s-1)}_{out,h} = x^{(s-1)}_{out,j}$   (3.71)
The first factor of (3.70) is commonly called the sensitivity term:
$\delta^{(s)}_{i} = \dfrac{\partial E_q}{\partial v^{(s)}_{i}}$   (3.72)

Page 53: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Using (3.70) and (3.71), (3.69) can be rewritten as
$w^{(s)}_{ij}(k+1) = w^{(s)}_{ij}(k) - \eta_s\, \delta^{(s)}_{i}\, x^{(s-1)}_{out,j}$   (3.73)
or, in vector-matrix form,
$W^{(s)}(k+1) = W^{(s)}(k) - \eta_s\, D^{(s)}\,\big(x^{(s-1)}_{out}\big)^T$   (3.74)
where D^(s) represents the sensitivity vector of the s-th layer:
$D^{(s)} = \left[\dfrac{\partial E_q}{\partial v^{(s)}_{1}},\; \dfrac{\partial E_q}{\partial v^{(s)}_{2}},\; \ldots,\; \dfrac{\partial E_q}{\partial v^{(s)}_{n_s}}\right]^T$   (3.75)

Page 54: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Calculation of the sensitivities — the sensitivity vector of the output layer:
$\dfrac{\partial E_q}{\partial v^{(3)}_{i}} = \dfrac{\partial}{\partial v^{(3)}_{i}}\,\dfrac{1}{2}\,\big(d_q - x^{(3)}_{out}\big)^T\big(d_q - x^{(3)}_{out}\big) = -\big(d_{qi} - x^{(3)}_{out,i}\big)\, g(v^{(3)}_{i})$   (3.76)
where g(x) = df(x)/dx is the derivative of the activation function.

Page 55: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Substituting (3.76) into (3.75), the sensitivity vector of the output layer can be expressed as
$D^{(3)} = -\,G^{(3)}\big(v^{(3)}\big)\,\big(d_q - x^{(3)}_{out}\big)$   (3.77)
Using the definition (3.75) and the chain rule, the sensitivity vector of the second layer can be expressed as a function of the sensitivity vector of the output layer:
$D^{(2)} = \dfrac{\partial E_q}{\partial v^{(2)}} = \left(\dfrac{\partial v^{(3)}}{\partial v^{(2)}}\right)^{T} \dfrac{\partial E_q}{\partial v^{(3)}} = \left(\dfrac{\partial v^{(3)}}{\partial v^{(2)}}\right)^{T} D^{(3)}$   (3.78)

Page 56: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The sensitivity factor in (3.78) can be expressed as a Jacobian matrix:
$\dfrac{\partial v^{(3)}}{\partial v^{(2)}} = \begin{bmatrix} \dfrac{\partial v^{(3)}_{1}}{\partial v^{(2)}_{1}} & \cdots & \dfrac{\partial v^{(3)}_{1}}{\partial v^{(2)}_{n_2}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial v^{(3)}_{n_3}}{\partial v^{(2)}_{1}} & \cdots & \dfrac{\partial v^{(3)}_{n_3}}{\partial v^{(2)}_{n_2}} \end{bmatrix}$   (3.79)
The sensitivity vector of the first layer can likewise be expressed as
$D^{(1)} = \left(\dfrac{\partial v^{(2)}}{\partial v^{(1)}}\right)^{T} D^{(2)}$   (3.80)

Page 57: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The elements of the Jacobian matrix in (3.79) are
$\dfrac{\partial v^{(3)}_{i}}{\partial v^{(2)}_{j}} = \dfrac{\partial}{\partial v^{(2)}_{j}} \sum_{h=1}^{n_2} w^{(3)}_{ih}\, x^{(2)}_{out,h} = w^{(3)}_{ij}\, \dfrac{\partial f(v^{(2)}_{j})}{\partial v^{(2)}_{j}} = w^{(3)}_{ij}\, g(v^{(2)}_{j})$   (3.81)
Substituting (3.81) into (3.79), the Jacobian matrix can be rewritten as
$\dfrac{\partial v^{(3)}}{\partial v^{(2)}} = W^{(3)}\, G^{(2)}\big(v^{(2)}\big)$   (3.82)
where
$G^{(2)}\big(v^{(2)}\big) = \operatorname{diag}\!\big[g(v^{(2)}_{1}),\; g(v^{(2)}_{2}),\; \ldots,\; g(v^{(2)}_{n_2})\big]$

Page 58: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Finally, combining (3.82) with (3.78) gives
$D^{(2)} = G^{(2)}\big(v^{(2)}\big)\,\big(W^{(3)}\big)^T D^{(3)}$   (3.83)
Similarly, the sensitivity vector of the first layer can be expressed as
$D^{(1)} = G^{(1)}\big(v^{(1)}\big)\,\big(W^{(2)}\big)^T D^{(2)}$   (3.84)

Page 59: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Vector-matrix form of the BP algorithm:
Step 1: Present an input pattern and calculate the outputs of the network and of all the internal layers.
Step 2: For each of the layers, calculate the sensitivity vector according to
$D^{(s)} = -\,G^{(s)}\big(v^{(s)}\big)\,\big(d_q - x^{(s)}_{out}\big)$ for the output layer, and
$D^{(s)} = G^{(s)}\big(v^{(s)}\big)\,\big(W^{(s+1)}\big)^T D^{(s+1)}$ for all hidden layers.
Step 3: Update the synaptic weights of the network according to
$W^{(s)}(k+1) = W^{(s)}(k) - \eta_s\, D^{(s)}\,\big(x^{(s-1)}_{out}\big)^T$
Continue steps 1 through 3 until the network reaches the desired mapping accuracy.

Page 60: Chapter 3 Mapping Networks

Accelerated Learning BP Algorithm

Conjugate gradient BP for the feedforward multilayer perceptron:
The conjugate gradient method is a well-known numerical technique for solving various optimization problems; it can be viewed as a compromise between steepest descent and Newton's method.
Assume that in the vicinity of the solution the error function of all the weights in the network can be accurately approximated by a quadratic form:
$J(w) = \dfrac{1}{2}\sum_{p=1}^{P}\sum_{h=1}^{n_s}\big(d_{ph} - y_{ph}\big)^2 \approx \dfrac{1}{2}\, w^T Q\, w + b^T w$   (3.85)
where P is the number of training patterns, n_s is the number of output-layer neurons, and d_{ph} is the desired output of the h-th output neuron for the p-th training pattern.

Page 61: Chapter 3 Mapping Networks

Accelerated Learning BP Algorithm (cont.)

In (3.85), Q is the square matrix of second derivatives, the Hessian matrix.
The conjugate gradient algorithm attempts to find conjugate directions on the error surface and updates the weights along those directions.
Since computing the Hessian matrix Q is quite involved, most conjugate gradient algorithms try to find the conjugate directions without explicitly computing the Hessian.
One approach is to construct a set of normal equations for each neuron and then solve these normal equations iteratively with the conjugate gradient method.

Page 62: Chapter 3 Mapping Networks

Accelerated Learning BP Algorithm (cont.)

Normal equations for the linear combiner:
The output of the i-th linear combiner in the s-th layer of the MLP NN can be expressed as
$v^{(s)}_{i} = \big(w^{(s)}_{i}\big)^T x^{(s-1)}_{out,q}$   (3.86)
Assuming the desired output of each combiner is known, the goal of learning is to minimize the squared-error cost function
$J^{(s)}_{i} = \dfrac{1}{2}\sum_{q=1}^{M}\big(d^{(s)}_{i,q} - v^{(s)}_{i,q}\big)^2$   (3.87)
where M is the number of training patterns.

Page 63: Chapter 3 Mapping Networks

Accelerated Learning BP Algorithm (cont.)

Substituting (3.86) into (3.87) gives
$J^{(s)}_{i} = \dfrac{1}{2}\sum_{q=1}^{M}\Big(d^{(s)}_{i,q} - \big(w^{(s)}_{i}\big)^T x^{(s-1)}_{out,q}\Big)^2$   (3.88)
Differentiating J with respect to w and setting the result to zero,
$\dfrac{\partial J^{(s)}_{i}}{\partial w^{(s)}_{i}} = -\sum_{q=1}^{M} d^{(s)}_{i,q}\, x^{(s-1)}_{out,q} + \sum_{q=1}^{M} x^{(s-1)}_{out,q}\big(x^{(s-1)}_{out,q}\big)^T w^{(s)}_{i} = 0$   (3.89)
Define
$C^{(s)}_{i} = \sum_{q=1}^{M} x^{(s-1)}_{out,q}\big(x^{(s-1)}_{out,q}\big)^T$   (3.90)
$p^{(s)}_{i} = \sum_{q=1}^{M} d^{(s)}_{i,q}\, x^{(s-1)}_{out,q}$   (3.91)

Page 64: Chapter 3 Mapping Networks

Accelerated Learning BP Algorithm (cont.)

Equation (3.89) can then be rewritten as
$C^{(s)}_{i}\, w^{(s)}_{i} = p^{(s)}_{i}$   (3.92)
Here C_i^(s) can be regarded as an estimate of the covariance matrix of the input vectors to the s-th layer, and p_i^(s) as the cross-correlation vector between the inputs of the s-th layer and the desired output of the linear combiner.
Training the network can be viewed as a process that includes solving these normal equations, so that the difference between the desired and actual outputs becomes zero.
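A sketch of solving the per-neuron normal equations C w = p of (3.92) with a plain conjugate gradient iteration, instead of inverting C; the layer inputs and desired combiner outputs below are made-up data:

```python
import numpy as np

def conjugate_gradient(C, p, iters=50, tol=1e-10):
    """Solve C w = p iteratively (C symmetric positive semi-definite)."""
    w = np.zeros_like(p)
    r = p - C @ w                  # residual
    d = r.copy()                   # first search (conjugate) direction
    for _ in range(iters):
        alpha = (r @ r) / (d @ C @ d)
        w += alpha * d
        r_new = r - alpha * (C @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d       # next direction, C-conjugate to the previous ones
        r = r_new
    return w

X = np.random.default_rng(0).normal(size=(20, 5))    # layer inputs, one row per pattern
d_des = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])      # hypothetical desired combiner outputs
C = X.T @ X                                           # Eq. (3.90)
p = X.T @ d_des                                       # Eq. (3.91)
w = conjugate_gradient(C, p)
```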

Page 65: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

BP with adaptive slopes of the activation function:
- For standard BP, the rate of the update is proportional to the derivative of the nonlinear activation function.
- See Fig. 3.6: during training, a neuron's output can move into the saturation region of the activation function, where the derivative is very small, making learning very slow.
- Remedies: reduce the slope of the activation function to shrink the saturation region, or find an optimal activation-function slope that balances the speed of network training against its mapping capability.

Page 66: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

Figure: a typical MLP NN activation function (a) and the derivative of the activation function in (a).

Page 67: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The same performance criterion as in (3.40) is minimized:
$E_q = \dfrac{1}{2}\,\big(d_q - x^{(s)}_{out,q}\big)^T\big(d_q - x^{(s)}_{out,q}\big) = \dfrac{1}{2}\sum_{h=1}^{n_s}\big(d_{qh} - x^{(s)}_{out,qh}\big)^2$   (3.113)
where s denotes the output layer of the network, d is the desired output, and x is the actual output.
The sigmoid activation function with adjustable slope is
$f(v, \lambda) = \dfrac{1 - \exp(-\lambda v)}{1 + \exp(-\lambda v)}$   (3.114)
where v is the input and $\lambda$ is the slope.

Page 68: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

The slope update rule for the i-th neuron in layer s is
$\lambda^{(s)}_{i}(k+1) = \lambda^{(s)}_{i}(k) - \eta_\lambda\, \dfrac{\partial E_q}{\partial \lambda^{(s)}_{i}}$   (3.115)
Using the chain rule, the derivative in (3.115) can be rewritten as
$\dfrac{\partial E_q}{\partial \lambda^{(s)}_{i}} = \dfrac{\partial E_q}{\partial x^{(s)}_{out,i}}\, \dfrac{\partial x^{(s)}_{out,i}}{\partial \lambda^{(s)}_{i}} = \dfrac{\partial E_q}{\partial x^{(s)}_{out,i}}\, \dfrac{\partial f\big(v^{(s)}_{i}, \lambda^{(s)}_{i}\big)}{\partial \lambda^{(s)}_{i}}$   (3.116)

Page 69: Chapter 3 Mapping Networks

Backpropagation Learning Algorithm (cont.)

If the activation function of (3.114) is used, then
$\dfrac{\partial f}{\partial v} = \dfrac{\lambda}{2}\,\big(1 - f^2\big)$   (3.117)
$\dfrac{\partial f}{\partial \lambda} = \dfrac{v}{2}\,\big(1 - f^2\big)$   (3.118)
Substituting (3.116)-(3.118) into (3.115), the dynamic slope update equation of the activation function becomes
$\lambda^{(s)}_{i}(k+1) = \lambda^{(s)}_{i}(k) + \eta_\lambda\, \delta^{(s)}_{i}\, \dfrac{v^{(s)}_{i}}{\lambda^{(s)}_{i}}$   (3.119)

Page 70: Chapter 3 Mapping Networks

Backpropagation Algorithm with Adaptive Activation Function Slopes

Step 1: Initialize the weights in the network according to the standard initialization procedure.
Step 2: From the set of training input/output pairs, present the input pattern and calculate the network response.
Step 3: Compare the desired network response with the actual output of the network, and compute the local errors using
$\delta^{(s)}_{i} = \big(d_{qi} - x^{(s)}_{out,i}\big)\, g\big(v^{(s)}_{i}, \lambda^{(s)}_{i}\big)$ for the output layer, and
$\delta^{(s)}_{i} = g\big(v^{(s)}_{i}, \lambda^{(s)}_{i}\big) \sum_{h=1}^{n_{s+1}} \delta^{(s+1)}_{h}\, w^{(s+1)}_{hi}$ for the hidden layers.

Page 71: Chapter 3 Mapping Networks

Backpropagation Algorithm with Adaptive Activation Function Slopes (cont.)

Step 4: Update the weights of the network according to
$w^{(s)}_{ij}(k+1) = w^{(s)}_{ij}(k) + \eta_s\, \delta^{(s)}_{i}\, x^{(s-1)}_{out,j}$
Step 5: Update the slopes of the activation functions according to (3.119), with an additional momentum term proportional to $\lambda^{(s)}_{i}(k) - \lambda^{(s)}_{i}(k-1)$.
Step 6: Stop if the network has converged; else go back to Step 2.

Page 72: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm

The Levenberg-Marquardt BP (LMBP) algorithm represents a simplified version of Newton's method applied to the problem of training MLP NNs.
Summary of Newton's optimization algorithm — finding a vector w that minimizes a given energy function E(w):
Step 1: Initialize the components of the vector w to some random values.
Step 2: Update the vector w according to
$w(k+1) = w(k) - H_k^{-1}\, g_k$   (3.120)

Page 73: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

The matrix $H_k^{-1}$ represents the inverse of the Hessian matrix. The Hessian matrix is given by
$H = \begin{bmatrix} \dfrac{\partial^2 E(w)}{\partial w_1^2} & \dfrac{\partial^2 E(w)}{\partial w_1 \partial w_2} & \cdots & \dfrac{\partial^2 E(w)}{\partial w_1 \partial w_N} \\ \dfrac{\partial^2 E(w)}{\partial w_2 \partial w_1} & \dfrac{\partial^2 E(w)}{\partial w_2^2} & \cdots & \dfrac{\partial^2 E(w)}{\partial w_2 \partial w_N} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 E(w)}{\partial w_N \partial w_1} & \dfrac{\partial^2 E(w)}{\partial w_N \partial w_2} & \cdots & \dfrac{\partial^2 E(w)}{\partial w_N^2} \end{bmatrix}$   (3.121)
The vector $g_k$ represents the gradient of the energy function, computed as
$g = \nabla E(w) = \left[\dfrac{\partial E(w)}{\partial w_1},\; \dfrac{\partial E(w)}{\partial w_2},\; \ldots,\; \dfrac{\partial E(w)}{\partial w_N}\right]^T$   (3.122)

Page 74: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

Equation (3.120) has one serious problem: it requires the inverse of the Hessian matrix. The LMBP algorithm therefore provides a practical approximation.
For the MLP of Fig. 3.4, training can be viewed as finding a set of weights that minimizes the error between the desired and actual outputs over all patterns. For a finite number of patterns, the energy function can be written as
$E(w) = \dfrac{1}{2}\sum_{q=1}^{Q}\big(d_q - x^{(3)}_{out,q}\big)^T\big(d_q - x^{(3)}_{out,q}\big) = \dfrac{1}{2}\sum_{q=1}^{Q}\sum_{h=1}^{n_3}\big(d_{qh} - x^{(3)}_{out,qh}\big)^2$   (3.123)
where Q is the total number of training patterns, w is the vector of all weights, d_q is the desired output, and $x^{(3)}_{out,q}$ is the actual network output for the q-th training pattern.

Page 75: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

According to Newton's method, the optimal weights that minimize the energy function are obtained from
$w(k+1) = w(k) - H_k^{-1}\, g_k$   (3.124)
where
$H_k = \nabla^2 E(w)\big|_{w = w(k)}$   (3.125)
$g_k = \nabla E(w)\big|_{w = w(k)}$   (3.126)
Letting P = n_3 Q, (3.123) can be written as
$E(w) = \dfrac{1}{2}\sum_{p=1}^{P}\big(d_p - x^{(3)}_{out,p}\big)^2 = \dfrac{1}{2}\sum_{p=1}^{P} e_p^2$   (3.127)
where
$e_p = d_p - x^{(3)}_{out,p}$   (3.128)

Page 76: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

The gradient in (3.126) can be obtained as
$g = \nabla E(w) = \left[\sum_{p=1}^{P} e_p\,\dfrac{\partial e_p}{\partial w_1},\; \sum_{p=1}^{P} e_p\,\dfrac{\partial e_p}{\partial w_2},\; \ldots,\; \sum_{p=1}^{P} e_p\,\dfrac{\partial e_p}{\partial w_N}\right]^T = J^T e$   (3.129)
where J is the Jacobian matrix of the error vector with respect to the weights:
$J = \begin{bmatrix} \dfrac{\partial e_1}{\partial w_1} & \dfrac{\partial e_1}{\partial w_2} & \cdots & \dfrac{\partial e_1}{\partial w_N} \\ \dfrac{\partial e_2}{\partial w_1} & \dfrac{\partial e_2}{\partial w_2} & \cdots & \dfrac{\partial e_2}{\partial w_N} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial e_P}{\partial w_1} & \dfrac{\partial e_P}{\partial w_2} & \cdots & \dfrac{\partial e_P}{\partial w_N} \end{bmatrix}$   (3.130)

Page 77: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

The (k, j)-th element of the Hessian matrix can be expressed as
$\dfrac{\partial^2 E(w)}{\partial w_k\, \partial w_j} = \sum_{p=1}^{P}\left[\dfrac{\partial e_p}{\partial w_k}\,\dfrac{\partial e_p}{\partial w_j} + e_p\,\dfrac{\partial^2 e_p}{\partial w_k\, \partial w_j}\right]$   (3.131)
Using the Jacobian matrix form of (3.130), the Hessian matrix can be expressed as
$\nabla^2 E(w) = J^T J + S$   (3.132)
where S is a matrix of second derivatives:
$S = \sum_{p=1}^{P} e_p\, \nabla^2 e_p$   (3.133)
Near a minimum of the energy function the elements of S become very small, so the Hessian matrix can be approximated as
$H \approx J^T J$   (3.134)

Page 78: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

Substituting (3.129) and (3.134) into Newton's method (3.124) gives
$w(k+1) = w(k) - \big(J_k^T J_k\big)^{-1} J_k^T e_k$   (3.135)
Equation (3.135) can run into problems when inverting $H = J^T J$; therefore (3.134) is slightly modified as
$H = J^T J + \mu I$   (3.136)
where $\mu$ is a small value and I is the identity matrix. Substituting (3.136) into (3.135) gives
$w(k+1) = w(k) - \big(J_k^T J_k + \mu_k I\big)^{-1} J_k^T e_k$   (3.137)

Page 79: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

Equation (3.137) in fact interpolates between steepest descent and Newton's method:
- When the value of $\mu$ in (3.137) is very small, (3.137) approximates the Newton step of (3.135).
- When $\mu$ grows large, the second term inside the parentheses of (3.137) dominates, and (3.137) can be rewritten as
$w(k+1) \approx w(k) - \dfrac{1}{\mu_k}\, J_k^T e_k$   (3.138)
Defining $\eta_k = 1/\mu_k$ and using (3.129), (3.138) can be rewritten as
$w(k+1) = w(k) - \eta_k\, g_k$   (3.139)
which has the form of the steepest descent method.

Page 80: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

The main difficulty in implementing the LMBP algorithm lies in computing the Jacobian matrix J(w); each of its elements is
$J_{i,j} = \dfrac{\partial e_i}{\partial w_j}$   (3.140)
The derivative in (3.140) can be approximated by the finite difference
$J_{i,j} \approx \dfrac{\Delta e_i}{\Delta w_j}$   (3.141)
where $\Delta e_i$ is the change in the output error caused by a small change $\Delta w_j$ in the weight.

Page 81: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

Levenberg-Marquardt BP algorithm:
Step 1: Initialize the network weights to small random values. Set the learning rate parameter.
Step 2: Present an input pattern and calculate the output of the network.
Step 3: Use (3.141) to calculate the elements of the Jacobian matrix associated with the input/output pairs.
Step 4: When the last input/output pair has been presented, use (3.137) to update the weights.
Step 5: Stop if the network has converged; else go back to Step 2.
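A sketch of one Levenberg-Marquardt weight update per (3.137), assuming the stacked error vector e and Jacobian J over all training patterns have already been computed (for example, with the finite-difference approximation of (3.141)):

```python
import numpy as np

def lm_update(w, J, e, mu):
    """One damped Gauss-Newton (Levenberg-Marquardt) step, Eq. (3.137)."""
    H = J.T @ J + mu * np.eye(w.size)          # Eq. (3.136)
    return w - np.linalg.solve(H, J.T @ e)     # solve instead of forming an explicit inverse

# Typical usage (illustrative): retry with a larger mu if the error increases,
# and shrink mu after a successful step, as in the batch LMBP variant noted below.
```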

Page 82: Chapter 3 Mapping Networks

Levenberg-Marquardt Algorithm (cont.)

Additional remarks:
- The weight update approach presented here is a batch version of the LMBP algorithm.
- The learning rate parameter $\mu_k$ can be modified dynamically during the learning process.
- The Jacobian matrix used in the update equation need not be computed over the entire set of input/output training pairs.

Page 83: Chapter 3 Mapping Networks

Counterpropagation

Counterpropagation networks were proposed by Hecht-Nielsen (1987).
They function as a self-programming, near-optimal lookup table, providing a bidirectional mapping between input and output training patterns.
Counterpropagation networks converge faster than MLPs, but they require more neurons than an MLP.
There are two counterpropagation architectures:
- Forward-only counterpropagation networks
- Full counterpropagation networks

Page 84: Chapter 3 Mapping Networks

Counterpropagation (cont.)

Forward-only counterpropagation networks consist of an input layer, a hidden layer, and an output layer.
The two sets of weights are adjusted by different learning algorithms:
- Kohonen's self-organizing learning rule
- Grossberg's learning rule

Page 85: Chapter 3 Mapping Networks

Counterpropagation (cont.)

During training, the desired patterns must be provided; the Kohonen-layer and Grossberg-layer weights are trained separately.
First, the distance between the input vector and the weights connecting the input layer to the hidden layer is computed:
$z_j = \operatorname{dist}(x, w_j) = \|x - w_j\|_2^2 = \sum_{k=1}^{n}\big(x_k - w_{jk}\big)^2$   (3.142)
The hidden-layer outputs are then obtained by the winner-take-all condition (the neuron whose weights are closest to the input pattern outputs 1; all others output 0):
$z_i = \begin{cases} 1, & \text{if } i \text{ is the smallest index for which } z_i \le z_j \text{ for all } j \\ 0, & \text{otherwise} \end{cases}$   (3.143)

Page 86: Chapter 3 Mapping Networks

Counterpropagation (cont.)

Next, the weights are updated according to Kohonen's self-organizing learning rule (only the winning neuron is updated; the other neurons' weights are left unchanged, and the learning rate is decreased over time):
$w_j(k+1) = \big(1 - \alpha(k)\big)\, w_j(k) + \alpha(k)\, x$   (3.144)
The output-layer weights are updated according to Grossberg's learning rule (only the weights connected to the winning hidden-layer neuron are updated):
$u_{ji}(k+1) = u_{ji}(k) + \beta\,\big[y_j - u_{ji}(k)\big]\, z_i$   (3.145)

Page 87: Chapter 3 Mapping Networks

Counterpropagation (cont.)

Algorithm for training counterpropagation networks:
Step 1: Determine the number of classes N (the number of hidden-layer neurons) and assign the weights random initial values.
Step 2: Present an input pattern x and the corresponding output y.
Step 3: Compute the distances between the input pattern and the input weights according to (3.142).
Step 4: Compute the hidden-layer outputs according to (3.143).
Step 5: Update the Kohonen-layer weights according to (3.144).
Step 6: Update the Grossberg-layer weights according to (3.145).
Step 7: Stop if the network has converged; otherwise return to Step 2.
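A sketch of forward-only counterpropagation training following (3.142)-(3.145); the layer sizes, learning rates, and the exponential learning-rate decay are illustrative assumptions:

```python
import numpy as np

def train_counterprop(X, Y, n_hidden=10, alpha=0.7, beta=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(X.min(), X.max(), (n_hidden, X.shape[1]))   # Kohonen-layer weights
    U = np.zeros((Y.shape[1], n_hidden))                        # Grossberg-layer weights
    for epoch in range(epochs):
        a = alpha * np.exp(-epoch / 10.0)                       # decreasing learning rate
        for x, y in zip(X, Y):
            i = np.argmin(np.linalg.norm(W - x, axis=1))        # winner, Eqs. (3.142)-(3.143)
            W[i] = (1 - a) * W[i] + a * x                       # Kohonen update, Eq. (3.144)
            U[:, i] += beta * (y - U[:, i])                     # Grossberg update, Eq. (3.145)
    return W, U

def predict(W, U, x):
    i = np.argmin(np.linalg.norm(W - x, axis=1))
    return U[:, i]
```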

Page 88: Chapter 3 Mapping Networks

Counterpropagation (cont.)

Example 3.3
Design a neural network to approximate the function y = 1/(x+1) on the interval [0, 4].
A forward-only counterpropagation network with 20 neurons in the hidden layer is used.
The learning rate is set to 0.95 and decreased according to
$\alpha(k) = \alpha_0\, \exp(-k/10)$
The network is trained for 50 epochs.

Page 89: Chapter 3 Mapping Networks

Counterpropagation (cont.)

Full counterpropagation:
- Forward-only counterpropagation networks provide training and mapping in one direction only (an input produces the corresponding output).
- Full counterpropagation provides training and mapping in both directions (an input produces the corresponding output, and vice versa).
- Full counterpropagation networks have four sets of weights: the weights between the input layers (x and y) and the clustering layer are trained with Kohonen's method, and the weights between the clustering layer and the two output layers are trained with Grossberg's method.

Page 90: Chapter 3 Mapping Networks


Counterpropagation (cont.)

Page 91: Chapter 3 Mapping Networks

Algorithm for Training Full Counterpropagation Networks

Step 1: Determine the number of classes N (the number of hidden-layer neurons) and assign the weights random initial values. The initial values of W and V lie between the minimum and maximum values of the x inputs; the initial values of U and T lie between the minimum and maximum values of the y inputs.
Step 2: Compute the distance between each hidden neuron and the input pair (x, y):
$z_j = \sum_{k=1}^{n}\big(x_k - w_{jk}\big)^2 + \sum_{k=1}^{n}\big(y_k - u_{jk}\big)^2$
Step 3: Compute the hidden-layer (winner-take-all) outputs:
$z_i = \begin{cases} 1, & \text{if } i \text{ is the smallest index for which } z_i \le z_j \text{ for all } j \\ 0, & \text{otherwise} \end{cases}$
Step 4: Update the Kohonen-layer weights:
$w_i(k+1) = \big(1 - \alpha_x(k)\big)\, w_i(k) + \alpha_x(k)\, x$
$u_i(k+1) = \big(1 - \alpha_y(k)\big)\, u_i(k) + \alpha_y(k)\, y$
Step 5: Update the Grossberg-layer weights:
$v_{ji}(k+1) = v_{ji}(k) + \beta_x\,\big[x_j - v_{ji}(k)\big]\, z_i$
$t_{ji}(k+1) = t_{ji}(k) + \beta_y\,\big[y_j - t_{ji}(k)\big]\, z_i$
Step 6: Stop if the network has converged; otherwise return to Step 2.

Page 92: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks

Radial basis function (RBF) neural networks:
- They mimic the locally tuned responses of neurons in the cerebral cortex and have good mapping ability.
- They are trained to perform a nonlinear mapping between the input and output vector spaces.
- The RBF NN is constructed as a function-approximation network.
- An RBF NN consists of three layers: an input layer, a hidden layer (also called the nonlinear processing layer), and an output layer.

Page 93: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

The output of the RBF NN is defined as
$y_i = f_i(x) = \sum_{k=1}^{N} w_{ik}\, K\big(\|x - c_k\|_2\big), \qquad i = 1, 2, \ldots, m$   (3.146)
where N is the number of hidden-layer neurons, x is the input vector, the c_k are the RBF centers in the input space, and the w_{ik} are the output-layer weights.
The kernel K(·) can take many forms, for example:
- linear function: $K(x) = x$
- cubic approximation: $K(x) = x^3$
- thin-plate-spline function: $K(x) = x^2 \ln x$
- Gaussian function (the most commonly used): $K(x) = \exp(-x^2/\sigma^2)$
- multiquadratic function: $K(x) = (x^2 + \sigma^2)^{1/2}$
- inverse multiquadratic function: $K(x) = (x^2 + \sigma^2)^{-1/2}$

Page 94: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

In the activation functions above, $\sigma$ controls the width of the RBF and is called the spread parameter; the centers c_k are prescribed points that provide an appropriate sampling of the input vector space.

Page 95: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Each neuron in the hidden layer computes the Euclidean distance between the input x and its center c_k, passes this distance through the radial basis function, and uses the result as its output.
The hidden-layer outputs are then multiplied by the corresponding weights and summed to produce the output y_i.
Because there is no interaction among the output-layer neurons, the outputs y_1, ..., y_m can be regarded as several single-output networks that share the same hidden layer.

Page 96: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Training the RBF NN with fixed centers:
Two sets of parameters govern the mapping properties of the RBF NN: the weights w_{ik} between the hidden and output layers, and the centers c_k of the radial basis functions.
Broomhead suggested randomly selecting a subset of the input patterns as centers. Enough centers should be chosen according to the PDF of the training patterns, but it is difficult to quantify how many centers are adequate; one can therefore start with a relatively large number of centers and then systematically remove those whose elimination does not degrade the network's mapping ability.
Once the N centers are chosen, the network output can be expressed as
$\hat y(q) = \sum_{k=1}^{N} w_k\, \varphi\big(x(q), c_k\big), \qquad q = 1, 2, \ldots, Q$   (3.147)
where Q is the total number of input patterns.

Page 97: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Equation (3.147) can be rewritten in vector form as
$\begin{bmatrix} \hat y(1) \\ \hat y(2) \\ \vdots \\ \hat y(Q) \end{bmatrix} = \begin{bmatrix} \varphi\big(x(1), c_1\big) & \varphi\big(x(1), c_2\big) & \cdots & \varphi\big(x(1), c_N\big) \\ \varphi\big(x(2), c_1\big) & \varphi\big(x(2), c_2\big) & \cdots & \varphi\big(x(2), c_N\big) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi\big(x(Q), c_1\big) & \varphi\big(x(Q), c_2\big) & \cdots & \varphi\big(x(Q), c_N\big) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}$   (3.148)
or
$\hat y = \Phi\, w$   (3.149)
where $\hat y$ is the actual network output, w holds the output-layer weights, and entry (q, n) of $\Phi$ is the basis-function response computed from the Euclidean distance between the q-th input and the n-th center.

Page 98: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Because the centers c_k are fixed, the mapping performed by the hidden layer is also fixed, so training the network consists only of adjusting the output weights. The MSE between the actual and desired outputs is
$J(w) = \dfrac{1}{2}\sum_{q=1}^{Q}\big[y_d(q) - \hat y(q)\big]^2 = \dfrac{1}{2}\big(y_d - \hat y\big)^T\big(y_d - \hat y\big)$   (3.150)
Substituting (3.149) into (3.150),
$J(w) = \dfrac{1}{2}\big(y_d - \Phi w\big)^T\big(y_d - \Phi w\big) = \dfrac{1}{2}\Big(y_d^T y_d - 2\, w^T \Phi^T y_d + w^T \Phi^T \Phi\, w\Big)$   (3.151)

Page 99: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Minimizing (3.151) with respect to w:
$\dfrac{\partial J(w)}{\partial w} = 0$   (3.152)
$\Phi^T \Phi\, w - \Phi^T y_d = 0$   (3.153)
$w = \big(\Phi^T \Phi\big)^{-1} \Phi^T y_d = \Phi^{+}\, y_d$   (3.154)
where $\Phi^{+}$ is the pseudoinverse of $\Phi$. If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired network outputs becomes very small.
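A sketch of fixed-center RBF training following (3.147)-(3.154): build the Gaussian design matrix and solve for the output weights with the pseudoinverse. The centers, spread, and the target function (taken from Examples 3.2/3.4) are illustrative:

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Gaussian RBF responses phi(x(q), c_k) for all patterns and centers."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # ||x(q) - c_k||
    return np.exp(-(d ** 2) / sigma ** 2)

X = np.linspace(0, 4, 21).reshape(-1, 1)            # training inputs on [0, 4]
y_d = np.exp(-X[:, 0]) * np.sin(3 * X[:, 0])        # desired outputs
centers = X.copy()                                  # fixed centers = training points
Phi = rbf_design_matrix(X, centers, sigma=0.2)      # Eq. (3.148)
w = np.linalg.pinv(Phi) @ y_d                       # Eq. (3.154)
y_hat = Phi @ w                                     # fitted outputs, Eq. (3.149)
```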

Page 100: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

From (3.154) it can be seen that, with the network centers fixed, training yields a closed-form solution, which means that an RBF NN can be trained faster than with BP or counterpropagation.
If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired outputs is very small; in fact, if (3.154) is used, the error in (3.151) becomes zero.

Page 101: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Example 3.4
An RBF NN is trained to approximate the nonlinear function
$y = e^{-x}\sin(3x)$
on the interval [0, 4], with 21 neurons in the hidden layer.
Fig. (a) — training: sampled every 0.2, giving 21 points (the centers are these 21 samples; Gaussian RBF with spread $\sigma = 0.2$).
Fig. (b) — testing: sampled every 0.01, giving 401 points.
There is slight over-fitting, but the result is already much better than with BP.

Page 102: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Setting the spread parameter: the spread parameter is usually obtained from the heuristic
$\sigma = \dfrac{d_{\max}}{\sqrt{2K}}$   (3.155)
where $d_{\max}$ is the maximum Euclidean distance between the selected centers and K is the number of centers.
Using (3.155), the radial basis function of the hidden-layer neurons can be expressed as
$\varphi\big(x, c_k\big) = \exp\!\left(-\dfrac{K}{d_{\max}^2}\,\|x - c_k\|^2\right)$   (3.156)

Page 103: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

RBF NN training algorithm with fixed centers:
Step 1: Choose the centers for the RBF functions.
Step 2: Calculate the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network.
Step 4: Calculate the output of the neural network according to (3.149).
Step 5: Solve for the network weights using (3.154).

Page 104: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Training the RBF NN using the stochastic gradient approach:
- With fixed centers, only the output-layer weights of the RBF NN are adjustable, so training is simple; however, it is not easy to find an appropriate number of centers from the input vectors. A large number of centers must therefore be selected from the input data, so that the network architecture remains quite complex even for relatively simple problems.
- The stochastic gradient approach allows all three sets of RBF NN parameters to be adjusted: the weights, the center locations, and the widths of the RBFs.

Page 105: Chapter 3 Mapping Networks

Define the instantaneous error cost function
$J(n) = \dfrac{1}{2}\,e^2(n) = \dfrac{1}{2}\Big[y_d(n) - \sum_{k=1}^{N} w_k(n)\, \varphi\big(x(n), c_k(n)\big)\Big]^2$   (3.157)
If a Gaussian RBF is chosen, (3.157) becomes
$J(n) = \dfrac{1}{2}\Big[y_d(n) - \sum_{k=1}^{N} w_k(n)\, \exp\!\Big(-\dfrac{\|x(n) - c_k(n)\|^2}{\sigma_k^2(n)}\Big)\Big]^2$   (3.158)
The network parameters are then updated as follows (using steepest descent, J(n) is differentiated with respect to w, c, and $\sigma$ in turn):
$w(n+1) = w(n) - \eta_w\, \dfrac{\partial J(n)}{\partial w(n)}$   (3.159)
$c_k(n+1) = c_k(n) - \eta_c\, \dfrac{\partial J(n)}{\partial c_k(n)}$   (3.160)
$\sigma_k(n+1) = \sigma_k(n) - \eta_\sigma\, \dfrac{\partial J(n)}{\partial \sigma_k(n)}$   (3.161)

Page 106: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

RBF NN training algorithm with the stochastic-gradient-based method:
Step 1: Choose the centers for the RBF functions randomly from the input vectors.
Step 2: Calculate the initial value of the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network to small random values.
Step 4: Present an input vector and compute the network output according to
$y(n) = \sum_{k=1}^{N} w_k\, \varphi\big(x(n), c_k\big)$

Page 107: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Step 5: Update the network parameters:
$w(n+1) = w(n) + \eta_w\, e(n)\, \varphi(n)$
$c_k(n+1) = c_k(n) + \eta_c\, e(n)\, w_k(n)\, \varphi\big(x(n), c_k(n)\big)\, \dfrac{x(n) - c_k(n)}{\sigma_k^2(n)}$
$\sigma_k(n+1) = \sigma_k(n) + \eta_\sigma\, e(n)\, w_k(n)\, \varphi\big(x(n), c_k(n)\big)\, \dfrac{\|x(n) - c_k(n)\|^2}{\sigma_k^3(n)}$
where
$\varphi(n) = \big[\varphi\big(x(n), c_1\big),\; \varphi\big(x(n), c_2\big),\; \ldots,\; \varphi\big(x(n), c_N\big)\big]^T$
$e(n) = y_d(n) - y(n)$
(These update equations differ slightly from the textbook.)
Step 6: Stop if the network has converged; else go back to Step 4.
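A sketch of one stochastic-gradient update (Steps 4-5 above) in which the weights, centers, and spreads are all adapted; the learning rates are illustrative assumptions:

```python
import numpy as np

def rbf_sgd_step(x, y_d, w, C, sig, lr_w=0.05, lr_c=0.02, lr_s=0.02):
    """One sample update of a Gaussian RBF network: returns new (w, C, sig)."""
    diff = x - C                                   # (N, dim): x(n) - c_k(n)
    dist2 = np.sum(diff ** 2, axis=1)
    phi = np.exp(-dist2 / sig ** 2)                # Gaussian RBF responses
    e = y_d - w @ phi                              # output error e(n)
    w_new = w + lr_w * e * phi                     # weight update
    C_new = C + lr_c * e * (w * phi / sig ** 2)[:, None] * diff   # center update
    sig_new = sig + lr_s * e * w * phi * dist2 / sig ** 3         # spread update
    return w_new, C_new, sig_new
```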

Page 108: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Properties of the stochastic-gradient-based method:
- With respect to the output-layer weights the instantaneous error cost function is convex, but with respect to the centers and spread parameters of the RBF NN it is in general not convex, so the training algorithm has a tendency to become stuck in local minima.
- In general, the learning rate parameters $\eta_w$, $\eta_c$, $\eta_\sigma$ are set to different values.
- The learning rules are still less complex than those of BP, because the RBF NN adjusts weights only in the output layer.

Page 109: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Orthogonal least squares:
- The main challenge in designing an RBF NN is the selection of the centers; choosing them randomly or by the stochastic gradient method usually results in a rather large network.
- Orthogonal least squares (OLS) provides a systematic center-selection procedure that can greatly reduce the size of the RBF NN:
  - Treat all training examples as potential centers.
  - Select new centers one at a time from the potential centers (choosing the one that most reduces the network output error).
  - Repeat, at each step picking the input example giving the smallest output error as the next center, until the OLS stopping criterion for center selection is satisfied.
- Gram-Schmidt orthogonalization decomposes a matrix M into the product of two matrices (a set of vectors M is transformed into a set of orthogonal basis vectors W that span the same space as M):
$M = W A$   (3.162)

Page 110: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Here A is an upper triangular matrix with unit diagonal,
$A = \begin{bmatrix} 1 & \alpha_{12} & \cdots & \alpha_{1m} \\ 0 & 1 & \cdots & \alpha_{2m} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$   (3.163)
and W consists of m mutually orthogonal vectors, so that
$W^T W = H = \operatorname{diag}\big(h_1, h_2, \ldots, h_m\big)$   (3.164)
Orthogonal decomposition theorem: any vector can be decomposed uniquely, with respect to a subspace Y, into two mutually orthogonal parts — one part parallel to the subspace Y (the projection of m onto Y) and the other perpendicular to it:
$m = m_{\parallel} + e, \qquad e \perp Y$   (3.165)

Page 111: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Gram-Schmidt orthogonalization algorithm:
Step 1: Set the first basis vector equal to the first column of the matrix M:
$w_1 = m_1$
Step k: Extract the k-th basis vector so that it is orthogonal to the previous k-1 vectors:
$\alpha_{ik} = \dfrac{w_i^T m_k}{w_i^T w_i}, \qquad 1 \le i < k$
$w_k = m_k - \sum_{i=1}^{k-1} \alpha_{ik}\, w_i$
Repeat step k until k = m.

Page 112: Chapter 3 Mapping Networks

Orthogonal Decomposition Theorem

Figure: geometric illustration of Gram-Schmidt orthogonalization — $w_1 = m_1$; $w_2$ is $m_2$ with its projection onto $w_1$ removed; $w_3$ is $m_3$ with its projections onto $w_1$ and $w_2$ removed.

Page 113: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Orthogonal least-squares regression:
The RBF NN can be viewed as a regression model. Once suitable centers are selected and the RBF NN is constructed, substituting all training patterns gives
$\begin{bmatrix} y_d(1) \\ y_d(2) \\ \vdots \\ y_d(Q) \end{bmatrix} = \begin{bmatrix} \varphi\big(x_1, c_1, \sigma_1\big) & \cdots & \varphi\big(x_1, c_N, \sigma_N\big) \\ \varphi\big(x_2, c_1, \sigma_1\big) & \cdots & \varphi\big(x_2, c_N, \sigma_N\big) \\ \vdots & \ddots & \vdots \\ \varphi\big(x_Q, c_1, \sigma_1\big) & \cdots & \varphi\big(x_Q, c_N, \sigma_N\big) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_Q \end{bmatrix}$   (3.166)
or
$y_d = \Phi\, w + e$   (3.167)
where e collects the errors between the desired and actual network outputs.
There are Q potential centers in total; if all Q centers are used, an error-free mapping is obtained but the network becomes too large. OLS is therefore used to select N centers with N < Q.

Page 114: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

At each stage of the OLS regression, the input that produces the largest increase in the explained variance of the desired output is selected as a new center.
Assuming N < Q centers are selected, the least-squares solution yielding the weights is
$w = \Phi^{+}\, y_d$   (3.168)
and the actual output of the network is
$\hat y = \Phi\, w$   (3.169)
By using Gram-Schmidt orthogonalization, the regression matrix can be decomposed as
$\Phi = B\, A = \big[b_1,\, b_2,\, \ldots,\, b_N\big] \begin{bmatrix} 1 & a_{12} & \cdots & a_{1N} \\ 0 & 1 & \cdots & a_{2N} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$   (3.170)

Page 115: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

In (3.170), A is an upper triangular matrix with unit diagonal, and B consists of mutually orthogonal columns, so that
$B^T B = H = \operatorname{diag}\big(h_1, h_2, \ldots, h_N\big)$   (3.171)
where H is a diagonal matrix whose elements are
$h_i = b_i^T b_i = \sum_{k=1}^{Q} b_i^2(k)$   (3.172)
Substituting (3.170) into (3.167) gives
$y_d = B\, A\, w + e = B\, g + e$   (3.173)

Page 116: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

A least-squares solution for the coordinate vector g is given by
$g = \big(B^T B\big)^{-1} B^T y_d = H^{-1} B^T y_d$   (3.174)
The i-th coordinate of the vector g is
$g_i = \dfrac{b_i^T y_d}{b_i^T b_i}$   (3.175)
The sum of squares of the desired output values y_d is
$y_d^T y_d = g^T B^T B\, g + e^T e = \sum_{i=1}^{N} g_i^2\, h_i + e^T e$   (3.176)

Page 117: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Define the error reduction ratio due to the inclusion of the p-th RBF center as
$[\mathrm{err}]_p = \dfrac{g_p^2\, h_p}{y_d^T y_d}$   (3.177)
The center-selection procedure picks, one at a time from the training input points, the point with the largest error reduction ratio $[\mathrm{err}]_{p,\max}$ as the next center, accumulating $[\mathrm{err}]_{p,\max}$ until the specified accuracy is reached.
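A compact sketch of the OLS forward selection described in (3.170)-(3.177) and in the step-wise procedure that follows: at each step the candidate column of $\Phi$ whose orthogonalized version gives the largest error reduction ratio is added, until the accumulated ratio satisfies the tolerance of Step k+1. The function and variable names are illustrative:

```python
import numpy as np

def ols_select_centers(Phi, y_d, rho=0.05, max_centers=None):
    """Return the indices of the columns of Phi chosen as RBF centers."""
    Q, M = Phi.shape
    max_centers = max_centers or M
    selected, basis, total_err = [], [], 0.0
    yy = y_d @ y_d
    for _ in range(max_centers):
        best = (None, 0.0, None)
        for j in range(M):
            if j in selected:
                continue
            b = Phi[:, j].astype(float).copy()
            for w in basis:                               # orthogonalize against chosen columns
                b -= (w @ Phi[:, j]) / (w @ w) * w
            h = b @ b
            if h < 1e-12:
                continue
            err_j = (b @ y_d) ** 2 / (h * yy)             # error reduction ratio, Eq. (3.177)
            if err_j > best[1]:
                best = (j, err_j, b)
        if best[0] is None:
            break
        selected.append(best[0])
        basis.append(best[2])
        total_err += best[1]
        if 1.0 - total_err < rho:                         # stopping criterion of Step k+1
            break
    return selected
```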

Page 118: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Orthogonal least-squares algorithm for training an RBF network:
Step 1: Set k = 1. For 1 <= i <= Q, set
$b_1^{(i)} = \varphi_i$
and compute the error reduction ratio of the i-th candidate center as
$[\mathrm{err}]_1^{(i)} = \dfrac{\big(b_1^{(i)T} y_d\big)^2}{\big(b_1^{(i)T} b_1^{(i)}\big)\big(y_d^T y_d\big)}$
Find
$[\mathrm{err}]_1^{(i_1)} = \max\big\{[\mathrm{err}]_1^{(i)},\; 1 \le i \le Q\big\}$
and set
$b_1 = b_1^{(i_1)} = \varphi_{i_1}$
with the center $c_1 = c_{i_1}$.

Page 119: Chapter 3 Mapping Networks

Step k: For 1 <= i <= Q, i not in {i_1, i_2, ..., i_{k-1}}, compute
$a_{jk}^{(i)} = \dfrac{b_j^T \varphi_i}{b_j^T b_j}, \qquad 1 \le j < k$
Set
$b_k^{(i)} = \varphi_i - \sum_{j=1}^{k-1} a_{jk}^{(i)}\, b_j$
Compute
$[\mathrm{err}]_k^{(i)} = \dfrac{\big(b_k^{(i)T} y_d\big)^2}{\big(b_k^{(i)T} b_k^{(i)}\big)\big(y_d^T y_d\big)}$
Find
$[\mathrm{err}]_k^{(i_k)} = \max\big\{[\mathrm{err}]_k^{(i)},\; 1 \le i \le Q,\; i \ne i_1, i_2, \ldots, i_{k-1}\big\}$
and select
$b_k = b_k^{(i_k)}$
with the center $c_k = c_{i_k}$.

Page 120: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Step k+1: Repeat Step k. The regression is stopped at step $N_1$ when
$1 - \sum_{j=1}^{N_1} [\mathrm{err}]_j < \rho$
where $0 < \rho < 1$ is a selected tolerance value.

Page 121: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Each newly added vector maximizes the amount of the desired network output that the regression explains; each vector included in the regression matrix corresponds to one center.
Vectors stop being added once the regression has captured enough of the energy of the desired network output, so OLS finally produces a relatively small network.
The tolerance $\rho$ has a major influence on the balance between the accuracy and the complexity of the network: if it is chosen so that too many centers are accepted, the network is very accurate on the training data but over-fits; if it admits too few centers, accuracy drops but the network is more compact.
In general, $\rho$ is set according to the ratio of the noise variance to the variance of the desired output, $\sigma_n^2/\sigma_d^2$.

Page 122: Chapter 3 Mapping Networks

Radial Basis Function Neural Networks (cont.)

Example 3.5
Design an RBF NN that approximates the mapping z = f(x, y) = cos(3x) sin(2y), for -1 <= x <= 1 and -1 <= y <= 1.
- 121 candidate centers are used, with spread parameter 0.3.
- The goal is to train the network to map these 121 points.
- After OLS center selection, only 28 centers remain, a reduction of about 77%.

Page 123: Chapter 3 Mapping Networks


Radial Basis Function Neural Networks (cont.)