Chapter 3 Mapping Networks
National Yunlin University of Science and Technology, Graduate School of Computer Science and Information Engineering
Chuan-Yu Chang (張傳育), Ph.D. Office: ES 709
TEL: 05-5342601 ext. 4337
E-mail: [email protected]
Chuan-Yu Chang Ph.D.
Introduction
Mapping Networks:
- Associative memory networks
- Feed-forward multilayer perceptrons
- Counter-propagation networks
- Radial basis function networks

[Figure: a mapping network realizes y = ANN(x), associating input vectors x with output vectors y.]
Associative Memory Networks
Memory capability: for an information system, it is important both to remember information and to reduce the amount of information that must be stored.
The stored information must be placed appropriately in the network's memory; that is, given a key input (stimulus), the associative memory retrieves a memorized pattern and produces it as the output in an appropriate form.
Associative Memory Networks (cont.)
In neurobiological systems, the concept of memory is related to the neural changes that result from the interaction between an organism and its environment. Without such changes, no memory exists.
For a memory to be useful it must be accessible to the nervous system, so that learning occurs and information can be retrieved.
A pattern is stored in memory through a learning process.
Associative Memory Networks (cont.)
Depending on the retention time, there are two forms of memory:
- Long-term memory
- Short-term memory
There are two basic types of associative memory:
- Autoassociative memory: a key input vector is associated with itself; the input and output space dimensions are the same.
- Heteroassociative memory: key input vectors are associated with arbitrary memorized vectors; the output space dimension may differ from the input space dimension.
General Linear Distributed Associative Memory
During the learning process, the network is presented with a key input pattern (vector); the memory then transforms this vector into a stored (memorized) pattern.
General Linear Distributed Associative Memory (cont.)
Input (key input pattern):

x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T   (3.1)

Output (memorized pattern):

y_k = [y_{k1}, y_{k2}, \ldots, y_{kn}]^T   (3.2)

For an n-dimensional network structure, h patterns can be associated, where h <= n; in practice h < n.
General Linear Distributed Associative Memory (cont.)
The linear mapping between the key vector x_k and the memorized vector y_k can be written as

y_k = W(k)\,x_k   (3.3)

where W(k) is a weight matrix. From these weight matrices, the memory matrix M can be constructed; it describes the sum of the weight matrices for every input/output pair:

M = \sum_{k=1}^{h} W(k)   (3.4)

or, recursively,

M_k = M_{k-1} + W(k),  k = 1, 2, \ldots, h,  with M_0 = 0   (3.5)

The memory matrix can be thought of as representing the collective experience.
Correlation Matrix Memory
Estimation of the memory matrix by the outer product rule:

\hat{M} = \sum_{k=1}^{h} y_k x_k^T   (3.6)

or, recursively,

\hat{M}_k = \hat{M}_{k-1} + y_k x_k^T,  k = 1, 2, \ldots, h   (3.7)

Each outer product matrix y_k x_k^T is an estimate of the weight matrix W(k) that maps the key input pattern x_k onto the output pattern y_k.
An associative memory designed with the outer product rule is called a correlation matrix memory.
Correlation Matrix Memory (cont.)
When a key pattern is presented, the memory matrix should recall the appropriate memorized pattern. Suppose the estimated memory matrix \hat{M} has learned all h associations via (3.6). Presenting the q-th key input x_q, the recalled output is

y = \hat{M} x_q = \sum_{k=1}^{h} y_k x_k^T x_q = \sum_{k=1}^{h} (x_k^T x_q)\, y_k   (3.8)

Substituting (3.6) into (3.8), this can be rewritten as

y = (x_q^T x_q)\, y_q + \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k   (3.9)

Assume each key input vector has normalized unit length:

x_k^T x_k = 1,  k = 1, 2, \ldots, h   (3.10)
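The outer product rule (3.6) and the recall step (3.8) can be sketched directly in numpy; the function names here are illustrative, not from the text:

```python
import numpy as np

def build_memory(keys, patterns):
    """Correlation matrix memory via the outer product rule (3.6):
    M_hat = sum_k y_k x_k^T, with each key normalized to unit length (3.10)."""
    M = np.zeros((patterns[0].size, keys[0].size))
    for x, y in zip(keys, patterns):
        x = x / np.linalg.norm(x)      # unit-length key, eq. (3.11)
        M += np.outer(y, x)            # outer product rule, eq. (3.6)
    return M

def recall(M, x):
    """Recall a memorized pattern from a key input, eq. (3.8)."""
    return M @ (x / np.linalg.norm(x))

# With mutually orthogonal keys the crosstalk term (3.13) vanishes,
# so recall is exact:
keys = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
pats = [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]
M = build_memory(keys, pats)
print(recall(M, keys[0]))   # → [1. 2.]
```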
Correlation Matrix Memory (cont.)
The normalized unit length can be obtained by dividing each key by its Euclidean norm:

x_k \leftarrow \frac{x_k}{\|x_k\|_2}   (3.11)

Therefore, (3.9) can be rewritten as

y = y_q + \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k = y_q + z_q   (3.12)

where y_q is the desired response and

z_q = \sum_{k=1, k \neq q}^{h} (x_k^T x_q)\, y_k   (3.13)

is the noise, or crosstalk, term.
Correlation Matrix Memory (cont.)
From (3.13), when the network is presented with an associated key pattern, y = y_q and z_q = 0; that is, the input x_q recalls its corresponding output y_q exactly. If z_q is nonzero, noise or crosstalk is present.
If the key inputs are mutually orthogonal, the crosstalk is zero and the memorized pattern is recalled perfectly.
If the key inputs are merely linearly independent, Gram-Schmidt orthogonalization can be applied before computing the memory matrix to produce orthogonal key vectors:

g_k = x_k - \sum_{i=1}^{k-1} \frac{g_i^T x_k}{g_i^T g_i}\, g_i,  with g_1 = x_1   (3.14)
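Equation (3.14) can be sketched as follows (a minimal implementation; names are illustrative):

```python
import numpy as np

def gram_schmidt(keys):
    """Gram-Schmidt orthogonalization, eq. (3.14):
    g_k = x_k - sum_i (g_i^T x_k / g_i^T g_i) g_i, with g_1 = x_1."""
    ortho = []
    for x in keys:
        g = x.astype(float).copy()
        for gi in ortho:
            g -= (gi @ x) / (gi @ gi) * gi   # remove the component along g_i
        ortho.append(g)
    return ortho

# Linearly independent but non-orthogonal keys become orthogonal:
g = gram_schmidt([np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])])
print(g[0] @ g[1])   # → 0.0 (up to round-off)
```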
Correlation Matrix Memory (cont.)
The storage capacity of the associative memory is

c = \operatorname{rank}(\hat{M}) \le n

that is, the storage limit depends on the rank of the memory matrix.
Correlation Matrix Memory (cont.)
Example 3.1: Autoassociative memory. The memory is trained with three key vectors:

x_1 = [0.3333, -0.7778, 0.5329]^T,  x_2 = [-0.4444, 0.5556, 0.7027]^T,  x_3 = [-0.4969, -0.6667, 0.5556]^T   (3.15)

Each of these vectors has unit length.
Correlation Matrix Memory (cont.)
The angles between these vectors are

\theta_{12} = \cos^{-1}\!\left(\frac{x_1^T x_2}{\|x_1\|_2 \|x_2\|_2}\right) = 101.9°   (3.16)

\theta_{13} = \cos^{-1}\!\left(\frac{x_1^T x_3}{\|x_1\|_2 \|x_3\|_2}\right) = 49.5°   (3.17)

\theta_{23} = \cos^{-1}\!\left(\frac{x_2^T x_3}{\|x_2\|_2 \|x_3\|_2}\right) = 76.1°   (3.18)

These three vectors are far from being mutually orthogonal.
Correlation Matrix Memory (cont.)
The memory matrix of the autoassociative network is

\hat{M} = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T =
[  0.5555  -0.1749  -0.4107
  -0.1749   1.3582  -0.3945
  -0.4107  -0.3945   1.0865 ]   (3.19)

Using (3.19), the estimates of the input key patterns are

\hat{x}_1 = \hat{M} x_1 = [0.1023, -1.3249, 0.7489]^T,
\hat{x}_2 = \hat{M} x_2 = [-0.6326, 0.5551, 0.7268]^T,
\hat{x}_3 = \hat{M} x_3 = [-0.3876, -1.0378, 1.0707]^T   (3.20)
Correlation Matrix Memory (cont.)
The Euclidean distances of each response vector from the key vectors are

\|\hat{x}_1 - x_1\|_2 = 0.6319,  \|\hat{x}_1 - x_2\|_2 = 1.9589,  \|\hat{x}_1 - x_3\|_2 = 0.9108   (3.21)
\|\hat{x}_2 - x_1\|_2 = 1.6575,  \|\hat{x}_2 - x_2\|_2 = 0.1898,  \|\hat{x}_2 - x_3\|_2 = 1.2412   (3.22)
\|\hat{x}_3 - x_1\|_2 = 0.9363,  \|\hat{x}_3 - x_2\|_2 = 1.6363,  \|\hat{x}_3 - x_3\|_2 = 0.6442   (3.23)

In each case the response has the smallest error with respect to its own key vector.
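The example can be reproduced numerically; this sketch assumes the signed key-vector values consistent with the angles and distances above:

```python
import numpy as np

# Key vectors of Example 3.1 (unit length, far from orthogonal)
x1 = np.array([0.3333, -0.7778, 0.5329])
x2 = np.array([-0.4444, 0.5556, 0.7027])
x3 = np.array([-0.4969, -0.6667, 0.5556])

# Autoassociative correlation matrix memory, eq. (3.19)
M = np.outer(x1, x1) + np.outer(x2, x2) + np.outer(x3, x3)

# Recall each key and measure the distance to every stored key, eqs. (3.20)-(3.23)
for x in (x1, x2, x3):
    xhat = M @ x
    dists = [np.linalg.norm(xhat - xk) for xk in (x1, x2, x3)]
    print([round(d, 4) for d in dists])
# Each recalled vector lies closest to its own key, but the
# crosstalk term (3.13) keeps the recall errors from being zero.
```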
Correlation Matrix Memory (cont.)
Consider another set of key vectors, each of unit length:

x_1 = [0.1309, 0.9779, -0.1629]^T,  x_2 = [-0.7548, -0.0587, -0.6533]^T,  x_3 = [-0.6354, 0.2370, 0.7349]^T   (3.24)

The angles between these vectors are

\theta_{12} = \cos^{-1}\!\left(\frac{x_1^T x_2}{\|x_1\|_2 \|x_2\|_2}\right) = 92.9°   (3.25)

\theta_{13} = \cos^{-1}\!\left(\frac{x_1^T x_3}{\|x_1\|_2 \|x_3\|_2}\right) = 88.3°   (3.26)

\theta_{23} = \cos^{-1}\!\left(\frac{x_2^T x_3}{\|x_2\|_2 \|x_3\|_2}\right) = 90.8°   (3.27)

These vectors are nearly mutually orthogonal.
Correlation Matrix Memory (cont.)
The memory matrix is

\hat{M} = x_1 x_1^T + x_2 x_2^T + x_3 x_3^T =
[ 0.9906  0.0217  0.0048
  0.0217  1.0159  0.0532
  0.0048  0.0532  0.9934 ]   (3.28)

and the estimates of the input key patterns are

\hat{x}_1 = \hat{M} x_1 = [0.1501, 0.9876, -0.1092]^T,
\hat{x}_2 = \hat{M} x_2 = [-0.7521, -0.1108, -0.6558]^T,
\hat{x}_3 = \hat{M} x_3 = [-0.6207, 0.2661, 0.7396]^T   (3.29)
Correlation Matrix Memory (cont.)
The Euclidean distances of each response vector from the key vectors are

\|\hat{x}_1 - x_1\|_2 = 0.0579,  \|\hat{x}_1 - x_2\|_2 = 1.4865,  \|\hat{x}_1 - x_3\|_2 = 1.3758   (3.30)
\|\hat{x}_2 - x_1\|_2 = 1.4859,  \|\hat{x}_2 - x_2\|_2 = 0.0522,  \|\hat{x}_2 - x_3\|_2 = 1.4382   (3.31)
\|\hat{x}_3 - x_1\|_2 = 1.3734,  \|\hat{x}_3 - x_2\|_2 = 1.4365,  \|\hat{x}_3 - x_3\|_2 = 0.0329   (3.32)

Compared with (3.21)-(3.23), the errors are much lower, because the keys are nearly orthogonal.
Correlation Matrix Memory (cont.)
An error correction approach for correlation matrix memories:
The drawback of associative memory is the relatively large number of errors that can occur during recall.
The simplest correlation matrix memory has no provision for correcting errors, because it lacks feedback from the output to the input.
Correlation Matrix Memory (cont.)
The objective of the error correction approach is to have the associative memory reconstruct the memorized pattern in an optimal sense.
The error vector is defined as

e_k = y_k - \hat{M}(k)\,x_k   (3.33)

where y_k is the desired pattern associated with the key input pattern x_k. The discrete-time learning rule based on steepest descent is

\hat{M}(k+1) = \hat{M}(k) - \mu\,\frac{\partial \mathcal{E}(\hat{M})}{\partial \hat{M}}   (3.34)
Correlation Matrix Memory (cont.)
The energy function is defined as

\mathcal{E}(\hat{M}) = \frac{1}{2}\|e_k\|_2^2 = \frac{1}{2}\|y_k - \hat{M}x_k\|_2^2   (3.35)

Computing the gradient of (3.35) gives

\frac{\partial \mathcal{E}(\hat{M})}{\partial \hat{M}} = -(y_k - \hat{M}x_k)\,x_k^T   (3.36)

Substituting (3.36) into (3.34),

\hat{M}(k+1) = \hat{M}(k) + \mu\,[y_k - \hat{M}(k)x_k]\,x_k^T   (3.37)

where the bracketed quantity is the error vector.
Correlation Matrix Memory (cont.)
The bracketed quantity in the second term of (3.37) is the error vector (cf. (3.33)), so the rule has an error-correcting effect.
Expanding (3.37) gives

\hat{M}(k+1) = \hat{M}(k) + \mu\, y_k x_k^T - \mu\,\hat{M}(k)\,x_k x_k^T   (3.38)

Compared with (3.7), there is an additional third term. The supervised learning rule (3.37), based on error correction, is obtained by repeatedly training on each of the h associations

(x_k, y_k),  k = 1, 2, \ldots, h   (3.39)
Correlation Matrix Memory (cont.)
Selecting the learning rate parameter: either a fixed learning rate or a learning rate adjusted over time can be used.
For each association, the iterative adjustments to the memory matrix (3.37) continue until the error vector e_k (3.33) becomes negligibly small.
The initial memory matrix is M(0) = 0. Comparing with (2.29), (3.37) is also an LMS (least-mean-square) rule.
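The error-correcting rule (3.37) can be sketched as a small training loop; the learning rate and epoch count below are illustrative choices:

```python
import numpy as np

def train_error_correction(keys, patterns, mu=0.3, epochs=200):
    """Error-correcting associative memory, eq. (3.37):
    M(k+1) = M(k) + mu * (y_k - M(k) x_k) x_k^T, starting from M(0) = 0."""
    n_out, n_in = patterns[0].size, keys[0].size
    M = np.zeros((n_out, n_in))
    for _ in range(epochs):
        for x, y in zip(keys, patterns):
            e = y - M @ x              # error vector, eq. (3.33)
            M += mu * np.outer(e, x)   # LMS-style correction, eq. (3.37)
    return M

# Linearly independent (but not orthogonal) keys are still learned exactly,
# unlike with the plain outer product rule (3.6):
keys = [np.array([1.0, 1.0]), np.array([1.0, -0.5])]
pats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
M = train_error_correction(keys, pats)
print(np.round(M @ keys[0], 3))   # → approximately [1. 0.]
```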
Backpropagation Learning Algorithm
The backpropagation (BP) algorithm is also known as the generalized delta rule.
Training MLPs with the backpropagation algorithm results in a nonlinear mapping: the MLP can have its synaptic weights adjusted by the BP algorithm to develop a specific nonlinear mapping.
The weights, fixed after the training process, can provide an association task for classification, pattern recognition, diagnosis, etc.
During the training phase of the MLP, the synaptic weights are adjusted to minimize the disparity between the actual and desired outputs of the MLP, averaged over all input patterns.
Backpropagation Learning Algorithm (cont.)
Basic BP algorithm for the feedforward MLP: a 3-layer network with 1 output layer and 2 hidden layers.
Backpropagation Learning Algorithm (cont.)
The standard BP algorithm for training the MLP NN is based on the steepest descent gradient approach applied to the minimization of an energy function representing the instantaneous error:

E_q = \frac{1}{2}(d_q - x_{out}^{(3)})^T (d_q - x_{out}^{(3)}) = \frac{1}{2}\sum_{h=1}^{n_3}(d_{qh} - x_{out,h}^{(3)})^2   (3.40)

where d_q denotes the desired output for the q-th input and x_{out}^{(3)} = y_q is the actual output of the MLP.
Backpropagation Learning Algorithm (cont.)
Using the steepest-descent gradient approach, the learning rule for a network weight in any one of the network layers is given by

\Delta w_{ji}^{(s)} = -\eta^{(s)}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}},  s = 1, 2, 3   (3.41)
Backpropagation Learning Algorithm (cont.)
The weights in the output layer can be updated according to

\Delta w_{ji}^{(3)} = -\eta^{(3)}\,\frac{\partial E_q}{\partial w_{ji}^{(3)}}   (3.42)

Using the chain rule for the partial derivatives, (3.42) can be rewritten as

\Delta w_{ji}^{(3)} = -\eta^{(3)}\,\frac{\partial E_q}{\partial v_j^{(3)}}\,\frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}}   (3.43)
Backpropagation Learning Algorithm (cont.)
Each factor of (3.43) is computed separately:

\frac{\partial v_j^{(3)}}{\partial w_{ji}^{(3)}} = \frac{\partial}{\partial w_{ji}^{(3)}} \sum_{h=1}^{n_2} w_{jh}^{(3)}\, x_{out,h}^{(2)} = x_{out,i}^{(2)}   (3.44)

\frac{\partial E_q}{\partial v_j^{(3)}} = \frac{\partial}{\partial v_j^{(3)}}\,\frac{1}{2}\sum_{h=1}^{n_3} [d_{qh} - f(v_h^{(3)})]^2 = -[d_{qj} - f(v_j^{(3)})]\, g(v_j^{(3)})   (3.45)

or

-\frac{\partial E_q}{\partial v_j^{(3)}} = (d_{qj} - x_{out,j}^{(3)})\, g(v_j^{(3)}) = \delta_j^{(3)}   (3.46)

where \delta_j^{(3)} is the local error (delta) and g = f' is the derivative of the activation function, with E_q taken from (3.40).
Backpropagation Learning Algorithm (cont.)
Combining (3.43), (3.44), and (3.46), the learning rule for the weights in the output layer of the network is

\Delta w_{ji}^{(3)} = \eta^{(3)}\,\delta_j^{(3)}\, x_{out,i}^{(2)}   (3.47)

or

w_{ji}^{(3)}(k+1) = w_{ji}^{(3)}(k) + \eta^{(3)}\,\delta_j^{(3)}\, x_{out,i}^{(2)}   (3.48)

In the hidden layer, applying the steepest descent gradient approach,

\Delta w_{ji}^{(2)} = -\eta^{(2)}\,\frac{\partial E_q}{\partial w_{ji}^{(2)}} = -\eta^{(2)}\,\frac{\partial E_q}{\partial v_j^{(2)}}\,\frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}}   (3.49)
Backpropagation Learning Algorithm (cont.)
The two partial derivatives on the right side of (3.49) can be expressed as

\frac{\partial v_j^{(2)}}{\partial w_{ji}^{(2)}} = \frac{\partial}{\partial w_{ji}^{(2)}} \sum_{h=1}^{n_1} w_{jh}^{(2)}\, x_{out,h}^{(1)} = x_{out,i}^{(1)}   (3.50)

Writing the energy function in terms of the hidden-layer outputs,

E_q = \frac{1}{2}\sum_{h=1}^{n_3}\Big[d_{qh} - f\Big(\sum_{p=1}^{n_2} w_{hp}^{(3)}\, x_{out,p}^{(2)}\Big)\Big]^2   (3.51)

\frac{\partial E_q}{\partial v_j^{(2)}} = -\sum_{h=1}^{n_3} (d_{qh} - x_{out,h}^{(3)})\, g(v_h^{(3)})\, w_{hj}^{(3)}\, g(v_j^{(2)}) = -g(v_j^{(2)}) \sum_{h=1}^{n_3} \delta_h^{(3)}\, w_{hj}^{(3)}   (3.52)
Backpropagation Learning Algorithm (cont.)
Combining equations (3.49), (3.50), and (3.52) yields

\Delta w_{ji}^{(2)} = \eta^{(2)}\,\delta_j^{(2)}\, x_{out,i}^{(1)}   (3.53)

or

w_{ji}^{(2)}(k+1) = w_{ji}^{(2)}(k) + \eta^{(2)}\,\delta_j^{(2)}\, x_{out,i}^{(1)}   (3.54)

where \delta_j^{(2)} = g(v_j^{(2)}) \sum_h \delta_h^{(3)} w_{hj}^{(3)}.
Backpropagation Learning Algorithm (cont.)
Generalized weight update form:

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)}\,\delta_j^{(s)}\, x_{out,i}^{(s-1)}   (3.55)

where, for the output layer,

\delta_j^{(s)} = (d_{qj} - x_{out,j}^{(s)})\, g(v_j^{(s)})   (3.56)

and, for the hidden layers,

\delta_j^{(s)} = g(v_j^{(s)}) \sum_{h=1}^{n_{s+1}} \delta_h^{(s+1)}\, w_{hj}^{(s+1)}   (3.57)
Backpropagation Learning Algorithm (cont.) Standard BP algorithm
Step 1: Initialize the network synaptic weights to small random values.
Step 2: From the set of training input/output pair, present an input pattern and calculate the network response.
Step 3: The desired network response is compared with the actual output of the network, and by using (3.56) and (3.57) all the local errors can be computed.
Step 4: The weights of the network are updated according to (3.55).
Step 5: Repeat steps 2 through 4 until the network reaches a predetermined level of accuracy in producing the adequate response for all the training patterns.
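The five steps can be sketched for a one-hidden-layer network (the layer sizes, data, and learning rate below are illustrative, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh_deriv(a):
    return 1.0 - a**2          # g(v) expressed via the output a = tanh(v)

# Step 1: small random weights
W1 = rng.uniform(-0.5, 0.5, (8, 1)); b1 = np.zeros(8)
W2 = rng.uniform(-0.5, 0.5, (1, 8)); b2 = np.zeros(1)
eta = 0.05

# Train y = sin(x) on [0, 3]
X = np.linspace(0, 3, 40); D = np.sin(X)
for epoch in range(3000):
    for x, d in zip(X, D):
        # Step 2: forward pass
        a1 = np.tanh(W1 @ [x] + b1)
        y = W2 @ a1 + b2                      # linear output neuron
        # Step 3: local errors, eqs. (3.56)-(3.57)
        delta2 = d - y                        # output delta (g = 1 for a linear unit)
        delta1 = tanh_deriv(a1) * (W2.T @ delta2)
        # Step 4: weight updates, eq. (3.55)
        W2 += eta * np.outer(delta2, a1); b2 += eta * delta2
        W1 += eta * np.outer(delta1, [x]); b1 += eta * delta1

pred = np.array([(W2 @ np.tanh(W1 @ [x] + b1) + b2)[0] for x in X])
print(round(float(np.mean((pred - D)**2)), 4))   # small MSE after training
```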
Backpropagation Learning Algorithm (cont.)
Some practical issues in using standard BP — initialization of synaptic weights:
The weights are initially set to small random values; if they are set too large, the neurons are likely to saturate.
A heuristic (Ham and Kostanic, 2001): set the weights to uniformly distributed random numbers in the interval from -0.5/fan-in to 0.5/fan-in.
Backpropagation Learning Algorithm (cont.)
Nguyen and Widrow's initialization algorithm (suited to architectures with a single hidden layer). Define:
- n0: number of components in the input layer
- n1: number of neurons in the hidden layer
- \beta: scaling factor
Step 1: Compute the scaling factor according to

\beta = 0.7\, n_1^{1/n_0}   (3.58)

Step 2: Initialize the weights w_{ij} of a layer as random numbers between -0.5 and 0.5.
Step 3: Reinitialize the weights according to

w_{ij} \leftarrow \frac{\beta\, w_{ij}}{\|w_i\|_2}   (3.59)

Step 4: For the i-th neuron in the hidden layer, set the bias to be a random number between -\beta and \beta.
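The four steps can be sketched as follows (function name and interface are illustrative):

```python
import numpy as np

def nguyen_widrow_init(n0, n1, rng=np.random.default_rng()):
    """Nguyen-Widrow initialization for a single hidden layer, eqs. (3.58)-(3.59)."""
    beta = 0.7 * n1 ** (1.0 / n0)                  # step 1: scaling factor, eq. (3.58)
    W = rng.uniform(-0.5, 0.5, (n1, n0))           # step 2: random weights
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = beta * W / norms                           # step 3: rescale each neuron, eq. (3.59)
    b = rng.uniform(-beta, beta, n1)               # step 4: biases in [-beta, beta]
    return W, b

W, b = nguyen_widrow_init(n0=2, n1=10)
print(np.allclose(np.linalg.norm(W, axis=1), 0.7 * 10 ** 0.5))   # → True (each row has norm beta)
```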
Backpropagation Learning Algorithm (cont.)
Network configuration and the ability of the network to generalize:
The configuration of the MLP NN is determined by the number of hidden layers, the number of neurons in each hidden layer, and the activation function used by the neurons.
Network performance depends less on the form of the activation function than on the number of hidden layers and the number of neurons per hidden layer.
The number of hidden-layer neurons is usually determined by trial and error; MLP NNs are commonly designed with two hidden layers.
In general, more hidden-layer neurons can ensure good network performance, but an "over-designed" architecture causes overfitting and the network loses its ability to generalize.
Backpropagation Learning Algorithm (cont.)
Example 3.2: A network with one hidden layer of 50 neurons is trained to approximate the nonlinear function

y = e^{-x} \sin(3x)

(a) Sampling every 0.2 on [0,4], giving 21 points.
(b) Sampling every 0.01 on [0,4], giving 400 points (which causes overfitting).
Backpropagation Learning Algorithm (cont.)
Independent validation:
Assessing the final performance quality of the network with the training data leads to overfitting.
This problem can be avoided with independent validation: split the available data into a training set and a testing set. First randomize the data, then divide it into two parts: the training set, used to update the network's weights, and the testing set, used to evaluate the training performance.
Backpropagation Learning Algorithm (cont.)
Speed of convergence:
The convergence properties depend on the magnitude of the learning rate. To guarantee that the network converges and to avoid oscillation during training, the learning rate must be set to a fairly small value.
If training starts far from the global minimum, many neurons can saturate, making the gradients very small; the search may even get stuck in a local minimum of the error surface, slowing convergence.
Fast algorithms fall into two broad classes:
- Various heuristic improvements to the standard BP algorithm
- Use of standard numerical optimization techniques
Backpropagation Learning Algorithm (cont.)
Backpropagation learning algorithm with momentum updating:
The weights are updated in a direction that is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of the training:

\Delta w_{ji}^{(s)}(k+1) = \eta^{(s)}\,\delta_j^{(s)}(k)\, x_{out,i}^{(s-1)}(k) + \alpha\,\Delta w_{ji}^{(s)}(k)   (3.61)

or

w_{ji}^{(s)}(k+1) = w_{ji}^{(s)}(k) + \eta^{(s)}\,\delta_j^{(s)}(k)\, x_{out,i}^{(s-1)}(k) + \alpha\,[w_{ji}^{(s)}(k) - w_{ji}^{(s)}(k-1)]   (3.62)

where the last term is the momentum term.
Backpropagation Learning Algorithm (cont.)
This type of learning improves convergence If the training patterns contain some element of
uncertainly, then updating with momentum provides a sort of low-pass filtering by preventing rapid changes in the direction of the weight updates.
Render the training relatively immune to the presence of outliers or erroneous training pairs.
Increase the rate of weight change, and the speed of convergence is increased.
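A single momentum update per (3.61) can be sketched as follows (the array shapes and values are illustrative):

```python
import numpy as np

def momentum_step(W, dW_prev, delta, x_prev_layer, eta=0.1, alpha=0.9):
    """One weight update with momentum, eq. (3.61):
    dW(k+1) = eta * delta * x^T + alpha * dW(k)."""
    dW = eta * np.outer(delta, x_prev_layer) + alpha * dW_prev
    return W + dW, dW

W = np.zeros((2, 3)); dW = np.zeros_like(W)
delta = np.array([0.5, -0.2]); x = np.array([1.0, 0.0, -1.0])
W, dW = momentum_step(W, dW, delta, x)
print(W[0, 0])   # → 0.05 (eta * delta[0] * x[0] on the first step)
```

On the first step the momentum term is zero, so the update equals the plain gradient step; subsequent calls reuse `dW` so the previous direction is carried forward.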
Backpropagation Learning Algorithm (cont.)
The foregoing points can be expressed with the following update equation:

\Delta w_{ji}^{(s)}(k+1) = -\eta^{(s)} \sum_{t=0}^{k} \alpha^{k-t}\,\frac{\partial E_q(t)}{\partial w_{ji}^{(s)}(t)}   (3.63)

If the network operates in a flat region of the error surface, the gradient does not change, so (3.63) can be rewritten as

\Delta w_{ji}^{(s)}(k+1) = -\eta^{(s)}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}} \sum_{t=0}^{k} \alpha^{k-t} \approx -\frac{\eta^{(s)}}{1-\alpha}\,\frac{\partial E_q}{\partial w_{ji}^{(s)}}   (3.64)

Since the forgetting factor \alpha is always less than 1, the effective learning rate can be defined as

\eta_{eff}^{(s)} = \frac{\eta^{(s)}}{1-\alpha}   (3.65)
Backpropagation Learning Algorithm (cont.)
Batch updating:
Standard BP updates the weights after every input/output training pair. Batch updating instead accumulates the corrections over many training patterns before updating the weights (it can be viewed as averaging the corrections of the individual I/O pairs and then applying the average).
Batch updating has the following advantages:
- It gives a much better estimate of the error surface.
- It provides some inherent low-pass filtering of the training patterns.
- It is suitable for more sophisticated optimization procedures.
Backpropagation Learning Algorithm (cont.)
Search-then-converge method — a heuristic strategy for speeding up BP:
- Search phase: the network is relatively far from the global minimum; the learning rate is kept sufficiently large and relatively constant.
- Converge phase: the network is approaching the global minimum; the learning rate is decreased at each iteration.
Backpropagation Learning Algorithm (cont.)
Since in practice the distance from the global minimum cannot be known, the learning rate can be scheduled by either of the following:

\eta(k) = \eta_0\,\frac{1}{1 + k/k_0}   (3.66)

\eta(k) = \eta_0\,\frac{1 + \frac{c}{\eta_0}\frac{k}{k_0}}{1 + \frac{c}{\eta_0}\frac{k}{k_0} + k_0\left(\frac{k}{k_0}\right)^2}   (3.67)

Typically 1 < c/\eta_0 < 100 and 100 < k_0 < 500. For k << k_0 the learning rate stays approximately \eta_0 (search phase); for k >> k_0 the learning rate decreases in proportion to 1/k to 1/k^2 (converge phase).
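The simpler schedule (3.66) can be sketched as follows (the parameter values are illustrative):

```python
def search_then_converge(k, eta0=0.5, k0=200.0):
    """Search-then-converge schedule, eq. (3.66): eta(k) = eta0 / (1 + k/k0)."""
    return eta0 / (1.0 + k / k0)

# Early iterations keep the rate near eta0; late iterations decay like 1/k
print(round(search_then_converge(0), 3))       # → 0.5
print(round(search_then_converge(10), 3))      # → 0.476
print(round(search_then_converge(20000), 4))   # → 0.005
```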
Backpropagation Learning Algorithm (cont.) Batch Updating with Variable Learning Rate
A simple heuristic strategy to increase the convergence speed of BP with batch update: increase the magnitude of the learning rate if the learning in the previous step has decreased the total error function; if the error function has increased, decrease the learning rate.
The algorithm can be summarized as follows:
- If the error function over the entire training set has decreased, increase the learning rate by multiplying it by a number \rho > 1 (typically \rho = 1.05).
- If the error function has increased by more than some set percentage \xi, decrease the learning rate by multiplying it by a number \sigma < 1 (typically \sigma = 0.7).
- If the error function has increased by less than the percentage \xi, the learning rate remains unchanged.
Applying the variable learning rate to batch updating can significantly speed up convergence, but the algorithm can easily be trapped in a local minimum; a minimum learning rate \eta_{min} can be imposed.
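The three rules can be sketched as a small helper; the 1.05 and 0.7 factors come from the text, while the tolerance and floor values are illustrative assumptions:

```python
def adapt_learning_rate(eta, err, err_prev, up=1.05, down=0.7,
                        tol=1.04, eta_min=1e-4):
    """Variable learning rate for batch BP. `tol` (a 4% increase tolerance)
    and `eta_min` are illustrative choices, not values from the text."""
    if err < err_prev:
        eta *= up                 # error decreased: speed up
    elif err > tol * err_prev:
        eta *= down               # error grew past the tolerance: slow down
    return max(eta, eta_min)      # floor against a vanishing step size

eta = 0.1
eta = adapt_learning_rate(eta, err=0.9, err_prev=1.0)
print(round(eta, 4))   # → 0.105
eta = adapt_learning_rate(eta, err=1.2, err_prev=0.9)
print(round(eta, 4))   # → 0.0735
```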
Backpropagation Learning Algorithm (cont.)
Vector-matrix form of the BP algorithm. From Fig. 3.4, the energy function is defined as

E_q = \frac{1}{2}(d_q - x_{out}^{(3)})^T (d_q - x_{out}^{(3)})   (3.68)

Using the steepest descent approach, the synaptic weight update equation can be written as

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) - \eta^{(s)}\,\frac{\partial E_q}{\partial w_{ij}^{(s)}}   (3.69)
Backpropagation Learning Algorithm (cont.)
Applying the chain rule to (3.69),

\frac{\partial E_q}{\partial w_{ij}^{(s)}} = \frac{\partial E_q}{\partial v_i^{(s)}}\,\frac{\partial v_i^{(s)}}{\partial w_{ij}^{(s)}}   (3.70)

The second factor of (3.70) can be rewritten as

\frac{\partial v_i^{(s)}}{\partial w_{ij}^{(s)}} = \frac{\partial}{\partial w_{ij}^{(s)}} \sum_{h=1}^{n_{s-1}} w_{ih}^{(s)}\, x_{out,h}^{(s-1)} = x_{out,j}^{(s-1)}   (3.71)

The first factor of (3.70) is commonly called the sensitivity:

\delta_i^{(s)} = \frac{\partial E_q}{\partial v_i^{(s)}}   (3.72)
Backpropagation Learning Algorithm (cont.)
Using (3.70) and (3.71), (3.69) can be rewritten as

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) - \eta^{(s)}\,\delta_i^{(s)}\, x_{out,j}^{(s-1)}   (3.73)

or, in vector-matrix form,

W^{(s)}(k+1) = W^{(s)}(k) - \eta^{(s)}\, D^{(s)}\,(x_{out}^{(s-1)})^T   (3.74)

where D^{(s)} represents the sensitivity vector of the s-th layer:

D^{(s)} = \left[\frac{\partial E_q}{\partial v_1^{(s)}}, \frac{\partial E_q}{\partial v_2^{(s)}}, \ldots, \frac{\partial E_q}{\partial v_{n_s}^{(s)}}\right]^T   (3.75)
Backpropagation Learning Algorithm (cont.)
Calculation of the sensitivities. For the output layer,

\frac{\partial E_q}{\partial v_i^{(3)}} = \frac{\partial}{\partial v_i^{(3)}}\,\frac{1}{2}\sum_{h=1}^{n_3}[d_{qh} - f(v_h^{(3)})]^2 = -(d_{qi} - x_{out,i}^{(3)})\, g(v_i^{(3)})   (3.76)

where g(x) = f'(x) is the derivative of the activation function.
Backpropagation Learning Algorithm (cont.)
Substituting (3.76) into (3.75), the sensitivity vector of the output layer can be expressed as

D^{(3)} = -G'(v^{(3)})\,(d_q - x_{out}^{(3)})   (3.77)

Using the definition (3.75) and the chain rule, the sensitivity vector of the second layer can be expressed as a function of the sensitivity vector of the output layer:

D^{(2)} = \left(\frac{\partial v^{(3)}}{\partial v^{(2)}}\right)^T D^{(3)}   (3.78)
Backpropagation Learning Algorithm (cont.)
The sensitivity in (3.78) involves the Jacobian matrix

\frac{\partial v^{(3)}}{\partial v^{(2)}} =
[ \partial v_1^{(3)}/\partial v_1^{(2)}  \cdots  \partial v_1^{(3)}/\partial v_{n_2}^{(2)}
  \vdots                                  \ddots  \vdots
  \partial v_{n_3}^{(3)}/\partial v_1^{(2)}  \cdots  \partial v_{n_3}^{(3)}/\partial v_{n_2}^{(2)} ]   (3.79)

Similarly, the sensitivity vector of the first layer can be expressed as

D^{(1)} = \left(\frac{\partial v^{(2)}}{\partial v^{(1)}}\right)^T D^{(2)}   (3.80)
Backpropagation Learning Algorithm (cont.)
The elements of the Jacobian matrix in (3.79) are

\frac{\partial v_i^{(3)}}{\partial v_j^{(2)}} = \frac{\partial}{\partial v_j^{(2)}} \sum_{h=1}^{n_2} w_{ih}^{(3)}\, x_{out,h}^{(2)} = w_{ij}^{(3)}\,\frac{\partial f(v_j^{(2)})}{\partial v_j^{(2)}} = w_{ij}^{(3)}\, g(v_j^{(2)})   (3.81)

Substituting (3.81) into (3.79), the Jacobian matrix can be rewritten as

\frac{\partial v^{(3)}}{\partial v^{(2)}} = W^{(3)}\, G'(v^{(2)})   (3.82)

where

G'(v^{(2)}) = \operatorname{diag}[\,g(v_1^{(2)}),\, g(v_2^{(2)}),\, \ldots,\, g(v_{n_2}^{(2)})\,]
Backpropagation Learning Algorithm (cont.)
Finally, combining (3.82) and (3.78) gives

D^{(2)} = G'(v^{(2)})\,(W^{(3)})^T D^{(3)}   (3.83)

and similarly the sensitivity of the first layer can be expressed as

D^{(1)} = G'(v^{(1)})\,(W^{(2)})^T D^{(2)}   (3.84)
Backpropagation Learning Algorithm (cont.)
Vector-matrix form of the BP algorithm:
Step 1: Present an input pattern and calculate the outputs of the network and all the internal layers.
Step 2: For each of the layers, calculate the sensitivity vector according to

D^{(s)} = -G'(v^{(s)})\,(d_q - x_{out}^{(s)})   for the output layer
D^{(s)} = G'(v^{(s)})\,(W^{(s+1)})^T D^{(s+1)}   for all hidden layers

Step 3: Update the synaptic weights of the network according to

W^{(s)}(k+1) = W^{(s)}(k) - \eta^{(s)}\, D^{(s)}\,(x_{out}^{(s-1)})^T

Continue steps 1 through 3 until the network reaches the desired mapping accuracy.
Accelerated Learning BP Algorithm
Conjugate gradient BP for the feedforward multilayer perceptron:
The conjugate gradient method is a well-known numerical technique for solving various optimization problems; it can be viewed as a compromise between steepest descent and Newton's method.
Assume that near the solution, the error function of all the weights in the network can be accurately approximated by a quadratic:

J(w) = \frac{1}{2P}\sum_{p=1}^{P}\sum_{h=1}^{n_s}(d_{ph} - y_{ph})^2 \approx \frac{1}{2}\, w^T Q\, w + b^T w   (3.85)

where P is the number of training patterns, n_s is the number of output-layer neurons, and d_{ph} is the desired output of the h-th neuron for the p-th training pattern.
Accelerated Learning BP Algorithm (cont.)
In (3.85), Q is the square matrix of second derivatives, the Hessian matrix.
The conjugate gradient algorithm attempts to find conjugate directions on the error surface and updates the weights along those directions.
Since computing the Hessian matrix Q is quite involved, most conjugate gradient algorithms try to find conjugate directions without explicitly computing the Hessian: a set of normal equations is constructed for each neuron, and the conjugate gradient method is then used to solve these normal equations iteratively.
Accelerated Learning BP Algorithm (cont.)
Normal equations for the linear combiner: the output of the i-th linear combiner in the s-th layer of the MLP NN can be expressed as

v_i^{(s)} = (w_i^{(s)})^T x_{out}^{(s-1)}   (3.86)

Assuming the desired output of each combiner is known, the goal of learning is to minimize the squared-error cost function

J_i^{(s)} = \frac{1}{2}\sum_{q=1}^{M}(d_{qi}^{(s)} - v_{qi}^{(s)})^2   (3.87)

where M is the number of training patterns.
Accelerated Learning BP Algorithm (cont.)
Substituting (3.86) into (3.87) gives

J_i^{(s)} = \frac{1}{2}\sum_{q=1}^{M}\big(d_{qi}^{(s)} - (w_i^{(s)})^T x_{out,q}^{(s-1)}\big)^2   (3.88)

Taking the partial derivative of J with respect to w and setting it to zero,

\frac{\partial J_i^{(s)}}{\partial w_i^{(s)}} = -\sum_{q=1}^{M}\big(d_{qi}^{(s)}\, x_{out,q}^{(s-1)} - x_{out,q}^{(s-1)} (x_{out,q}^{(s-1)})^T w_i^{(s)}\big) = 0   (3.89)

Define

C_i^{(s)} = \sum_{q=1}^{M} x_{out,q}^{(s-1)}\,(x_{out,q}^{(s-1)})^T   (3.90)

p_i^{(s)} = \sum_{q=1}^{M} d_{qi}^{(s)}\, x_{out,q}^{(s-1)}   (3.91)
Accelerated Learning BP Algorithm (cont.)
Equation (3.89) can then be rewritten as

C_i^{(s)}\, w_i^{(s)} = p_i^{(s)}   (3.92)

C_i^{(s)} can be viewed as an estimate of the covariance matrix of the input vectors to the s-th layer, and p_i^{(s)} as the cross-correlation vector between the layer's inputs and the desired output of the linear combiner.
Training the network can be viewed as a process that includes solving these normal equations, so that the difference between the desired and actual outputs goes to zero.
Backpropagation Learning Algorithm (cont.)
BP with adaptive slopes of the activation function:
For standard BP, the rate of the update is proportional to the derivative of the nonlinear activation function.
See Fig. 3.6: during training, the output of a neuron can enter the saturation region of the activation function, where the derivative is very small, making learning very slow. Remedies:
- Reduce the slope of the activation function to shrink the saturation region.
- Use an optimal activation-function slope that balances the speed of the network training and its mapping capability.
Backpropagation Learning Algorithm (cont.)
[Figure: (a) a typical MLP NN activation function; (b) the derivative of the activation function in (a).]
Backpropagation Learning Algorithm (cont.)
The same minimization performance criterion as in (3.40) is used:

E_q = \frac{1}{2}(d_q - x_{out}^{(s)})^T(d_q - x_{out}^{(s)}) = \frac{1}{2}\sum_{h=1}^{n_s}(d_{qh} - x_{out,h}^{(s)})^2   (3.113)

where s denotes the network layer, d the desired output, and x the actual output. The sigmoid activation function with adjustable slope is

f(\lambda, v) = \frac{1 - \exp(-\lambda v)}{1 + \exp(-\lambda v)}   (3.114)

where v is the input and \lambda the slope.
Backpropagation Learning Algorithm (cont.)
The slope update rule for the i-th neuron in layer s is

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) - \eta_\lambda^{(s)}\,\frac{\partial E_q}{\partial \lambda_i^{(s)}}   (3.115)

Using the chain rule, the derivative in (3.115) can be rewritten as

\frac{\partial E_q}{\partial \lambda_i^{(s)}} = \frac{\partial E_q}{\partial x_{out,i}^{(s)}}\,\frac{\partial x_{out,i}^{(s)}}{\partial \lambda_i^{(s)}} = \frac{\partial E_q}{\partial x_{out,i}^{(s)}}\,\frac{\partial f(\lambda_i^{(s)}, v_i^{(s)})}{\partial \lambda_i^{(s)}}   (3.116)
Backpropagation Learning Algorithm (cont.)
For the activation function (3.114),

\frac{\partial f(\lambda, v)}{\partial v} = \frac{\lambda}{2}\,[1 - f^2(\lambda, v)]   (3.117)

\frac{\partial f(\lambda, v)}{\partial \lambda} = \frac{v}{2}\,[1 - f^2(\lambda, v)]   (3.118)

Substituting (3.116)-(3.118) into (3.115), the dynamic slope update equation of the activation function becomes

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) + \eta_\lambda^{(s)}\,\delta_i^{(s)}\,\frac{v_i^{(s)}}{\lambda_i^{(s)}}   (3.119)
Backpropagation algorithm with adaptive activation function slopes
Step 1: Initialize the weights in the network according to the standard initialization procedure.
Step 2: From the set of training input/output pairs, present an input pattern and calculate the network response.
Step 3: Compare the desired network response with the actual output of the network, and compute the local errors using

\delta_i^{(s)} = (d_{qi} - x_{out,i}^{(s)})\, g(v_i^{(s)})   for the output layer
\delta_i^{(s)} = g(v_i^{(s)}) \sum_{h=1}^{n_{s+1}} \delta_h^{(s+1)}\, w_{hi}^{(s+1)}   for hidden layers
Backpropagation algorithm with adaptive activation function slopes (cont.)
Step 4: Update the weights of the network according to

w_{ij}^{(s)}(k+1) = w_{ij}^{(s)}(k) + \eta^{(s)}\,\delta_i^{(s)}\, x_{out,j}^{(s-1)}

Step 5: Update the slopes of the activation functions according to

\lambda_i^{(s)}(k+1) = \lambda_i^{(s)}(k) + \eta_\lambda^{(s)}\,\delta_i^{(s)}\,\frac{v_i^{(s)}}{\lambda_i^{(s)}} + \alpha\,[\lambda_i^{(s)}(k) - \lambda_i^{(s)}(k-1)]

where the last term is a momentum term.
Step 6: Stop if the network has converged; else go back to Step 2.
Levenberg-Marquardt Algorithm
The Levenberg-Marquardt BP (LMBP) algorithm represents a simplified version of Newton's method applied to the problem of training MLP NNs.
Summary of Newton's optimization algorithm — finding a vector w that minimizes a given energy function E(w). According to Newton's method, an iterative procedure accomplishes this minimization with the following steps:
Step 1: Initialize the components of the vector w to some random values.
Step 2: Update the vector w according to

w(k+1) = w(k) - H_k^{-1} g_k   (3.120)
Levenberg-Marquardt Algorithm (cont.)
The matrix H_k^{-1} represents the inverse of the Hessian matrix, given by

H =
[ \partial^2 E/\partial w_1^2              \partial^2 E/\partial w_1 \partial w_2   \cdots  \partial^2 E/\partial w_1 \partial w_N
  \partial^2 E/\partial w_2 \partial w_1   \partial^2 E/\partial w_2^2              \cdots  \partial^2 E/\partial w_2 \partial w_N
  \vdots                                   \vdots                                   \ddots  \vdots
  \partial^2 E/\partial w_N \partial w_1   \partial^2 E/\partial w_N \partial w_2   \cdots  \partial^2 E/\partial w_N^2 ]   (3.121)

The vector g_k represents the gradient of the energy function, computed as

g = \left[\frac{\partial E(w)}{\partial w_1}, \frac{\partial E(w)}{\partial w_2}, \ldots, \frac{\partial E(w)}{\partial w_N}\right]^T   (3.122)
Levenberg-Marquardt Algorithm (cont.)
However, (3.120) has a critical drawback: it requires inverting the Hessian matrix. The LMBP algorithm therefore provides a practical approximation.
For the MLP of Fig. 3.4, training can be viewed as the search for the weights that minimize the error between the desired and actual outputs over all patterns. With a finite number of patterns, the energy function can be written as

E(w) = \frac{1}{2}\sum_{q=1}^{Q}(d_q - x_{out,q}^{(3)})^T(d_q - x_{out,q}^{(3)}) = \frac{1}{2}\sum_{q=1}^{Q}\sum_{h=1}^{n_3}(d_{qh} - x_{out,qh}^{(3)})^2   (3.123)

where Q is the total number of training patterns, w is the vector of all weights, d_q is the desired output, and x_{out,q}^{(3)} is the actual output of the network for the q-th training pattern.
Levenberg-Marquardt Algorithm (cont.)
According to Newton's method, the optimal weights that minimize the energy function are obtained from

w(k+1) = w(k) - H_k^{-1} g_k   (3.124)

where

H_k = \nabla^2 E(w)\big|_{w=w(k)}   (3.125)
g_k = \nabla E(w)\big|_{w=w(k)}   (3.126)

Letting P = n_3 Q, (3.123) can be written as

E(w) = \frac{1}{2}\sum_{p=1}^{P}(d_p - x_{out,p}^{(3)})^2 = \frac{1}{2}\sum_{p=1}^{P}e_p^2   (3.127)

where

e_p = d_p - x_{out,p}^{(3)}   (3.128)
Levenberg-Marquardt Algorithm (cont.)
The gradient (3.126) can be obtained as

g = \nabla E(w) = \left[\sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_1},\ \sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_2},\ \ldots,\ \sum_{p=1}^{P} e_p \frac{\partial e_p}{\partial w_N}\right]^T = J^T e   (3.129)

where J is the Jacobian matrix

J =
[ \partial e_1/\partial w_1   \partial e_1/\partial w_2   \cdots  \partial e_1/\partial w_N
  \partial e_2/\partial w_1   \partial e_2/\partial w_2   \cdots  \partial e_2/\partial w_N
  \vdots                      \vdots                      \ddots  \vdots
  \partial e_P/\partial w_1   \partial e_P/\partial w_2   \cdots  \partial e_P/\partial w_N ]   (3.130)
Levenberg-Marquardt Algorithm (cont.)
The (k, j)-th element of the Hessian matrix can be expressed as

\frac{\partial^2 E(w)}{\partial w_k \partial w_j} = \sum_{p=1}^{P}\left(\frac{\partial e_p}{\partial w_k}\frac{\partial e_p}{\partial w_j} + e_p\,\frac{\partial^2 e_p}{\partial w_k \partial w_j}\right)   (3.131)

Using the Jacobian form (3.130), the Hessian matrix can be expressed as

\nabla^2 E(w) = J^T J + S   (3.132)

where S is a matrix of second derivatives:

S = \sum_{p=1}^{P} e_p\,\nabla^2 e_p   (3.133)

Near a minimum of the energy function, the elements of S become very small, so the Hessian matrix can be approximated as

H \approx J^T J   (3.134)
Levenberg-Marquardt Algorithm (cont.)
Substituting (3.129) and (3.134) into Newton's method (3.124) gives

w(k+1) = w(k) - (J_k^T J_k)^{-1} J_k^T e_k   (3.135)

Equation (3.135) can fail when inverting H = J^T J, so (3.134) is modified slightly:

H = J^T J + \mu I   (3.136)

where \mu is a small value and I is the identity matrix. Substituting (3.136) into (3.135),

w(k+1) = w(k) - (J_k^T J_k + \mu I)^{-1} J_k^T e_k   (3.137)
Levenberg-Marquardt Algorithm (cont.)
Equation (3.137) is in fact a transition between steepest descent and Newton's method:
- When \mu is very small, (3.137) approaches Newton's method (3.135).
- When \mu grows large, the second term inside the parentheses dominates, and (3.137) can be rewritten as

w(k+1) \approx w(k) - \frac{1}{\mu} J_k^T e_k   (3.138)

Defining \eta_k = 1/\mu_k and using (3.129), (3.138) can be rewritten in the form of steepest descent:

w(k+1) = w(k) - \eta_k\, g_k   (3.139)
Levenberg-Marquardt Algorithm (cont.)
The main difficulty in implementing the LMBP algorithm is the computation of the Jacobian matrix J(w), whose elements are

J_{i,j} = \frac{\partial e_i}{\partial w_j}   (3.140)

The derivative in (3.140) can be approximated by the finite difference

J_{i,j} \approx \frac{\Delta e_i}{\Delta w_j}   (3.141)

where \Delta e_i denotes the change in the output error caused by a small change \Delta w_j of the weight.
Levenberg-Marquardt Algorithm (cont.) Levenberg-Marquardt BP Algorithm
Step 1: Initialize the network weights to small random values. Set the learning rate parameter.
Step 2: Present an input pattern, and calculate the output of the network.
Step 3: Use (3.141) to calculate the elements of the Jacobian matrix associated with the input/output pairs.
Step 4: When the last input/output pair is presented, use (3.137) to perform the update of the weights.
Step 5: Stop if the network has converged; else go back to Step 2.
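A single LM update per (3.137), with the finite-difference Jacobian of (3.141), can be sketched as follows; `residuals`, `mu`, and `dw` are illustrative names and values:

```python
import numpy as np

def lm_step(w, residuals, mu=1e-2, dw=1e-6):
    """One Levenberg-Marquardt update, eq. (3.137), with the Jacobian
    approximated by finite differences as in (3.141).
    `residuals(w)` must return the error vector e(w)."""
    e = residuals(w)
    J = np.zeros((e.size, w.size))
    for j in range(w.size):                 # eq. (3.141): J_ij ≈ Δe_i / Δw_j
        wp = w.copy(); wp[j] += dw
        J[:, j] = (residuals(wp) - e) / dw
    A = J.T @ J + mu * np.eye(w.size)       # eq. (3.136)
    return w - np.linalg.solve(A, J.T @ e)  # eq. (3.137)

# Fit y = a*x + b to noiseless data; a few steps nearly solve the problem
x = np.linspace(0, 1, 10); y = 2.0 * x + 1.0
res = lambda w: (w[0] * x + w[1]) - y
w = np.array([0.0, 0.0])
for _ in range(5):
    w = lm_step(w, res)
print(np.round(w, 3))   # → close to [2. 1.]
```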
Levenberg-Marquardt Algorithm (cont.) Additional remarks:
The weight update approach presented here is a batch version of the LMBP algorithm.
The learning rate parameter k can be modified dynamically during the learning process.
The Jacobian matrix used in the update equation need not be computed for the entire set of input/output training pairs.
Counterpropagation
Counterpropagation networks were proposed by Hecht-Nielsen (1987). A counterpropagation network functions as a self-programming, near-optimal lookup table, providing a bidirectional mapping between the input and output training patterns.
Counterpropagation networks converge faster than MLPs, but they require more neurons than MLPs.
There are two counterpropagation architectures:
- Forward-only counterpropagation networks
- Full counterpropagation networks
Counterpropagation (cont.)
A forward-only counterpropagation network consists of an input layer, a hidden layer, and an output layer.
The two sets of weights are adjusted by different learning algorithms:
- Kohonen's self-organizing learning rule
- Grossberg's learning rule
Counterpropagation (cont.)
During training, the desired patterns must be provided; the Kohonen and Grossberg layer weights are trained separately.
First, compute the distance between the input vector and the weights connecting the input layer to each hidden neuron:

z_j = dist(x, w_j) = \|x - w_j\|_2^2 = \sum_{k=1}^{n}(x_k - w_{jk})^2   (3.142)

The output of the hidden layer is then determined by the condition

z_i = 1 if i is the smallest integer for which z_i \le z_j for all j; otherwise z_i = 0   (3.143)

(the neuron whose weights are closest to the input pattern outputs 1; all the others output 0).
Counterpropagation (cont.)
Next, the weights are updated according to Kohonen's self-organizing learning rule (only the winner neuron is updated, the other neurons' weights are left unchanged, and the learning rate is decreased over time):

w_j(k+1) = [1 - \alpha(k)]\,w_j(k) + \alpha(k)\,x   (3.144)

The output-layer weights are updated according to Grossberg's learning rule (only the weights connected to the winning hidden neuron are updated):

u_{ji}(k+1) = u_{ji}(k) + \beta(k)\,[y_j - u_{ji}(k)]\,z_i   (3.145)
Counterpropagation (cont.)
Algorithm for training counterpropagation networks:
Step 1: Determine the number of clusters N (the number of hidden-layer neurons), and assign random initial weight values.
Step 2: Present an input pattern x and its corresponding output y.
Step 3: Compute the distances between the input pattern and the input weights according to (3.142).
Step 4: Compute the output of the hidden layer according to (3.143).
Step 5: Update the Kohonen-layer weights according to (3.144).
Step 6: Update the Grossberg-layer weights according to (3.145).
Step 7: Stop if the network has converged; otherwise go back to Step 2.
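The training loop can be sketched as follows; the hyperparameter values and function names are illustrative choices, and the z_i factor of (3.145) is implemented by updating only the winner's column:

```python
import numpy as np

def train_cpn(X, Y, n_hidden=10, epochs=60, alpha0=0.95, beta=0.1, seed=0):
    """Forward-only counterpropagation sketch, eqs. (3.142)-(3.145)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(X.min(), X.max(), (n_hidden, X.shape[1]))  # Kohonen layer
    U = rng.uniform(-0.5, 0.5, (Y.shape[1], n_hidden))         # Grossberg layer
    for k in range(epochs):
        alpha = alpha0 * np.exp(-k / 10.0)       # decaying learning rate
        for x, y in zip(X, Y):
            i = int(np.argmin(((x - W) ** 2).sum(axis=1)))  # winner, eqs. (3.142)-(3.143)
            W[i] = (1 - alpha) * W[i] + alpha * x           # Kohonen update, eq. (3.144)
            U[:, i] += beta * (y - U[:, i])                 # Grossberg update, eq. (3.145)
    return W, U

def predict(W, U, x):
    i = int(np.argmin(((x - W) ** 2).sum(axis=1)))
    return U[:, i]

# Approximate y = 1/(x+1) on [0, 4] as in Example 3.3
X = np.linspace(0, 4, 41).reshape(-1, 1)
Y = 1.0 / (X + 1.0)
W, U = train_cpn(X, Y)
print(float(predict(W, U, np.array([1.0]))[0]))   # roughly 1/(1+1) = 0.5
```

The output behaves like a quantized lookup table: accuracy is limited by the number of hidden (cluster) neurons, which is why counterpropagation needs more neurons than an MLP for the same mapping.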
Counterpropagation (cont.)
Example 3.3: Designing a neural network to approximate the function y = 1/(x+1) in the interval [0,4].
A forward-only counterpropagation network with 20 neurons in the hidden layer is used. The learning rate is set to 0.95 and decreased according to

\alpha(k) = \alpha(0)\, e^{-k/10}

The network is trained for 50 epochs.
Counterpropagation (cont.)
Full counterpropagation:
Forward-only counterpropagation networks provide training and mapping in one direction only (given an input, the corresponding output is obtained).
Full counterpropagation provides bidirectional training and mapping (given an input, the corresponding output can be obtained, and vice versa).
Full counterpropagation networks have four sets of weights:
- The weights between the input layers (x and y) and the cluster layer are trained with Kohonen's rule.
- The weights between the cluster layer and the two output layers are trained with Grossberg's rule.
Algorithm for training full counterpropagation networks
Step 1: Determine the number of clusters N (the number of hidden-layer neurons) and assign random initial weights: W and V are initialized between the minimum and maximum of the x inputs; U and T between the minimum and maximum of the y inputs.
Step 2: Compute the distance of the pair (x, y) to each cluster unit:

z_j = \sum_{k=1}^{n}(x_k - w_{ik})^2 + \sum_{k=1}^{m}(y_k - u_{ik})^2

Step 3: Compute the output of the hidden layer:

z_i = 1 if i is the smallest integer for which z_i \le z_j for all j; otherwise z_i = 0

Step 4: Update the Kohonen-layer weights:

w_i(k+1) = [1 - \alpha_x(k)]\,w_i(k) + \alpha_x(k)\,x
u_i(k+1) = [1 - \alpha_y(k)]\,u_i(k) + \alpha_y(k)\,y

Step 5: Update the Grossberg-layer weights:

v_{ji}(k+1) = v_{ji}(k) + \beta_x\,[x_j - v_{ji}(k)]\,z_i
t_{ji}(k+1) = t_{ji}(k) + \beta_y\,[y_j - t_{ji}(k)]\,z_i

Step 6: Stop if the network has converged; otherwise go back to Step 2.
Radial Basis Function Neural Networks
RBF networks model the locally tuned response of cortical neurons and possess good mapping capability; they are trained to perform a nonlinear mapping between input and output vector spaces.
The RBFNN constructs the network by function approximation. An RBF NN consists of three layers of neurons: an input layer, a hidden layer (also called the nonlinear processing layer), and an output layer.
Radial Basis Function Neural Networks (cont.)
The output of the RBF NN is defined as

y_i = f_i(x) = \sum_{k=1}^{N} w_{ik}\,\varphi(\|x - c_k\|_2),  i = 1, 2, \ldots, m   (3.146)

where N is the number of hidden-layer neurons, x is the input vector, the c_k are the RBF centers in the input vector space, and the w_{ik} are the output-layer weights.
The basis function \varphi(\cdot) can take many forms:
- linear function: \varphi(x) = x
- cubic approximation: \varphi(x) = x^3
- thin-plate-spline function: \varphi(x) = x^2 \ln x
- Gaussian function (the most commonly used): \varphi(x) = \exp(-x^2/\sigma^2)
- multiquadratic function: \varphi(x) = \sqrt{x^2 + \sigma^2}
- inverse multiquadratic function: \varphi(x) = 1/\sqrt{x^2 + \sigma^2}
Radial Basis Function Neural Networks (cont.)
In the activation functions above, \sigma controls the width of the RBF and is called the spread parameter; the center points c_k are defined points that perform an appropriate sampling of the input vector space.
Each neuron in the hidden layer computes the Euclidean distance between the input and its corresponding center, passes this distance through the basis function, and takes the result as the hidden-layer output.
Finally, these outputs are multiplied by the corresponding weights and summed to give the output y_i. Since the output-layer neurons do not interact, y_1 ~ y_m can be viewed as multiple single-output networks sharing the same hidden layer.
Radial Basis Function Neural Networks (cont.)
Training the RBF NN with Fixed Centers
Two parameters govern the mapping properties of the RBF NN:
- the weights between the hidden and output layers, w_{ik}
- the centers of the radial basis functions, c_k
Broomhead suggested randomly selecting a set of input patterns as the centers. Enough centers should be chosen according to the PDF of the training patterns, but it is difficult to quantify how many centers are appropriate. One can therefore begin with a fairly large number of centers and then use a systematic method to remove those centers whose removal does not degrade the mapping ability of the network.
Once the N centers have been selected, the network output can be expressed as

$$\hat{y}(q)=\sum_{k=1}^{N}w_{k}\,K\!\left(\lVert x(q)-c_k\rVert\right),\qquad q=1,2,\ldots,Q \tag{3.147}$$

where Q is the total number of input patterns.
Radial Basis Function Neural Networks (cont.)
Rewriting (3.147) in vector form,

$$\begin{bmatrix}\hat{y}(1)\\\hat{y}(2)\\\vdots\\\hat{y}(Q)\end{bmatrix}=\begin{bmatrix}K(\lVert x(1)-c_1\rVert)&K(\lVert x(1)-c_2\rVert)&\cdots&K(\lVert x(1)-c_N\rVert)\\K(\lVert x(2)-c_1\rVert)&K(\lVert x(2)-c_2\rVert)&\cdots&K(\lVert x(2)-c_N\rVert)\\\vdots&\vdots&&\vdots\\K(\lVert x(Q)-c_1\rVert)&K(\lVert x(Q)-c_2\rVert)&\cdots&K(\lVert x(Q)-c_N\rVert)\end{bmatrix}\begin{bmatrix}w_1\\w_2\\\vdots\\w_N\end{bmatrix} \tag{3.148}$$

or

$$\hat{y}=\Phi w \tag{3.149}$$

where y-hat is the actual network output, w holds the output-layer weights, and entry (q, n) of Phi is the kernel of the Euclidean distance between the q-th input and the n-th center.
Radial Basis Function Neural Networks (cont.)
Because the centers c_k are fixed, the mapping formed in the hidden layer is also fixed, so training the network reduces to adjusting the output weights. The MSE between the actual and desired outputs can then be written as

$$J(w)=\frac{1}{2}\sum_{q=1}^{Q}\left[y_d(q)-\hat{y}(q)\right]^2=\frac{1}{2}\left(y_d-\hat{y}\right)^{T}\left(y_d-\hat{y}\right) \tag{3.150}$$

Substituting (3.149) into (3.150),

$$J(w)=\frac{1}{2}\left(y_d^{T}y_d-2\,y_d^{T}\Phi w+w^{T}\Phi^{T}\Phi w\right) \tag{3.151}$$
Radial Basis Function Neural Networks (cont.)
Minimizing (3.151) to solve for w:

$$\frac{\partial J(w)}{\partial w}=0 \tag{3.152}$$

$$\Phi^{T}y_d-\Phi^{T}\Phi w=0 \tag{3.153}$$

$$w=\left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}y_d=\Phi^{+}y_d \tag{3.154}$$

where Phi^+ is the pseudoinverse of Phi. If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired network outputs is small.
Radial Basis Function Neural Networks (cont.)
From (3.154) it can be seen that, with the network centers fixed, training yields a closed-form solution, which means the RBF NN trains much faster than BP and counterpropagation networks.
If the number of centers is greater than or equal to the number of training patterns, the difference between the actual and desired outputs is small; in fact, if (3.154) is used, the error in (3.151) becomes zero.
Radial Basis Function Neural Networks (cont.) Example 3.4
Train an RBF NN to approximate the nonlinear function

$$y=e^{-x}\sin(3x)$$

over the interval [0, 4], with 21 neurons in the hidden layer.
Fig. (a): training: sampling interval 0.2, so the hidden layer has 21 neurons (the centers are these 21 sample points; Gaussian RBF, sigma = 0.2).
Fig. (b): testing: sampling interval 0.01, i.e. 401 sample points.
There is slight overfitting, but the result is already much better than BP.
Radial Basis Function Neural Networks (cont.)
Setting the spread parameter
The spread parameter sigma is usually obtained with the following heuristic:

$$\sigma=\frac{d_{\max}}{\sqrt{2K}} \tag{3.155}$$

where d_max is the largest Euclidean distance between the selected centers and K is the number of centers. Using (3.155), the radial basis function of a hidden-layer neuron can be written as

$$K\!\left(\lVert x-c_k\rVert\right)=\exp\!\left(-\frac{K}{d_{\max}^{2}}\,\lVert x-c_k\rVert^{2}\right) \tag{3.156}$$
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with fixed centers
Step 1: Choose the centers for the RBF functions.
Step 2: Calculate the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network.
Step 4: Calculate the output of the neural network according to (3.149).
Step 5: Solve for the network weights using (3.154).
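The five steps can be sketched as follows, assuming a single output, Gaussian kernels as in (3.146), the spread heuristic (3.155), and the pseudoinverse solution (3.154). The function names are hypothetical; the usage borrows the target function of Example 3.4.

```python
import numpy as np

def train_rbf_fixed_centers(X, yd, centers):
    """Steps 1-5: given fixed centers, set sigma by (3.155), solve w by (3.154)."""
    K = len(centers)
    # Step 2: sigma = d_max / sqrt(2K), d_max = largest distance between centers
    dmax = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = dmax / np.sqrt(2 * K)
    # Step 4: the Q x N matrix Phi of (3.148), Gaussian kernel
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.exp(-(r ** 2) / sigma ** 2)
    # Step 5: w = pseudoinverse(Phi) @ yd  -- Eq. (3.154)
    w = np.linalg.pinv(Phi) @ yd
    return w, sigma

def rbf_predict(X, centers, w, sigma):
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(r ** 2) / sigma ** 2) @ w

# Example 3.4 setup: y = exp(-x) sin(3x), sampled at interval 0.2 on [0, 4]
X = np.linspace(0.0, 4.0, 21).reshape(-1, 1)
yd = np.exp(-X[:, 0]) * np.sin(3 * X[:, 0])
w, sigma = train_rbf_fixed_centers(X, yd, centers=X)
```

With the centers equal to all 21 training samples (N = Q), the training error is essentially zero, matching the remark after (3.154); note that Example 3.4 in the slides fixes sigma = 0.2 directly rather than using the heuristic.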
Radial Basis Function Neural Networks (cont.)
Training the RBF NN using the Stochastic Gradient Approach
With fixed centers, only the output-layer weights of the RBF NN can be adjusted, so training is simple; but finding an appropriate number of centers from the input vectors is not easy. A large number of centers must therefore be selected from the input data, which keeps the network architecture quite complex even for fairly simple problems.
The stochastic gradient approach to the RBF NN allows all three kinds of network parameters to be adjusted (the weights, the center locations, and the widths of the RBFs).
Define the instantaneous error cost function

$$J(n)=\frac{1}{2}e^{2}(n)=\frac{1}{2}\left[y_d(n)-\sum_{k=1}^{N}w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\right]^{2} \tag{3.157}$$

If the Gaussian is chosen for the RBF, (3.157) becomes

$$J(n)=\frac{1}{2}\left[y_d(n)-\sum_{k=1}^{N}w_k(n)\exp\!\left(-\frac{\lVert x(n)-c_k(n)\rVert^{2}}{\sigma_k^{2}(n)}\right)\right]^{2} \tag{3.158}$$

The update equations for the network parameters are:

$$w(n+1)=w(n)-\mu_w\frac{\partial J(n)}{\partial w(n)} \tag{3.159}$$

$$c_k(n+1)=c_k(n)-\mu_c\frac{\partial J(n)}{\partial c_k(n)} \tag{3.160}$$

$$\sigma_k(n+1)=\sigma_k(n)-\mu_\sigma\frac{\partial J(n)}{\partial \sigma_k(n)} \tag{3.161}$$

Using steepest descent, J(n) is partially differentiated with respect to w, c, and sigma.
Radial Basis Function Neural Networks (cont.)
RBF NN training algorithm with the stochastic gradient-based method
Step 1: Choose the centers for the RBF functions from the input vectors randomly.
Step 2: Calculate the initial value of the spread parameter for the RBF functions according to (3.155).
Step 3: Initialize the weights in the output layer of the network to some small random values.
Step 4: Present an input vector, and compute the network output according to

$$y(n)=\sum_{k=1}^{N}w_k\,K\!\left(\lVert x(n)-c_k\rVert\right)$$
Radial Basis Function Neural Networks (cont.)
Step 5: Update the network parameters:

$$w(n+1)=w(n)+\mu_w\,e(n)\,\varphi(n)$$

$$c_k(n+1)=c_k(n)+\mu_c\,e(n)\,w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\frac{2\left[x(n)-c_k(n)\right]}{\sigma_k^{2}(n)}$$

$$\sigma_k(n+1)=\sigma_k(n)+\mu_\sigma\,e(n)\,w_k(n)\,K\!\left(\lVert x(n)-c_k(n)\rVert\right)\frac{2\,\lVert x(n)-c_k(n)\rVert^{2}}{\sigma_k^{3}(n)}$$

where

$$\varphi(n)=\left[K\!\left(\lVert x(n)-c_1\rVert\right),K\!\left(\lVert x(n)-c_2\rVert\right),\ldots,K\!\left(\lVert x(n)-c_N\rVert\right)\right]^{T}$$

$$e(n)=y_d(n)-y(n)$$

(Note: this differs from the textbook.)
Step 6: Stop if the network has converged; else, go back to Step 4.
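One iteration of Steps 4 and 5 can be sketched as follows, following the update formulas as printed on this slide (which the slide itself notes differ from the textbook). The function name, learning rates, and the per-center sigma vector are illustrative assumptions.

```python
import numpy as np

def sgd_rbf_step(x, yd, w, C, sigma, mu_w=0.05, mu_c=0.01, mu_s=0.01):
    """One stochastic-gradient update of w, the centers c_k, and sigma_k."""
    diff = x - C                                # x(n) - c_k(n), shape (N, dim)
    r2 = (diff ** 2).sum(axis=1)                # ||x(n) - c_k(n)||^2
    phi = np.exp(-r2 / sigma ** 2)              # Gaussian responses phi(n)
    e = yd - w @ phi                            # e(n) = y_d(n) - y(n)
    # All three updates use the old parameter values
    w_new = w + mu_w * e * phi
    C_new = C + mu_c * e * (w * phi * 2.0 / sigma ** 2)[:, None] * diff
    s_new = sigma + mu_s * e * w * phi * 2.0 * r2 / sigma ** 3
    return w_new, C_new, s_new, e
```

Repeatedly presenting training patterns drives e(n) toward zero; with the cost being non-convex in the centers and spreads, the loop can settle in a local minimum, as noted on the next slide.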
Radial Basis Function Neural Networks (cont.)
Properties of the stochastic gradient-based method:
- With respect to the output-layer weights, the instantaneous error cost function is a convex surface; with respect to the centers and spread parameters of the RBF NN it is not necessarily convex, so the training algorithm tends to get stuck in local minima.
- In general, the learning-rate parameters mu_w, mu_c, and mu_sigma are set to different values.
- The learning rules are still less complex than BP, because the RBF NN needs to adjust only the output-layer weights.
Radial Basis Function Neural Networks (cont.) Orthogonal Least Squares
The main challenge in designing an RBF NN lies in the selection of the centers, which is usually done randomly or with the stochastic gradient method, and always tends to produce quite large networks.
OLS (Orthogonal Least Squares) provides a systematic way of selecting centers that can greatly reduce the size of the RBF NN:
- Treat all training patterns as potential centers.
- Select a new center from the potential centers one at a time (the one that most reduces the network output error becomes the new center).
- Repeat the previous step, each time picking from the input patterns the one with the smallest output error as the newly added center, until the OLS stopping criterion for center selection is satisfied.
Gram-Schmidt orthogonalization is the process of decomposing a matrix M into the product of two matrices (it transforms the set of vectors in M into a set of orthogonal basis vectors W such that W spans the same space as M):

$$M=WA \tag{3.162}$$
Radial Basis Function Neural Networks (cont.)
where A is an upper triangular matrix

$$A=\begin{bmatrix}1&\alpha_{12}&\cdots&\alpha_{1m}\\0&1&\cdots&\alpha_{2m}\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{bmatrix} \tag{3.163}$$

and W consists of m mutually orthogonal vectors:

$$W^{T}W=\operatorname{diag}(h_1,h_2,\ldots,h_m) \tag{3.164}$$

Orthogonal decomposition theorem
Any vector can be decomposed uniquely with respect to a subspace into two mutually orthogonal parts: one part is parallel to the subspace Y, and the other is perpendicular to it,

$$m=m_{Y}+e,\qquad e\perp Y \tag{3.165}$$

where m_Y is the projection of m onto the space Y.
Radial Basis Function Neural Networks (cont.)
Gram-Schmidt orthogonalization algorithm
Step 1: Set the first basis vector equal to one of the columns of matrix M:

$$w_1=m_1$$

Step k: Extract the k-th basis vector so that it is orthogonal to the previous k-1 vectors:

$$\alpha_{ik}=\frac{w_i^{T}m_k}{w_i^{T}w_i},\qquad 1\le i\le k-1$$

$$w_k=m_k-\sum_{i=1}^{k-1}\alpha_{ik}\,w_i$$

Repeat step k until k = m.
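The two steps can be sketched as classical Gram-Schmidt; `gram_schmidt` is an illustrative name, and the columns of `M` are assumed linearly independent.

```python
import numpy as np

def gram_schmidt(M):
    """Factor M = W A with mutually orthogonal columns in W (Eq. 3.162)."""
    m = M.shape[1]
    W = np.zeros_like(M, dtype=float)
    A = np.eye(m)                        # upper triangular, unit diagonal
    W[:, 0] = M[:, 0]                    # Step 1: w_1 = m_1
    for k in range(1, m):                # Step k: orthogonalize m_k
        wk = M[:, k].astype(float)
        for i in range(k):
            A[i, k] = W[:, i] @ M[:, k] / (W[:, i] @ W[:, i])
            wk -= A[i, k] * W[:, i]
        W[:, k] = wk
    return W, A
```

The returned A matches (3.163) and the columns of W satisfy (3.164).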
Orthogonal decomposition theorem
[Figure: geometric illustration of orthogonalizing m_1, m_2, m_3 into w_1, w_2, w_3]

$$w_1=m_1$$

$$w_2=m_2-\frac{w_1^{T}m_2}{w_1^{T}w_1}\,w_1$$

$$w_3=m_3-\frac{w_1^{T}m_3}{w_1^{T}w_1}\,w_1-\frac{w_2^{T}m_3}{w_2^{T}w_2}\,w_2$$
Radial Basis Function Neural Networks (cont.)
Orthogonal Least Squares Regression
The RBF NN can be viewed as a regression model. Once appropriate centers have been selected and the RBF NN constructed, substituting all training patterns gives

$$\begin{bmatrix}y_d(1)\\y_d(2)\\\vdots\\y_d(Q)\end{bmatrix}=\begin{bmatrix}K(\lVert x_1-c_1\rVert,\sigma_1)&K(\lVert x_1-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_1-c_N\rVert,\sigma_N)\\K(\lVert x_2-c_1\rVert,\sigma_1)&K(\lVert x_2-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_2-c_N\rVert,\sigma_N)\\\vdots&\vdots&&\vdots\\K(\lVert x_Q-c_1\rVert,\sigma_1)&K(\lVert x_Q-c_2\rVert,\sigma_2)&\cdots&K(\lVert x_Q-c_N\rVert,\sigma_N)\end{bmatrix}\begin{bmatrix}w_1\\w_2\\\vdots\\w_N\end{bmatrix}+\begin{bmatrix}e_1\\e_2\\\vdots\\e_Q\end{bmatrix} \tag{3.166}$$

or

$$y_d=\Phi w+e \tag{3.167}$$

where e holds the errors between the desired and actual network outputs.
There are Q potential centers in total. If all Q centers are used, an error-free mapping is obtained, but the network size becomes too large; OLS is therefore used to select N centers, N < Q.
Radial Basis Function Neural Networks (cont.)
At each stage of the OLS regression, the input that produces the largest increase in the variance of the desired output is chosen as the new center.
Suppose N < Q centers are chosen. The least-squares solution yielding the weights w_1, w_2, ..., w_N is

$$w=\Phi^{+}y_d \tag{3.168}$$

The actual output of the network is

$$\hat{y}=\Phi w \tag{3.169}$$

By using the Gram-Schmidt orthogonalization, the regression matrix can be decomposed as

$$\Phi=BA=\begin{bmatrix}b_1,b_2,\ldots,b_N\end{bmatrix}\begin{bmatrix}1&a_{12}&a_{13}&\cdots&a_{1N}\\0&1&a_{23}&\cdots&a_{2N}\\\vdots&\vdots&\ddots&&\vdots\\0&0&\cdots&0&1\end{bmatrix} \tag{3.170}$$
Radial Basis Function Neural Networks (cont.)
In (3.170), A is an upper triangular matrix with ones on the diagonal, and B consists of mutually orthogonal vectors:

$$B^{T}B=H=\operatorname{diag}(h_1,h_2,\ldots,h_N) \tag{3.171}$$

where H is a diagonal matrix whose elements h_i are

$$h_i=b_i^{T}b_i=\sum_{k=1}^{Q}b_{ki}^{2} \tag{3.172}$$

Substituting (3.170) into (3.167) gives

$$y_d=\Phi w+e=BAw+e=Bg+e \tag{3.173}$$
Radial Basis Function Neural Networks (cont.)
A least-squares solution for the coordinate vector g is given as

$$g=\left(B^{T}B\right)^{-1}B^{T}y_d=H^{-1}B^{T}y_d \tag{3.174}$$

The i-th coordinate of the vector g is given by

$$g_i=\frac{b_i^{T}y_d}{b_i^{T}b_i} \tag{3.175}$$

The sum of squares of the desired output y_d is

$$y_d^{T}y_d=g^{T}B^{T}Bg+e^{T}e=g^{T}Hg+e^{T}e=\sum_{i=1}^{N}g_i^{2}h_i+e^{T}e \tag{3.176}$$
Radial Basis Function Neural Networks (cont.)
Define the error reduction ratio due to the inclusion of the p-th RBF center as

$$[\mathrm{err}]_p=\frac{g_p^{2}h_p}{y_d^{T}y_d} \tag{3.177}$$

The center-selection procedure works through the training-pattern input points, each time picking the point with the largest error reduction ratio [err]_{p,max} as the newly added center and accumulating [err]_{p,max}, until the preset accuracy is satisfied.
Radial Basis Function Neural Networks (cont.)
Orthogonal least-squares algorithm for training an RBF network
Step 1: k = 1. For 1 <= i <= Q, set

$$b_1^{(i)}=\varphi_i$$

Compute the error reduction ratio of the i-th center as

$$[\mathrm{err}]_1^{(i)}=\frac{\left(b_1^{(i)T}y_d\right)^{2}}{\left(b_1^{(i)T}b_1^{(i)}\right)\left(y_d^{T}y_d\right)}$$

Find

$$[\mathrm{err}]_1^{(i_1)}=\max\left\{[\mathrm{err}]_1^{(i)},\ 1\le i\le Q\right\}$$

and set

$$b_1=b_1^{(i_1)}$$

and the center c_1 = c_{i_1}.
Step k: For 1 <= i <= Q, i != i_1, ..., i != i_{k-1}, compute

$$a_{jk}^{(i)}=\frac{b_j^{T}\varphi_i}{b_j^{T}b_j},\qquad 1\le j\le k-1$$

Set

$$b_k^{(i)}=\varphi_i-\sum_{j=1}^{k-1}a_{jk}^{(i)}\,b_j$$

Compute

$$[\mathrm{err}]_k^{(i)}=\frac{\left(b_k^{(i)T}y_d\right)^{2}}{\left(b_k^{(i)T}b_k^{(i)}\right)\left(y_d^{T}y_d\right)}$$

Find

$$[\mathrm{err}]_k^{(i_k)}=\max\left\{[\mathrm{err}]_k^{(i)},\ 1\le i\le Q,\ i\ne i_1,\ldots,i\ne i_{k-1}\right\}$$

and select

$$b_k=b_k^{(i_k)}$$

and the center c_k = c_{i_k}.
Radial Basis Function Neural Networks (cont.)
Step k+1: Repeat step k. The regression is stopped at step N1 when

$$1-\sum_{j=1}^{N_1}[\mathrm{err}]_j<\rho$$

where 0 < rho < 1 is a selected tolerance value.
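The selection loop (Step 1 through step k+1) can be sketched as follows. `ols_select_centers` is an assumed name; for clarity the orthogonalized candidate vectors are recomputed per candidate rather than cached, and `Phi` is the full Q x Q regression matrix of (3.166).

```python
import numpy as np

def ols_select_centers(Phi, yd, rho):
    """Forward OLS selection: repeatedly pick the column of Phi with the
    largest error reduction ratio (3.177) until 1 - sum([err]) < rho."""
    Q = Phi.shape[1]
    yy = yd @ yd
    selected, basis, total_err = [], [], 0.0
    while True:
        best_err, best_i, best_b = -1.0, None, None
        for i in range(Q):
            if i in selected:
                continue
            b = Phi[:, i].astype(float)
            for u in basis:                       # orthogonalize vs chosen b_j
                b = b - (u @ Phi[:, i]) / (u @ u) * u
            bb = b @ b
            if bb < 1e-12:                        # numerically dependent column
                continue
            err = (b @ yd) ** 2 / (bb * yy)       # [err]_i, Eq. (3.177)
            if err > best_err:
                best_err, best_i, best_b = err, i, b
        if best_i is None:
            break
        selected.append(best_i)
        basis.append(best_b)
        total_err += best_err
        if 1.0 - total_err < rho:                 # stopping rule of step k+1
            break
    return selected
```

The returned indices identify the N < Q training patterns kept as centers; the corresponding weights then follow from the triangular system g = Aw as in (3.173)-(3.175).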
Radial Basis Function Neural Networks (cont.)
Each newly added vector maximizes the increment of the desired network output energy. Every vector included in the regression matrix corresponds to one center among the output data.
The addition of vectors stops once the regression has captured enough of the energy of the desired network output. In the end, OLS produces a relatively small network.
The parameter rho has a major influence on the balance between network accuracy and complexity. If it is too large, the network accuracy is high, but there are too many centers, causing overfitting; if it is too small, the network accuracy drops, but the network is more compact.
In general, rho is set to

$$\rho=1-\frac{\sigma_n^{2}}{\sigma_d^{2}}$$
Radial Basis Function Neural Networks (cont.)
Example 3.5: Design an RBF NN that approximates the mapping z = f(x, y) = cos(3x) sin(2y), -1 <= x <= 1, -1 <= y <= 1.
121 centers; spread parameter = 0.3. The goal is to train the network to map these 121 centers.
After OLS, 28 centers remain, a reduction of 77%.
Radial Basis Function Neural Networks (cont.)