Internet Engineering
Jacek Mazurkiewicz, PhD
Softcomputing
Part 13: Radial Basis Function Networks
Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN's for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection
RBF
• Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model.
• However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.
Typical Applications of NN
• Pattern Classification: $l = f(\mathbf{x})$, $\mathbf{x} \in X \subseteq R^m$, $l \in C \subset N$
• Function Approximation: $\mathbf{y} = f(\mathbf{x})$, $\mathbf{x} \in X \subseteq R^n$, $\mathbf{y} \in Y \subseteq R^m$
• Time-Series Forecasting: $\mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \mathbf{x}_{t-3})$
Function Approximation
Unknown function: $f: X \to Y$, $X \subseteq R^n$, $Y \subseteq R^m$
Approximator: $\hat{f}: X \to Y$
Linear Models
$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$
• $w_i$: weights
• $\phi_i$: fixed basis functions
Linear Models
[Diagram: inputs $x_1, \dots, x_n$ form the feature vector $\mathbf{x}$; hidden units $\phi_1, \dots, \phi_m$ perform decomposition, feature extraction, and transformation; the output unit produces the linearly weighted output $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$ with weights $w_1, \dots, w_m$]
Example Linear Models
• Polynomial: $f(x) = \sum_i w_i \phi_i(x)$, with $\phi_i(x) = x^i$, $i = 0, 1, 2, \dots$
• Fourier Series: $f(x) = \sum_k w_k \phi_k(x)$, with $\phi_k(x) = \exp(j 2\pi k \omega_0 x)$, $k = 0, 1, 2, \dots$
Single-Layer Perceptrons as Universal Approximators
[Diagram: inputs $x_1, \dots, x_n$, sigmoidal hidden units $\phi_1, \dots, \phi_m$, weights $w_1, \dots, w_m$, output $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$]
With a sufficient number of sigmoidal units, it can be a universal approximator.
Radial Basis Function Networks as Universal Approximators
[Diagram: the same topology, with radial-basis-function hidden units in place of sigmoidal ones]
With a sufficient number of radial-basis-function units, it can also be a universal approximator.
Non-Linear Models
$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$
• Both the weights $w_i$ and the basis functions $\phi_i$ are adjusted by the learning process.
Radial Basis Functions
Three parameters for a radial function $\phi_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}_i\|)$:
• Center: $\mathbf{x}_i$
• Distance measure: $r = \|\mathbf{x} - \mathbf{x}_i\|$
• Shape: $\phi$
Typical Radial Functions
• Gaussian: $\phi(r) = e^{-r^2 / (2\sigma^2)}$, $\sigma > 0$ and $r \in R$
• Hardy-Multiquadratic (1971): $\phi(r) = \sqrt{r^2 + c^2}$, $c > 0$ and $r \in R$
• Inverse Multiquadratic: $\phi(r) = c / \sqrt{r^2 + c^2}$, $c > 0$ and $r \in R$
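These three kernels are one-liners in NumPy; a minimal sketch (the function names and the default values of σ and c are illustrative only):

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)), sigma > 0
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    # Hardy multiquadric: phi(r) = sqrt(r^2 + c^2), c > 0
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2), c > 0
    return c / np.sqrt(r**2 + c**2)

r = np.linspace(-10, 10, 5)
print(gaussian(r), multiquadric(r), inverse_multiquadric(r), sep="\n")
```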
Gaussian Basis Function ($\sigma = 0.5, 1.0, 1.5$)
[Plot: $\phi(r) = e^{-r^2/(2\sigma^2)}$ for $\sigma = 0.5, 1.0, 1.5$; larger $\sigma$ gives a wider bell]
Inverse Multiquadratic
[Plot: $\phi(r) = c/\sqrt{r^2 + c^2}$ over $r \in [-10, 10]$ for $c = 1, 2, 3, 4, 5$; all curves peak at $\phi(0) = 1$ and flatten as $c$ grows]
Most General RBF
$\phi_i(\mathbf{x}) = \phi\left(\sqrt{(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)}\right)$
[Figure: three kernels with centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3$ and different covariance shapes]
The basis $\{\phi_i : i = 1, 2, \dots\}$ is `near' orthogonal.
Properties of RBF's
• On-center, off-surround.
• Analogies with localized receptive fields found in several biological structures, e.g.:
– visual cortex;
– ganglion cells.
The Topology of RBF
[Diagram: the feature vector $x_1, \dots, x_n$ feeds hidden units, which perform interpolation; output units $y_1, \dots, y_m$ perform projection]
As a function approximator.
The Topology of RBF
[Diagram: the same network viewed as a classifier; hidden units represent subclasses, output units represent classes]
As a pattern classifier.
• Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.
• The activation function is selected from a class of functions called basis functions.
• They usually train much faster than BP.
• They are less susceptible to problems with non-stationary inputs.
• Popularized by Broomhead and Lowe (1988), and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.
• The major difference between RBF and BP is the behavior of the single hidden layer.
• Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.
The idea
[Figure 1: an unknown function to approximate, with training data sampled from it]
[Figure 2: basis functions (kernels) placed along the input axis, and the function learned, $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$]
[Figure 3: the learned function evaluated at a non-training sample; the kernels interpolate smoothly between training points]
Radial Basis Function Networks as Universal Approximators
[Diagram: inputs $x_1, \dots, x_n$, kernels $\phi_1, \dots, \phi_m$, weights $w_1, \dots, w_m$, output $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$]
Training set: $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$
Goal: $f(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$
$\min \; SSE = \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 = \sum_{k=1}^{p} \left(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(\mathbf{x}^{(k)})\right)^2$
Learn the Optimal Weight Vector
[Diagram: the same network]
Training set: $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$
Goal: $f(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$, with $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$
$\min \; SSE = \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 = \sum_{k=1}^{p} \left(y^{(k)} - \sum_{i=1}^{m} w_i \phi_i(\mathbf{x}^{(k)})\right)^2$
Regularization
Training set: $T = \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}$
Goal: $f(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$
$C = \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \quad \lambda_i \geq 0$
If regularization is unneeded, set $\lambda_i = 0$.
Learn the Optimal Weight Vector
Minimize $C = \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$
$\frac{\partial C}{\partial w_j} = -2 \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right) \frac{\partial f(\mathbf{x}^{(k)})}{\partial w_j} + 2\lambda_j w_j = -2 \sum_{k=1}^{p} \left(y^{(k)} - f(\mathbf{x}^{(k)})\right) \phi_j(\mathbf{x}^{(k)}) + 2\lambda_j w_j = 0$
Hence, at the optimum:
$\sum_{k=1}^{p} f^*(\mathbf{x}^{(k)}) \phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)} \phi_j(\mathbf{x}^{(k)})$
Learn the Optimal Weight Vector
Define
$\boldsymbol{\phi}_j = \left(\phi_j(\mathbf{x}^{(1)}), \dots, \phi_j(\mathbf{x}^{(p)})\right)^T$, $\quad \mathbf{f}^* = \left(f^*(\mathbf{x}^{(1)}), \dots, f^*(\mathbf{x}^{(p)})\right)^T$, $\quad \mathbf{y} = \left(y^{(1)}, \dots, y^{(p)}\right)^T$
Then the optimality condition $\sum_{k=1}^{p} f^*(\mathbf{x}^{(k)}) \phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)} \phi_j(\mathbf{x}^{(k)})$ becomes
$\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}, \quad j = 1, \dots, m$
Learn the Optimal Weight Vector
Stacking the $m$ equations $\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}$, $j = 1, \dots, m$, and defining
$\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \dots, \boldsymbol{\phi}_m), \quad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \quad \mathbf{w}^* = (w_1^*, w_2^*, \dots, w_m^*)^T$
gives
$\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$
Learn the Optimal Weight Vector
$\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$
Since $f^*(\mathbf{x}^{(k)}) = \sum_{i=1}^{m} w_i^* \phi_i(\mathbf{x}^{(k)})$, the vector of fitted values is
$\mathbf{f}^* = \begin{pmatrix} f^*(\mathbf{x}^{(1)}) \\ f^*(\mathbf{x}^{(2)}) \\ \vdots \\ f^*(\mathbf{x}^{(p)}) \end{pmatrix} = \begin{pmatrix} \phi_1(\mathbf{x}^{(1)}) & \phi_2(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ \phi_1(\mathbf{x}^{(2)}) & \phi_2(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)}) \\ \vdots & \vdots & & \vdots \\ \phi_1(\mathbf{x}^{(p)}) & \phi_2(\mathbf{x}^{(p)}) & \cdots & \phi_m(\mathbf{x}^{(p)}) \end{pmatrix} \begin{pmatrix} w_1^* \\ w_2^* \\ \vdots \\ w_m^* \end{pmatrix} = \boldsymbol{\Phi} \mathbf{w}^*$
Substituting:
$\boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$
Learn the Optimal Weight Vector
$\left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}\right) \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$
$\mathbf{w}^* = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}\right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
$\mathbf{A}^{-1}$: variance matrix; $\boldsymbol{\Phi}$: design matrix
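This closed form translates directly into code. A minimal sketch, assuming Gaussian kernels with fixed centers, a shared width, and $\boldsymbol{\Lambda} = \lambda \mathbf{I}$ (the helper names and the toy sine target are illustrative, not from the lecture):

```python
import numpy as np

def design_matrix(X, centers, sigma):
    # Phi[k, j] = exp(-||x^(k) - mu_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_weights(Phi, y, lam=0.0):
    # w* = (Phi^T Phi + Lambda)^{-1} Phi^T y, with Lambda = lam * I
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# Toy usage: approximate y = sin(x) from p = 50 noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
centers = np.linspace(-3, 3, 10).reshape(-1, 1)
Phi = design_matrix(X, centers, sigma=0.7)
w = fit_weights(Phi, y, lam=1e-3)
print("SSE:", ((y - Phi @ w) ** 2).sum())
```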
The Empirical-Error Vector
$\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
[Diagram: the unknown function generates targets $\mathbf{y}$ from the inputs $\mathbf{x}^{(k)}$; the network with kernels $\phi_1, \dots, \phi_m$ and optimal weights $w_1^*, \dots, w_m^*$ produces the fitted values $\mathbf{f}^* = \boldsymbol{\Phi} \mathbf{w}^*$]
Error Vector
$\mathbf{e} = \mathbf{y} - \mathbf{f}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{w}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \left(\mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T\right) \mathbf{y} = \mathbf{P} \mathbf{y}$
Sum-Squared-Error
With $\mathbf{e} = \mathbf{P}\mathbf{y}$, $\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$, $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}$, $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_m)$:
$SSE = \sum_{k=1}^{p}\left(y^{(k)} - f^*(\mathbf{x}^{(k)})\right)^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{P}\mathbf{y})^T(\mathbf{P}\mathbf{y}) = \mathbf{y}^T\mathbf{P}^2\mathbf{y}$
If $\boldsymbol{\Lambda} = 0$, the RBFN's learning algorithm is to minimize SSE (MSE).
The Projection Matrix
With $\boldsymbol{\Lambda} = 0$, the error vector satisfies $\mathbf{e} \perp \mathrm{span}(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \dots, \boldsymbol{\phi}_m)$, so
$\mathbf{P}(\mathbf{P}\mathbf{y}) = \mathbf{P}\mathbf{e} = \mathbf{e} = \mathbf{P}\mathbf{y}$
i.e. $\mathbf{P}$ is a projection, and therefore
$SSE = \mathbf{y}^T\mathbf{P}^2\mathbf{y} = \mathbf{y}^T\mathbf{P}\mathbf{y}$
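A quick numerical check of these identities for the $\boldsymbol{\Lambda} = 0$ case (random design matrix, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 5))     # a random p x m design matrix
y = rng.standard_normal(20)

A = Phi.T @ Phi                        # Lambda = 0
P = np.eye(20) - Phi @ np.linalg.solve(A, Phi.T)

e = P @ y                              # error vector e = Py
print(np.allclose(P @ P, P))           # P is a projection: P^2 = P
print(np.allclose(e @ e, y @ P @ y))   # SSE = y^T P y
print(np.allclose(Phi.T @ e, 0))       # e is orthogonal to span(phi_1..phi_m)
```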
RBFN's as Universal Approximators
[Diagram: inputs $x_1, \dots, x_n$; kernels $\phi_1, \dots, \phi_m$; weights $w_{ij}$ from kernel $j$ to output $i$; outputs $y_1, \dots, y_l$]
Training set: $T = \{(\mathbf{x}^{(k)}, \mathbf{y}^{(k)})\}_{k=1}^{p}$
Kernels: $\phi_j(\mathbf{x}) = \exp\left(-\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$
What to Learn?
• Weights $w_{ij}$'s
• Centers $\boldsymbol{\mu}_j$'s of $\phi_j$'s
• Widths $\sigma_j$'s of $\phi_j$'s
• Number of $\phi_j$'s (model selection)
One-Stage Learning
$\Delta w_{ij} = \eta_1 \left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right) \phi_j(\mathbf{x}^{(k)})$
$\Delta \boldsymbol{\mu}_j = \eta_2 \dfrac{\mathbf{x}^{(k)} - \boldsymbol{\mu}_j}{\sigma_j^2} \, \phi_j(\mathbf{x}^{(k)}) \sum_{i=1}^{l} w_{ij} \left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right)$
$\Delta \sigma_j = \eta_3 \dfrac{\|\mathbf{x}^{(k)} - \boldsymbol{\mu}_j\|^2}{\sigma_j^3} \, \phi_j(\mathbf{x}^{(k)}) \sum_{i=1}^{l} w_{ij} \left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right)$
One-Stage Learning
The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings.
[Diagram: the multi-output RBFN whose weights, centers, and widths are all updated together]
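A sketch of one such on-line step for a single-output network with Gaussian kernels, assuming the update rules above (shapes, learning rates η, and all names are illustrative):

```python
import numpy as np

def one_stage_step(x, y, w, mu, sigma, eta=(0.05, 0.01, 0.01)):
    # One gradient step on a sample (x, y) for a single-output RBFN.
    d2 = ((x - mu) ** 2).sum(axis=1)          # ||x - mu_j||^2 for each kernel j
    phi = np.exp(-d2 / (2.0 * sigma**2))      # Gaussian activations phi_j(x)
    err = y - w @ phi                         # y^(k) - f(x^(k))
    dw = eta[0] * err * phi                                            # Delta w_j
    dmu = eta[1] * err * (w * phi / sigma**2)[:, None] * (x - mu)      # Delta mu_j
    dsigma = eta[2] * err * w * phi * d2 / sigma**3                    # Delta sigma_j
    return w + dw, mu + dmu, sigma + dsigma

# Toy usage (shapes: x is (n,), mu is (m, n), w and sigma are (m,)):
rng = np.random.default_rng(0)
w, sigma = rng.standard_normal(4), np.ones(4)
mu = np.linspace(-2.0, 2.0, 4).reshape(-1, 1)
w, mu, sigma = one_stage_step(np.array([0.5]), 1.0, w, mu, sigma)
```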
Two-Stage Training
Step 1 determines:
• Centers $\boldsymbol{\mu}_j$'s of $\phi_j$'s
• Widths $\sigma_j$'s of $\phi_j$'s
• Number of $\phi_j$'s
Step 2 determines the $w_{ij}$'s, e.g., using batch learning.
Train the Kernels: Unsupervised Training
Methods:
• Subset Selection
– Random Subset Selection
– Forward Selection
– Backward Elimination
• Clustering Algorithms (see the sketch after this list)
– KMEANS
– LVQ
• Mixture Models
– GMM
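For instance, the clustering route of step 1 can be sketched with a plain k-means pass over the training inputs (an illustrative helper, not the lecture's algorithm):

```python
import numpy as np

def kmeans_centers(X, m, iters=50, seed=0):
    # Pick m kernel centers by k-means on the training inputs X (p x n).
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), m, replace=False)]   # random initial centers
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):                # recompute each center
                mu[j] = X[labels == j].mean(axis=0)
    return mu

# Widths sigma_j can then be set, e.g., from the spread of each cluster
# or from the distance to the nearest other center.
```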
Subset Selection
Random Subset Selection
• Randomly choose a subset of points from the training set as centers.
• Sensitive to the initially chosen points.
• Use adaptive techniques to tune:
– Centers
– Widths
– Number of points
Clustering Algorithms
[Figure: four stages (1-4) of clustering the training inputs to place the kernel centers]
Questions: How should the user choose the kernel?
• The problem is similar to that of selecting features for other learning algorithms.
– Poor choice: learning made very difficult.
– Good choice: even poor learners could succeed.
• The requirement placed on the user is thus critical.
– Can this requirement be lessened?
– Is a more automatic selection of features possible?
Goal Revisited
• Ultimate Goal: Generalization (minimize prediction error)
• Goal of Our Learning Procedure: Minimize empirical error
Badness of Fit
• Underfitting
– A model (e.g., network) that is not sufficiently complex can fail to fully capture the signal in a complicated data set, leading to underfitting.
– Produces excessive bias in the outputs.
• Overfitting
– A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– Produces excessive variance in the outputs.
Underfitting/Overfitting Avoidance
• Model selection
• Jittering
• Early stopping
• Weight decay
– Regularization
– Ridge regression
• Bayesian learning
• Combining networks
Best Way to Avoid Overfitting
• Use lots of training data, e.g.:
– 30 times as many training cases as there are weights in the network;
– for noise-free data, 5 times as many training cases as weights may be sufficient.
• Don't arbitrarily reduce the number of weights for fear of underfitting.
Badness of Fit
[Figure: an underfit model and an overfit model on the same data]
Bias-Variance Dilemma
Underfit: large bias, small variance
Overfit: small bias, large variance
More on overfitting
• Easily leads to predictions that are far beyond the range of the training data.
• Produces wild predictions in multilayer perceptrons, even with noise-free data.
Bias-Variance Dilemma
[Figure: as the model moves from underfitting toward overfitting, bias decreases while variance increases; the best fit balances the two]
Bias-Variance Dilemma
[Figure: the set of functions reachable by the model (e.g., depending on the number of hidden nodes used), around the true model $g(\mathbf{x})$ with noise $\varepsilon$; solutions obtained with different training sets scatter around $E[f(\mathbf{x})]$, the offset from $g(\mathbf{x})$ is the bias and the scatter is the variance]
The true model: $y = g(\mathbf{x}) + \varepsilon$
The approximator: $f(\mathbf{x})$
Bias-Variance Dilemma
Goal: $\min E\left[(y - f(\mathbf{x}))^2\right]$
$E\left[(y - f(\mathbf{x}))^2\right] = E\left[(g(\mathbf{x}) + \varepsilon - f(\mathbf{x}))^2\right]$
$= E\left[(g(\mathbf{x}) - f(\mathbf{x}))^2 + 2\varepsilon(g(\mathbf{x}) - f(\mathbf{x})) + \varepsilon^2\right]$
$= E\left[(g(\mathbf{x}) - f(\mathbf{x}))^2\right] + 2E\left[\varepsilon(g(\mathbf{x}) - f(\mathbf{x}))\right] + E[\varepsilon^2]$
The middle term is 0 and the last term is a constant, so the goal reduces to $\min E\left[(g(\mathbf{x}) - f(\mathbf{x}))^2\right]$.
Bias-Variance Dilemma
$E\left[(g(\mathbf{x}) - f(\mathbf{x}))^2\right] = E\left[\left((g(\mathbf{x}) - E[f(\mathbf{x})]) + (E[f(\mathbf{x})] - f(\mathbf{x}))\right)^2\right]$
$= E\left[(g(\mathbf{x}) - E[f(\mathbf{x})])^2\right] + E\left[(f(\mathbf{x}) - E[f(\mathbf{x})])^2\right] - 2E\left[(g(\mathbf{x}) - E[f(\mathbf{x})])(f(\mathbf{x}) - E[f(\mathbf{x})])\right]$
The cross term vanishes, since $g(\mathbf{x})$ and $E[f(\mathbf{x})]$ are constants with respect to the expectation over training sets:
$E\left[(g(\mathbf{x}) - E[f(\mathbf{x})])(f(\mathbf{x}) - E[f(\mathbf{x})])\right] = (g(\mathbf{x}) - E[f(\mathbf{x})])\left(E[f(\mathbf{x})] - E[f(\mathbf{x})]\right) = 0$
Bias-Variance Dilemma
$E\left[(y - f(\mathbf{x}))^2\right] = E[\varepsilon^2] + E\left[(g(\mathbf{x}) - E[f(\mathbf{x})])^2\right] + E\left[(f(\mathbf{x}) - E[f(\mathbf{x})])^2\right]$
= noise + bias² + variance
The noise term cannot be minimized; the learner must minimize both bias² and variance.
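The decomposition can be verified numerically by fitting the same model on many independent training sets and averaging; a rough Monte-Carlo sketch (the target g, the noise level, and the cubic model are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * x)                 # the true model g(x)
x0, sigma_eps, runs = 0.5, 0.3, 2000        # test point, noise std, #training sets

preds = []
for _ in range(runs):
    x = rng.uniform(-1, 1, 30)              # a fresh training set each run
    y = g(x) + sigma_eps * rng.standard_normal(30)
    coeffs = np.polyfit(x, y, deg=3)        # fit a cubic on this training set
    preds.append(np.polyval(coeffs, x0))    # f(x0) for this training set
preds = np.array(preds)

bias2 = (g(x0) - preds.mean()) ** 2         # (g(x) - E[f(x)])^2
variance = preds.var()                      # E[(f(x) - E[f(x)])^2]
print(bias2, variance, sigma_eps**2)        # bias^2, variance, noise
```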
Model Complexity vs. Bias-Variance
$E\left[(y - f(\mathbf{x}))^2\right] = E[\varepsilon^2] + E\left[(g(\mathbf{x}) - E[f(\mathbf{x})])^2\right] + E\left[(f(\mathbf{x}) - E[f(\mathbf{x})])^2\right]$
= noise + bias² + variance
As model complexity (capacity) grows, bias² falls while variance rises; prediction error is minimized at an intermediate complexity.
Example (Polynomial Fits)
$y = x + 2\exp(-16x^2)$
[Figure: polynomial fits of degree 1, 5, 10, and 15 to noisy samples of this function; low degrees underfit, high degrees overfit]
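The experiment is easy to reproduce; a sketch with NumPy's polynomial fitting (the sample size and noise level are assumptions). Note that the training SSE keeps falling as the degree grows, even while the fit overfits:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x + 2 * np.exp(-16 * x**2) + 0.1 * rng.standard_normal(30)

for deg in (1, 5, 10, 15):
    coeffs = np.polyfit(x, y, deg)                  # least-squares polynomial fit
    sse = ((y - np.polyval(coeffs, x)) ** 2).sum()  # training SSE
    print(f"degree {deg:2d}: SSE = {sse:.4f}")
```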
Variance Estimation
Mean: $\mu$ (in general, not available)
Variance: $\hat{\sigma}^2 = \dfrac{1}{p} \sum_{i=1}^{p} (x_i - \mu)^2$
Variance Estimation
Mean: $\hat{\mu} = \bar{x} = \dfrac{1}{p} \sum_{i=1}^{p} x_i$
Variance: $\hat{\sigma}^2 = s^2 = \dfrac{1}{p-1} \sum_{i=1}^{p} (x_i - \bar{x})^2$
Estimating the mean costs 1 degree of freedom.
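In NumPy this correction is the ddof argument (a one-line illustration with arbitrary data):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.var(x))          # biased estimate: divides by p
print(np.var(x, ddof=1))  # sample variance: divides by p - 1
```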
Simple Linear Regression
$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
[Plot: data points scattered around the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$]
Mean Squared Error (MSE)
$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - 2}$
Estimating the two coefficients costs 2 degrees of freedom.
Variance Estimation
$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - m}$
Fitting the model costs m degrees of freedom, where m is the number of parameters of the model.
The Number of Parameters
[Diagram: the RBFN $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \phi_i(\mathbf{x})$ with weights $w_1, \dots, w_m$]
Number of degrees of freedom: m
The Effective Number of Parameters (γ)
The projection matrix: $\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$, $\quad \mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda}$
$\gamma = p - \mathrm{trace}(\mathbf{P})$
When $\boldsymbol{\Lambda} = 0$: $\gamma = p - \mathrm{trace}(\mathbf{P}) = m$
The Effective Number of Parameters (γ)
Proof (for $\boldsymbol{\Lambda} = 0$, so $\mathbf{A} = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$):
$\mathrm{trace}(\mathbf{P}) = \mathrm{trace}\left(\mathbf{I}_p - \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\right) = p - \mathrm{trace}\left(\boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\right) = p - \mathrm{trace}\left((\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right) = p - \mathrm{trace}(\mathbf{I}_m) = p - m$
Hence $\gamma = p - \mathrm{trace}(\mathbf{P}) = m$.
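Numerically, $\gamma = p - \mathrm{trace}(\mathbf{P})$, and the estimate $\hat{\sigma}^2 = SSE / \mathrm{trace}(\mathbf{P})$ from a later slide follows directly; an illustrative check with a random design matrix and arbitrary regularizer values:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 40, 8
Phi = rng.standard_normal((p, m))          # design matrix
y = rng.standard_normal(p)

for lam in (0.0, 1.0, 10.0):
    A = Phi.T @ Phi + lam * np.eye(m)      # A = Phi^T Phi + Lambda
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    gamma = p - np.trace(P)                # effective number of parameters
    sse = ((P @ y) ** 2).sum()             # SSE = ||Py||^2
    print(lam, round(gamma, 3), sse / np.trace(P))  # gamma = m when lam = 0, < m otherwise
```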
Regularization
Cost = Empirical Error + Model's Complexity Penalty:
$C = \sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m}\lambda_i w_i^2$
The first term is the SSE, $\sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m} w_i\phi_i(\mathbf{x}^{(k)})\right)^2$; the second term penalizes models with large weights.
Regularization
Without penalty ($\lambda_i = 0$), there are m degrees of freedom available to minimize SSE (the cost), and the effective number of parameters is $\gamma = m$.
Regularization
With penalty ($\lambda_i > 0$), the liberty to minimize SSE is reduced, and the effective number of parameters satisfies $\gamma < m$.
Variance Estimation
$\hat{\sigma}^2 = MSE = \dfrac{SSE}{p - \gamma}$
Fitting the regularized model costs γ degrees of freedom.
Variance Estimation
$\hat{\sigma}^2 = MSE = \dfrac{SSE}{\mathrm{trace}(\mathbf{P})}$
(since $\mathrm{trace}(\mathbf{P}) = p - \gamma$)
Model Selection
• Goal
– Choose the fittest model
• Criteria
– Least prediction error
• Main Tools (Estimate Model Fitness)
– Cross-validation
– Projection matrix
• Methods
– Weight decay (ridge regression)
– Pruning and growing RBFN's
Empirical Error vs. Model Fitness
• Goal of Our Learning Procedure: Minimize empirical error (MSE)
• Ultimate Goal: Generalization (minimize prediction error)
Estimating Prediction Error
• When you have plenty of data, use independent test sets, e.g., use the same training set to train different models, and choose the best model by comparing on the test set.
• When data is scarce, use:
– Cross-validation
– Bootstrap
Cross Validation
• Simplest and most widely used method for estimating prediction error.
• Partition the original set in several different ways and compute an average score over the different partitions, e.g.:
– K-fold cross-validation (see the sketch below)
– Leave-one-out cross-validation
– Generalized cross-validation
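A minimal K-fold loop for scoring one candidate model; `fit` and `predict` are placeholders for, e.g., the hypothetical RBFN helpers sketched earlier (the fold count and all names are assumptions):

```python
import numpy as np

def kfold_cv_score(X, y, fit, predict, K=5, seed=0):
    # Average held-out mean squared error over K folds for one candidate model.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    scores = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                        # train on K-1 folds
        err = y[test] - predict(model, X[test])                # score on held-out fold
        scores.append((err ** 2).mean())
    return float(np.mean(scores))
```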
Comparison of RBF and MLP

MLP                              | RBF
Global hyperplane                | Local receptive field
EBP                              | LMS
Serious local minima             | Local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time         | Longer computation time
Longer learning time             | Shorter learning time
Comparison of RBF and MLP

                        | RBF networks                                                   | MLP
Learning speed          | Very fast                                                      | Very slow
Convergence             | Almost guaranteed                                              | Not guaranteed
Response time           | Slow                                                           | Fast
Memory requirement      | Very large                                                     | Small
Hardware implementation | IBM ZISC036, Nestor Ni1000 (www-5.ibm.com/fr/cdlab/zisc.html)  | Voice Direct 364 (www.sensoryinc.com)
Generalization          | Usually better                                                 | Usually poorer