Lecture 23: Artificial neural networks
• Broad field that has developed over the past 20 to 30 years
• Confluence of statistical mechanics, applied math, biology and computers
• Original motivation: mathematical modeling of neurological networks
• Practical applications: pattern recognition (e.g. applied to speech, handwriting, underwriting)
– used by USPS to read handwritten zip codes
– can be very fast, particularly when implemented in special-purpose hardware
Biological neurons

[Figure: schematic structure of a neuron (Cichocki and Unbehauen, reproduced from Principles of Neurocomputing for Science & Engineering by Ham and Kostanic), showing dendrites, axons, and synapses]

• Input signals come from the axons of other neurons, which connect to dendrites (input terminals) at the synapses
• A synapse can be excitatory or inhibitory
• If a sufficient excitatory signal is received, the neuron fires and sends an output signal along its axon
Mathematical model

• Nonlinear model of an artificial neuron:

[Figure: input signals ξ1, ξ2, ξ3, …, ξN are multiplied by synaptic weights w1, w2, w3, …, wN, summed (Σ), and passed through an activation or "threshold" function g(h) to produce the output signal O]
• Input and output signals are normalized, typically over the range [–1, +1] or [0, +1]
• Activation function can be
– linear: g(h) = h
– step-like: g(h) = sgn(h)
– sigmoid: g(h) = tanh(βh) or 1/(1+e^–2βh)
• Neuron output: O = g(Σ_k w_k ξ_k), as sketched below
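As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this neuron model; the function and variable names are our own:

```python
import numpy as np

def g_linear(h):
    return h                      # linear: g(h) = h

def g_step(h):
    return np.sign(h)             # step-like: g(h) = sgn(h)

def g_sigmoid(h, beta=1.0):
    return np.tanh(beta * h)      # sigmoid: g(h) = tanh(beta * h)

def neuron_output(xi, w, g=g_sigmoid):
    """Output of a single artificial neuron: O = g(sum_k w_k * xi_k)."""
    h = np.dot(w, xi)             # weighted sum of the input signals
    return g(h)

# Example: three inputs normalized to [-1, +1], with arbitrary weights
xi = np.array([0.5, -1.0, 1.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron_output(xi, w))
```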
Network architecture

• Here, we confine our attention to feed-forward networks (a.k.a. "perceptrons"): no feedback loops, so the transfer of information is unidirectional
• Simplest example: a 1-layer perceptron with N inputs (ξk, with k = 1, N) connected to M outputs (Οi, with i = 1, M) via MN synaptic weights w_ik:

[Figure: inputs ξ1, ξ2, ξ3 each connected to outputs Ο1, Ο2]

$$O_i = g\left(\sum_{k=1}^{N} w_{ik}\,\xi_k\right)$$
Example: the logical AND function

• Represent by a 1 × 3 perceptron (an example of setting thresholds with hardwired inputs): inputs ξ1 and ξ2 are –1 or +1, with weights w1 = w2 = 1, while the third input is hardwired to ξ3 = +1 always, with weight w3 = –1.5:

Ο1 = sgn(ξ1 + ξ2 – 1.5)

A quick truth-table check appears below.
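A numerical check of this construction (a sketch, using numpy's sign function for sgn):

```python
import numpy as np

w = np.array([1.0, 1.0, -1.5])   # w1 = w2 = 1; w3 = -1.5 multiplies the hardwired input

for xi1 in (-1, +1):
    for xi2 in (-1, +1):
        xi = np.array([xi1, xi2, +1.0])     # third input is +1 always
        O1 = np.sign(np.dot(w, xi))         # O1 = sgn(xi1 + xi2 - 1.5)
        print(f"xi1={xi1:+d}  xi2={xi2:+d}  ->  O1={int(O1):+d}")

# Prints O1 = +1 only when both inputs are +1: the logical AND
```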
Training the network

• How can we teach the network to yield desired responses?
– Need a set of desired inputs and outputs: ξ_k^μ and ζ_i^μ, for μ = 1, p
– Need a measure of learning: the "cost function", E(w), that we seek to minimize, where w is the matrix of synaptic weights, w_ik
– Finding the best set of weights is then an MN-dimensional minimization problem
The cost function, E(w)

Two common choices (sketched in code below):

1) Mean square error:

$$E_{\rm MSE}({\bf w}) = \frac{1}{2}\sum_{\mu}\sum_{i}\left(\zeta_i^\mu - O_i^\mu\right)^2 = \frac{1}{2}\sum_{\mu}\sum_{i}\left(\zeta_i^\mu - g\Big(\sum_k w_{ik}\,\xi_k^\mu\Big)\right)^2$$

2) Relative entropy (for the specific case of the tanh activation function):

$$E_{\rm RE}({\bf w}) = \frac{1}{2}\sum_{\mu}\sum_{i}\left[(1+\zeta_i^\mu)\ln\frac{1+\zeta_i^\mu}{1+O_i^\mu} + (1-\zeta_i^\mu)\ln\frac{1-\zeta_i^\mu}{1-O_i^\mu}\right]$$
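Both choices are easy to express in code. A sketch (names and array shapes are our own; zeta and O hold the targets ζ_i^μ and outputs O_i^μ as arrays of shape (p, M)):

```python
import numpy as np

def E_mse(zeta, O):
    """Mean square error, summed over patterns mu and output nodes i."""
    return 0.5 * np.sum((zeta - O) ** 2)

def E_re(zeta, O):
    """Relative entropy cost for the tanh activation function.

    Assumes targets lie strictly inside (-1, +1); for zeta = +/-1 the
    corresponding term should be taken as zero by convention.
    """
    return 0.5 * np.sum((1 + zeta) * np.log((1 + zeta) / (1 + O))
                        + (1 - zeta) * np.log((1 - zeta) / (1 - O)))
```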
Minimizing E(w)

• Simple training method: start with an initial guess of the weights and change them in accord with

$$\Delta w_{ik} = -\eta\,\frac{\partial E}{\partial w_{ik}}$$

(move along the direction of steepest descent), where η is the "learning rate", ~ 0 to 1
Minimizing E(w)

• The derivatives are easy to compute:

$$\frac{\partial E_{\rm MSE}}{\partial w_{ik}} = -\sum_{\mu}\left(\zeta_i^\mu - O_i^\mu\right) g'\Big(\sum_j w_{ij}\,\xi_j^\mu\Big)\,\xi_k^\mu$$

which is conveniently written as

$$\frac{\partial E_{\rm MSE}}{\partial w_{ik}} = -\sum_{\mu}\delta_i^\mu\,\xi_k^\mu$$

with

$$\delta_i^\mu = \left(\zeta_i^\mu - O_i^\mu\right) g'\Big(\sum_j w_{ij}\,\xi_j^\mu\Big)$$

and

$$g'(h) = \beta\,[1 - g(h)^2] \quad\text{for } g = \tanh(\beta h)$$
$$g'(h) = 2\beta\,g(h)\,[1 - g(h)] \quad\text{for } g = 1/(1+e^{-2\beta h})$$
Minimizing E(w)

• For the RE cost function, we find that

$$\frac{\partial E_{\rm RE}}{\partial w_{ik}} = -\sum_{\mu}\delta_i^\mu\,\xi_k^\mu$$

where, for the tanh activation function,

$$\delta_i^\mu = \beta\left(\zeta_i^\mu - O_i^\mu\right)$$
Minimizing E(w)

• So the procedure is to start with an initial set of weights and to change them iteratively according to

$$\Delta w_{ik} = \eta\sum_{\mu}\delta_i^\mu\,\xi_k^\mu$$

until the cost function changes by less than some preset amount; a code sketch follows.
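Putting the pieces together, here is a minimal sketch of this training loop for a 1-layer perceptron with the tanh activation and the MSE cost; the initialization range, tolerance, and iteration cap are our own choices:

```python
import numpy as np

def train_perceptron(xi, zeta, eta=0.1, beta=1.0, tol=1e-6, max_iter=10000):
    """xi: (p, N) input patterns; zeta: (p, M) target outputs.
    Returns the (M, N) matrix of synaptic weights."""
    rng = np.random.default_rng(0)
    M, N = zeta.shape[1], xi.shape[1]
    w = rng.uniform(-0.1, 0.1, size=(M, N))     # initial guess of the weights
    E_old = np.inf
    for _ in range(max_iter):
        O = np.tanh(beta * (xi @ w.T))          # O_i = g(sum_k w_ik xi_k)
        delta = (zeta - O) * beta * (1 - O**2)  # delta_i = (zeta_i - O_i) g'(h_i)
        w += eta * (delta.T @ xi)               # dw_ik = eta * sum_mu delta_i xi_k
        E = 0.5 * np.sum((zeta - O) ** 2)       # MSE cost function
        if abs(E_old - E) < tol:                # stop when E changes by < tol
            break
        E_old = E
    return w
```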
Geometric interpretation

• For each output node, there is an (N – 1)-dimensional hyperplane which separates input values yielding O > 0 from those with O < 0

[Figure: a plane in the (ξ1, ξ2, ξ3) input space separating the two classes of inputs]
Multi-layer networks

• Interest in feed-forward networks was limited until it was realized that a 2-layer network can describe any continuous function of the inputs, and a 3-layer network can describe any function of the inputs

[Figure: 2-layer network with inputs ξ1, ξ2, ξ3, a hidden layer V1, V2, and outputs O1, O2, O3]

• Obviously, if the activation function is linear, the two-layer network is equivalent to a one-layer network with synaptic weights q_ik = Σ_j W_ij w_jk (verified numerically below)
• But for the sigmoid or step-function (sgn) activation function, interesting new behavior can result
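The linear case is easy to check numerically (a sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 3))    # hidden-layer weights w_jk: 4 hidden nodes, 3 inputs
W = rng.standard_normal((2, 4))    # output-layer weights W_ij: 2 output nodes
xi = rng.standard_normal(3)

O_two_layer = W @ (w @ xi)         # linear activation g(h) = h applied at both layers
q = W @ w                          # q_ik = sum_j W_ij w_jk
O_one_layer = q @ xi
print(np.allclose(O_two_layer, O_one_layer))   # True: the two layers collapse into one
```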
Training multilayer perceptrons

• As before, we minimize the cost function, e.g.

$$E_{\rm MSE}({\bf W},{\bf w}) = \frac{1}{2}\sum_{\mu}\sum_{i}\left(\zeta_i^\mu - O_i^\mu\right)^2$$

but now we have to vary two sets of weights.

1) The derivatives w.r.t. W_ik are

$$\frac{\partial E_{\rm MSE}}{\partial W_{ik}} = -\sum_{\mu}\left(\zeta_i^\mu - O_i^\mu\right) g'\Big(\sum_j W_{ij}\,V_j^\mu\Big)\,V_k^\mu = -\sum_{\mu}\delta_i^\mu\,V_k^\mu$$

where

$$\delta_i^\mu \equiv \left(\zeta_i^\mu - O_i^\mu\right) g'(H_i^\mu) \qquad\text{and}\qquad H_i^\mu \equiv \sum_k W_{ik}\,V_k^\mu$$
2) The derivatives w.r.t. the hidden-layer weights w_jk are computed using the chain rule:

$$\frac{\partial E_{\rm MSE}}{\partial w_{jk}} = \sum_{\mu}\frac{\partial E_{\rm MSE}}{\partial V_j^\mu}\,\frac{\partial V_j^\mu}{\partial w_{jk}} = -\sum_{\mu}\sum_{i}\left(\zeta_i^\mu - O_i^\mu\right) g'(H_i^\mu)\,W_{ij}\,g'(h_j^\mu)\,\xi_k^\mu$$

where h_j^μ ≡ Σ_k w_jk ξ_k^μ. For each layer, we update the weights by moving along the direction of steepest descent:

$$\Delta W_{ik} = -\eta\,\frac{\partial E}{\partial W_{ik}} \qquad\qquad \Delta w_{ik} = -\eta\,\frac{\partial E}{\partial w_{ik}}$$
• So the training procedure is (see the sketch below):
1. Initialize all weights to random values
2. Propagate the input signal forwards through the network to compute the intermediate and output signals
3. Compute the cost function and its derivatives w.r.t. each weight, starting with the final layer and working backwards
4. Update all weights
5. Return to 2 (or stop if the convergence criterion is met)
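A compact Python sketch of steps 1–5 for a 2-layer perceptron with tanh activations and the MSE cost (a hand-rolled illustration; the hidden-layer size, initialization range, and stopping rule are our own choices):

```python
import numpy as np

def g(h, beta=1.0):
    return np.tanh(beta * h)

def g_prime(h, beta=1.0):
    return beta * (1.0 - np.tanh(beta * h) ** 2)    # g'(h) = beta [1 - g(h)^2]

def train_two_layer(xi, zeta, n_hidden=5, eta=0.1, tol=1e-8, max_iter=20000):
    """xi: (p, N) input patterns; zeta: (p, M) target outputs."""
    rng = np.random.default_rng(0)
    N, M = xi.shape[1], zeta.shape[1]
    # 1. Initialize all weights to random values
    w = rng.uniform(-0.5, 0.5, size=(n_hidden, N))  # hidden-layer weights w_jk
    W = rng.uniform(-0.5, 0.5, size=(M, n_hidden))  # output-layer weights W_ij
    E_old = np.inf
    for _ in range(max_iter):
        # 2. Propagate the input signal forwards
        h = xi @ w.T                                # h_j = sum_k w_jk xi_k
        V = g(h)                                    # hidden-layer signals
        H = V @ W.T                                 # H_i = sum_j W_ij V_j
        O = g(H)                                    # output signals
        # 3. Cost function and its derivatives, final layer first
        E = 0.5 * np.sum((zeta - O) ** 2)
        delta_out = (zeta - O) * g_prime(H)         # delta_i = (zeta_i - O_i) g'(H_i)
        delta_hid = (delta_out @ W) * g_prime(h)    # backpropagated through W_ij
        # 4. Update all weights along the direction of steepest descent
        W += eta * (delta_out.T @ V)
        w += eta * (delta_hid.T @ xi)
        # 5. Return to 2, or stop if the convergence criterion is met
        if abs(E_old - E) < tol:
            break
        E_old = E
    return w, W
```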
Improvements in minimization

• As we know from previous lectures, the method of steepest descent can be very slow. We can use conjugate gradient or variable metric methods
• Alternatively, we can add a "momentum term" so that we include some of the previous step:

$$\Delta w_{ik}^{\rm new} = -\eta\,\frac{\partial E}{\partial w_{ik}} + \alpha\,\Delta w_{ik}^{\rm previous}$$

where α is the momentum parameter (typically ~ 0.9, and must be between 0 and 1). This can smooth the approach to the minimum. The best algorithms vary α and η as the minimum of E is approached.
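In code, the momentum term just means retaining the previous weight change (a sketch; dE_dw stands for the computed gradient array):

```python
def momentum_step(dE_dw, dw_prev, eta=0.1, alpha=0.9):
    """New weight change: steepest-descent step plus alpha times the previous step."""
    return -eta * dE_dw + alpha * dw_prev

# Inside a training loop (dw initialized to zeros with the shape of w):
#   dw = momentum_step(dE_dw, dw)
#   w += dw
```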
Applications of ANN

• Underwriting
– Input: information about borrower/insured
– Output: loan/insurance outcome
– Training set: previous experience
• Speech and handwriting recognition
• Financial predictions
– Attempts to "beat" the stock market not successful: consistent with the "efficient market" hypothesis
• Forecasting: weather, solar flares
• Diagnosis/classification: medical, astronomical
Flexible networks
• While the number of input and output nodes is generally fixed by the problem, the number of hidden layers and nodes within them can be varied to reduce the cost function
• 3-layer perceptrons are always sufficient, although the use of more layers may reduce the required number of nodes
Example: Handwriting recognition

• Input: 16 × 16 pixel image: N = 256, with ξ = 1 for pixels where ink is present, –1 otherwise (2^256 ~ 10^77 possible input states)
• Output: 10 nodes: O_k = 1 if the digit is k, and –1 otherwise
• Feed-forward neural net developed by Le Cun et al., with four hidden layers
• Training set: 10^4 images digitized from addresses on actual US mail, 28 × 28 grayscale pixels
• Test set (not used for training): ~ 3000 additional images
• 4635 nodes, 98442 connections
• Performance after 30 adaptation cycles:
– 1.1% error on training set
– 3.4% error on test set
• Can achieve 1% error on the test set if it rejects 5.7% of the characters
Example: photometric redshifts
• Use pattern recognition to derive a hard-to-measure parameter from observations of easy-to-measure parameters
• Sloan Digital Sky Survey provides broad-band photometric data (in 5 bands) for ~ 10^8 objects (mainly galaxies) and spectra for ~ 10^6
– Spectra allow the redshift to be determined unequivocally, providing the distance and allowing the 3-D distribution to be determined
• While only 1% of the objects are observed spectroscopically, "photometric redshifts" can be estimated for the other 99%
– the relative and absolute fluxes in the 5 observed bands are correlated with z
• Collister and Lahav (2004, PASP, 116, 345) used artificial neural networks (e.g. a 3-layer perceptron) to determine photometric redshifts: the "ANNz" program
• CL04 used a 3-layer perceptron with a 5:10:10:1 architecture
• Training set: 10^4 galaxies with spectroscopic redshifts
• Used a "committee" of five independently trained networks
• Cost function modified to prevent blowup of the weights