TRANSCRIPT
Artificial Intelligence
Statistical learning methods
Chapter 20, AIMA 2nd ed. Chapter 18, AIMA 3rd ed.
(only ANNs & SVMs)
Artificial neural networks
The brain is a pretty intelligent system. Can we "copy" it?
There are approx. 10^11 neurons in the brain.
There are approx. 23·10^9 neurons in the male cortex (females have about 15% fewer).
The simple model
• The McCulloch-Pitts model (1943)
Image from Neuroscience: Exploring the Brain by Bear, Connors, and Paradiso

y = g(w0 + w1 x1 + w2 x2 + w3 x3)
[Figure: neuron firing rate as a function of depolarization]
Transfer functions g(z)
• The Heaviside function
• The logistic function
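The two transfer functions can be sketched in a few lines of Python (the function names are my own; the slide only shows the graphs):

```python
import math

def heaviside(z):
    """Hard threshold: 1 for z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def logistic(z):
    """Smooth, differentiable alternative: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

print(heaviside(-0.2), logistic(0.0))  # -> 0.0 0.5
```

The logistic function is the one that matters for gradient-based training later, since it is differentiable everywhere.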
The simple perceptron
With {-1,+1} representation.
Traditionally (early 60s) trained with perceptron learning.

y(x) = sgn(w^T x) = +1 if w^T x >= 0, -1 if w^T x < 0

where w^T x = w0 + w1 x1 + w2 x2.
Perceptron learning
Repeat until no errors are made anymore:
1. Pick a random example [x(n), f(n)], where the desired output is
   f(n) = +1 if x(n) belongs to class A
   f(n) = -1 if x(n) belongs to class B
2. If the classification is correct, i.e. if y(x(n)) = f(n), then do nothing
3. If the classification is wrong, then do the following update to the parameters (η, the learning rate, is a small positive number):
   wi ← wi + η f(n) xi(n)
Example: Perceptron learning
The AND function:

x1 x2 |  f
 0  0 | -1
 0  1 | -1
 1  0 | -1
 1  1 | +1

Initial values: η = 0.3, w = (w0, w1, w2) = (-0.5, 1, 1).
Example: Perceptron learning
The AND function, w = (-0.5, 1, 1).
The example (0,0) is correctly classified: no action.
Example: Perceptron learning
The AND function, w = (-0.5, 1, 1).
The example (0,1) is incorrectly classified: learning action.

w0 ← w0 + η f·1  = -0.5 - 0.3·1 = -0.8
w1 ← w1 + η f x1 = 1 - 0.3·0 = 1
w2 ← w2 + η f x2 = 1 - 0.3·1 = 0.7
Example: Perceptron learning
The AND function, w = (-0.8, 1, 0.7).
The example (1,0) is incorrectly classified: learning action.

w0 ← w0 + η f·1  = -0.8 - 0.3·1 = -1.1
w1 ← w1 + η f x1 = 1 - 0.3·1 = 0.7
w2 ← w2 + η f x2 = 0.7 - 0.3·0 = 0.7
Example: Perceptron learning
The AND function, with the updated weights w = (w0, w1, w2) = (-1.1, 0.7, 0.7) after learning on the example (1,0).
All four examples are now correctly classified: this is the final solution.
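The whole worked example fits in a short Python sketch. The one deviation from the slides is that examples are visited in a fixed cyclic order rather than picked at random; with the initial values above, this happens to reproduce the same sequence of updates:

```python
# Perceptron learning on the AND function with {-1,+1} targets.
eta = 0.3                      # learning rate
w = [-0.5, 1.0, 1.0]           # initial weights [w0, w1, w2]
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]

def classify(w, x):
    s = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if s >= 0 else -1  # sgn with sgn(0) = +1

converged = False
while not converged:
    converged = True
    for x, f in data:
        if classify(w, x) != f:         # wrong -> learning action
            w[0] += eta * f * 1
            w[1] += eta * f * x[0]
            w[2] += eta * f * x[1]
            converged = False

print(w)   # close to [-1.1, 0.7, 0.7], up to float rounding
```

Two updates (on (0,1) and on (1,0)) are enough here, matching the slides' trace.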
Perceptron learning
• Perceptron learning is guaranteed to find a solution in finite time, if a solution exists.
• Perceptron learning cannot be generalized to more complex networks.
• Better to use gradient descent, based on formulating a differentiable error function

E(W) = Σ_{n=1..N} [f(n) - y(n, W)]^2
Gradient search
ΔW = -η ∇_W E(W)

W(k+1) = W(k) + ΔW(k)

"Go downhill": the "learning rate" (η) is set heuristically.

[Figure: error surface E(W) with a downhill step from W(k)]
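The update rule can be sketched on a toy one-dimensional error surface E(W) = (W - 3)^2 (the surface and the starting point are my own choices for illustration):

```python
# Gradient descent: W(k+1) = W(k) + dW,  dW = -eta * dE/dW.
eta = 0.1                      # learning rate, set heuristically
W = 0.0                        # starting point W(0)
for k in range(100):
    grad = 2.0 * (W - 3.0)     # dE/dW for E(W) = (W - 3)^2
    W = W + (-eta * grad)      # "go downhill"
print(W)                       # converges to the minimum at W = 3
```

Too large an η makes the iterates overshoot and diverge; too small an η makes convergence slow, which is why the rate is tuned heuristically.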
The Multilayer Perceptron (MLP)
• Combine several single layer perceptrons.
• Each single layer perceptron uses a sigmoid transfer function, e.g.

σ(z) = 1 / (1 + exp(-z))    or    σ(z) = tanh(z)

[Figure: network with inputs x_k, hidden units h_j, h_i, and outputs y_l]

Can be trained using gradient descent.
Example: One hidden layer
• Can approximate any continuous function
φ(z) = sigmoid or linear, σ(z) = sigmoid.

h_j(x) = σ( w_j0 + Σ_{k=1..D} w_jk x_k )
y_i(x) = φ( v_i0 + Σ_{j=1..J} v_ij h_j(x) )
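The two-equation forward pass translates directly into Python (the tiny network weights below are made up for illustration; the output transfer φ defaults to linear):

```python
import math

def sigma(z):                          # hidden-layer sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w, v, phi=lambda z: z):
    """One-hidden-layer MLP:
       h_j(x) = sigma(w_j0 + sum_k w_jk x_k)
       y_i(x) = phi(v_i0 + sum_j v_ij h_j(x))
    w[j] = [w_j0, w_j1, ..., w_jD];  v[i] = [v_i0, v_i1, ..., v_iJ]."""
    h = [sigma(wj[0] + sum(wjk * xk for wjk, xk in zip(wj[1:], x)))
         for wj in w]
    return [phi(vi[0] + sum(vij * hj for vij, hj in zip(vi[1:], h)))
            for vi in v]

# Illustrative network: 2 inputs, 2 hidden units, 1 linear output.
w = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5]]
v = [[0.1, 1.0, 1.0]]
print(mlp_forward([1.0, 2.0], w, v))   # -> [1.1] (sigma(-1) + sigma(1) = 1)
```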
Training: Backpropagation (gradient descent)

Δv_ij(t) = η Σ_{n=1..N} δ_i(n,t) h_j[x(n), w(t)]
Δw_jk(t) = η Σ_{n=1..N} δ_j(n,t) x_k(n)

where

δ_i(n,t) = f_i(n) - y_i(n,t)
δ_j(n,t) = σ'(·) Σ_i δ_i(n,t) v_ij(t)

with σ'(·) evaluated at the input of hidden unit j. Each step has the form Δ(t) = -η ∇E(t), i.e. a step down the error gradient.
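A minimal batch-backpropagation sketch for the one-hidden-layer network above, with a linear output unit. The toy data set (samples of a sine curve) and all hyperparameters are my own choices for illustration:

```python
import math, random

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# deltas as on the slide:
#   delta_i = f_i(n) - y_i(n)                     (output layer)
#   delta_j = sigma'(a_j) * sum_i delta_i v_ij    (hidden layer)
random.seed(0)
D, J = 1, 3                                     # inputs, hidden units
w = [[random.uniform(-1, 1) for _ in range(D + 1)] for _ in range(J)]
v = [random.uniform(-1, 1) for _ in range(J + 1)]
data = [([x / 10.0], math.sin(x / 10.0 * math.pi)) for x in range(11)]
eta = 0.01

def forward(x):
    h = [sigma(wj[0] + sum(a * b for a, b in zip(wj[1:], x))) for wj in w]
    return h, v[0] + sum(a * b for a, b in zip(v[1:], h))

def error():
    return sum((f - forward(x)[1]) ** 2 for x, f in data)

E0 = error()
for epoch in range(3000):
    dv = [0.0] * (J + 1)
    dw = [[0.0] * (D + 1) for _ in range(J)]
    for x, f in data:                            # accumulate batch updates
        h, y = forward(x)
        d_out = f - y                            # delta_i
        dv[0] += eta * d_out
        for j in range(J):
            dv[j + 1] += eta * d_out * h[j]
            d_hid = h[j] * (1 - h[j]) * d_out * v[j + 1]   # delta_j
            dw[j][0] += eta * d_hid
            for k in range(D):
                dw[j][k + 1] += eta * d_hid * x[k]
    v = [a + b for a, b in zip(v, dv)]
    w = [[a + b for a, b in zip(wj, dwj)] for wj, dwj in zip(w, dw)]
print(round(E0, 3), "->", round(error(), 3))     # training error decreases
```

Note that σ'(z) = σ(z)(1 - σ(z)) lets the hidden delta reuse the already-computed activation h_j.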
Support vector machines
Linear classifier on a linearly separable problem

There are infinitely many lines that have zero training error. Which line should we choose?

Choose the line with the largest margin: the "large margin classifier".
Linear classifier on a linearly separable problem
"Support vectors": the examples closest to the separating plane.
The plane separating the two classes is defined by w^T x = a.
The dashed planes (through the support vectors) are given by w^T x = a ± b.
Computing the margin
w^T x = a          (separating plane)
w^T x = a ± b      (dashed planes)

Divide by b: w^T x / b = a/b ± 1. Define new w = w/b and θ = a/b, which gives

w^T x - θ = ±1

for the dashed planes. We have thereby defined a scale for w and θ.
Computing the margin

Let x be a point on the lower dashed plane, w^T x - θ = -1, and let x + λw be the corresponding point on the upper one, w^T (x + λw) - θ = +1. Subtracting the two equations gives λ w^T w = 2, so

margin = λ ‖w‖ = 2 / ‖w‖
Maximizing the margin is equal to minimizing ‖w‖, subject to the constraints

w^T x(n) - θ ≥ +1 for all x(n) with y(n) = +1
w^T x(n) - θ ≤ -1 for all x(n) with y(n) = -1

Quadratic programming problem; the constraints can be included with Lagrange multipliers.
Linear classifier on a linearly separable problem
Minimize the cost (Lagrangian)

L_p = (1/2) w^T w - Σ_{n=1..N} λ(n) [ y(n)(w^T x(n) - θ) - 1 ]

The minimum of L_p occurs at the maximum of (the Wolfe dual)

L_D = Σ_{n=1..N} λ(n) - (1/2) Σ_{n=1..N} Σ_{m=1..N} λ(n) λ(m) y(n) y(m) x^T(n) x(m)

Quadratic programming problem. Only scalar products x^T(n) x(m) appear in the cost. IMPORTANT!
Linear Support Vector Machine

Test phase: the predicted output is

ŷ(x) = sgn(w^T x - θ) = sgn( Σ_{n ∈ sv} λ(n) y(n) x^T(n) x - θ )

where θ is determined e.g. by looking at one of the support vectors. Still only scalar products in the expression.
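The test-phase formula can be sketched in a few lines. The support vectors, their multipliers λ(n), and θ below are a hand-worked toy solution: for the two points (1,0) with y = +1 and (-1,0) with y = -1, the maximum-margin solution has λ = 0.5 for both support vectors and θ = 0 (giving w = (1, 0)):

```python
# y_hat(x) = sgn( sum_{n in sv} lambda(n) y(n) x(n)^T x  -  theta )
sv = [((1.0, 0.0), +1, 0.5),    # (x(n), y(n), lambda(n))
      ((-1.0, 0.0), -1, 0.5)]
theta = 0.0

def predict(x):
    s = sum(lam * y * (xn[0] * x[0] + xn[1] * x[1]) for xn, y, lam in sv)
    return 1 if s - theta >= 0 else -1

print(predict((2.0, 1.0)), predict((-0.5, 3.0)))   # -> 1 -1
```

Only scalar products x(n)^T x appear, which is what makes the kernel substitution on the next slides possible.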
How to deal with the nonlinear case?
• Project the data into a high-dimensional space Z, where we know it will be linearly separable (due to the VC dimension of the linear classifier): z(n) = φ[x(n)]
• We don't even have to know the projection...!

Scalar product kernel trick: if we can find a kernel K such that

K(x(n), x(m)) = φ^T(x(n)) φ(x(m))

then we don't even have to know the mapping φ to solve the problem...
Valid kernels (Mercer’s theorem)
Define the N×N matrix

K = | K[x(1),x(1)]  K[x(1),x(2)]  ...  K[x(1),x(N)] |
    | K[x(2),x(1)]  K[x(2),x(2)]  ...  K[x(2),x(N)] |
    |     ...                                       |
    | K[x(N),x(1)]  K[x(N),x(2)]  ...  K[x(N),x(N)] |

If K is symmetric, K = K^T, and positive semi-definite, then K[x(i),x(j)] is a valid kernel.
Examples of kernels
K[x(i), x(j)] = exp( -‖x(i) - x(j)‖² / 2σ² )    (Gaussian kernel)
K[x(i), x(j)] = ( x^T(i) x(j) )^d               (polynomial kernel)

With d = 1 we have the linear SVM.
The linear SVM is often used with good success on high-dimensional data (e.g. text classification).
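Both kernels are one-liners in Python (σ and d are hyperparameters; the test values below are arbitrary):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # exp( -||x - y||^2 / (2 sigma^2) )
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y))
                    / (2 * sigma ** 2))

def polynomial_kernel(x, y, d=2):
    # ( x^T y )^d ; with d = 1 this is the plain scalar product (linear SVM)
    return sum(a * b for a, b in zip(x, y)) ** d

print(polynomial_kernel((1.0, 2.0), (3.0, 1.0), d=1))   # -> 5.0
print(gaussian_kernel((1.0, 0.0), (1.0, 0.0)))          # -> 1.0
```

Both are symmetric in x and y, as Mercer's theorem requires.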
Example: Robot color vision (Competition 1999)
Classify the Lego pieces into red, blue, and yellow. Classify white balls, black sideboard, and green carpet.
What the camera sees (RGB space)
[Figure: pixel clusters for yellow, green, and red in RGB space]
Mapping RGB (3D) to rgb (2D)
r = R / (R + G + B)
g = G / (R + G + B)
b = B / (R + G + B)
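The normalization is a one-line mapping; dividing each channel by the total intensity R + G + B removes brightness, so two components suffice (the numeric example is my own):

```python
def rgb_normalize(R, G, B):
    """Map RGB (3D) to normalized rgb: each channel divided by R+G+B."""
    s = R + G + B
    return (R / s, G / s, B / s)   # r + g + b == 1, so 2 numbers suffice

r, g, b = rgb_normalize(200, 100, 100)
print(round(r, 2), round(g, 2), round(b, 2))   # -> 0.5 0.25 0.25
```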
Lego in normalized rgb space
Input is 2D: x ∈ X ⊂ R² (normalized rgb), components x1, x2.
Output is 6D: c ∈ C = {red, blue, yellow, green, black, white}.
MLP classifier
E_train = 0.21%, E_test = 0.24%
2-3-1 MLP, trained with Levenberg-Marquardt
Training time (150 epochs): 51 seconds
SVM classifier
E_train = 0.19%, E_test = 0.20%
SVM with Gaussian kernel K(x, y) = exp( -σ ‖x - y‖² ), σ = 1000
Training time: 22 seconds
Machine Learning
• Machine learning (multilayer perceptrons, support vector machines, clustering) is covered in great detail in the course ”Learning Systems”.