FUNCTION LEARNING AND NEURAL NETS
SETTING
Learn a function with:
Continuous-valued examples, e.g., the pixels of an image
Continuous-valued output, e.g., the likelihood that the image is a '7'
Known as regression. [Regression can be turned into classification via thresholds]
FUNCTION-LEARNING (REGRESSION) FORMULATION
Goal function f
Training set: (x(i), y(i)), i = 1,…,n, with y(i) = f(x(i))
Inductive inference: find a function h that fits the points well
Same keep-it-simple bias
[Figure: data points plotted as x vs. f(x)]
LEAST-SQUARES FITTING
Hypothesize a class of functions g(x,θ) parameterized by θ
Minimize the squared loss E(θ) = Σi (g(x(i),θ) − y(i))²
[Figure: data points plotted as x vs. f(x)]
LINEAR LEAST-SQUARES
g(x,θ) = x ∙ θ
The value of θ that minimizes E(θ) is:
θ = [Σi x(i) y(i)] / [Σi x(i)²]
Derivation:
E(θ) = Σi (x(i) θ − y(i))²
     = Σi (x(i)² θ² − 2 x(i) y(i) θ + y(i)²)
E′(θ) = 0 ⇒ Σi (2 x(i)² θ − 2 x(i) y(i)) = 0 ⇒ θ = [Σi x(i) y(i)] / [Σi x(i)²]
[Figure: data points with the fitted line g(x,θ)]
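The closed form above can be sketched in a few lines of Python (function and variable names are illustrative, not from the slides):

```python
# A minimal sketch of 1-D linear least squares through the origin:
# theta = (sum_i x_i y_i) / (sum_i x_i^2)

def fit_linear(xs, ys):
    """Fit g(x, theta) = x * theta by minimizing sum_i (x_i*theta - y_i)^2."""
    num = sum(x * y for x, y in zip(xs, ys))  # sum_i x(i) y(i)
    den = sum(x * x for x in xs)              # sum_i x(i)^2
    return num / den

# Points exactly on the line y = 3x recover theta = 3.
theta = fit_linear([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
print(theta)  # -> 3.0
```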
LINEAR LEAST-SQUARES WITH CONSTANT OFFSET
g(x,θ0,θ1) = θ0 + θ1 x
E(θ0,θ1) = Σi (θ0 + θ1 x(i) − y(i))²
         = Σi (θ0² + θ1² x(i)² + y(i)² + 2θ0θ1 x(i) − 2θ0 y(i) − 2θ1 x(i) y(i))
Setting ∂E/∂θ0(θ0*,θ1*) = 0 and ∂E/∂θ1(θ0*,θ1*) = 0:
0 = 2 Σi (θ0* + θ1* x(i) − y(i))
0 = 2 Σi x(i) (θ0* + θ1* x(i) − y(i))
Verify the solution:
θ0* = (1/N) Σi (y(i) − θ1* x(i))
θ1* = [N (Σi x(i) y(i)) − (Σi x(i))(Σi y(i))] / [N (Σi x(i)²) − (Σi x(i))²]
[Figure: data points with the fitted line g(x,θ)]
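A sketch of the closed-form solution for θ0* and θ1* (names illustrative):

```python
# Closed-form fit of g(x) = theta0 + theta1*x using the formulas above.

def fit_affine(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum_i x(i) y(i)
    sxx = sum(x * x for x in xs)               # sum_i x(i)^2
    theta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    theta0 = (sy - theta1 * sx) / n            # = (1/N) sum_i (y_i - theta1*x_i)
    return theta0, theta1

# Points on y = 1 + 2x recover (theta0, theta1) = (1, 2).
print(fit_affine([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # -> (1.0, 2.0)
```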
MULTI-DIMENSIONAL LEAST-SQUARES
Let x include attributes (x1,…,xN)
Let θ include coefficients (θ1,…,θN)
Model: g(x,θ) = x1 θ1 + … + xN θN
[Figure: data points with the fitted function g(x,θ)]
MULTI-DIMENSIONAL LEAST-SQUARES
g(x,θ) = x1 θ1 + … + xN θN
The best θ is given by θ = (AᵀA)⁻¹ Aᵀ b
where A is the matrix with the x(i)'s as rows and b is the vector of the y(i)'s
[Figure: data points with the fitted function g(x,θ)]
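The normal equations θ = (AᵀA)⁻¹ Aᵀ b can be sketched in pure Python with a small Gaussian-elimination solver (in practice one would use a library routine such as numpy's `lstsq`; all names here are illustrative):

```python
# Solve the normal equations (A^T A) theta = A^T b for multi-dimensional
# least squares; rows of A are the x(i)'s, b holds the y(i)'s.

def solve(M, v):
    """Solve M theta = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):
        theta[r] = (M[r][n] - sum(M[r][c] * theta[c]
                                  for c in range(r + 1, n))) / M[r][r]
    return theta

def least_squares(A, b):
    n = len(A[0])
    AtA = [[sum(A[i][r] * A[i][c] for i in range(len(A)))
            for c in range(n)] for r in range(n)]
    Atb = [sum(A[i][r] * b[i] for i in range(len(A))) for r in range(n)]
    return solve(AtA, Atb)

# Data generated by f(x) = 2*x1 + 3*x2 is recovered exactly.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, 3.0, 5.0]
print(least_squares(A, b))  # -> [2.0, 3.0]
```

Forming AᵀA explicitly squares the condition number, which is why library implementations prefer QR or SVD factorizations; the sketch above is only to make the formula concrete.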
NONLINEAR LEAST-SQUARES
E.g., quadratic: g(x,θ) = θ0 + x θ1 + x² θ2
E.g., exponential: g(x,θ) = exp(θ0 + x θ1)
Any combination: g(x,θ) = exp(θ0 + x θ1) + θ2 + x θ3
Fitting can be done using gradient descent
[Figure: data points with linear, quadratic, and other fits]
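As a sketch of gradient-descent fitting for the quadratic model above (step size, iteration count, and names are illustrative choices):

```python
# Fit g(x, theta) = theta0 + theta1*x + theta2*x^2 by gradient descent
# on the squared loss E(theta) = sum_i (g(x_i, theta) - y_i)^2.

def g(x, th):
    return th[0] + th[1] * x + th[2] * x * x

def fit_quadratic(xs, ys, eps=0.01, iters=20000):
    th = [0.0, 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(xs, ys):
            err = g(x, th) - y
            # dg/dtheta_j is the j-th feature (1, x, x^2)
            grad[0] += 2 * err
            grad[1] += 2 * err * x
            grad[2] += 2 * err * x * x
        th = [t - eps * d for t, d in zip(th, grad)]
    return th

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [2.0, 1.0, 2.0, 5.0]   # exactly y = 1 + x^2
print(fit_quadratic(xs, ys))  # approaches [1, 0, 1]
```

Note the quadratic model is still linear in θ, so the loss stays convex and the closed-form normal equations would also apply after expanding each x into the features (1, x, x²).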
GRADIENT DESCENT
g(x,θ) = x1 θ1 + … + xN θN
Error: E(θ) = Σi (g(x(i),θ) − y(i))²
Take the derivative:
dE(θ)/dθ = 2 Σi dg(x(i),θ)/dθ (g(x(i),θ) − y(i))
Since dg(x(i),θ)/dθ = x(i):
dE(θ)/dθ = 2 Σi x(i) (g(x(i),θ) − y(i))
Update rule: θ ← θ − ε Σi x(i) (g(x(i),θ) − y(i))
Convergence to the global minimum is guaranteed (with the step size ε chosen small enough) because E is a convex function
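The batch update above, sketched for the 1-D linear model (step size and iteration count are illustrative):

```python
# Batch gradient descent for g(x, theta) = x*theta: every example
# contributes to the step, matching theta <- theta - eps * sum_i x_i*(g - y_i).

def batch_gd(xs, ys, eps=0.01, iters=1000):
    theta = 0.0
    for _ in range(iters):
        # full-batch gradient: sum over all examples on each step
        step = sum(x * (x * theta - y) for x, y in zip(xs, ys))
        theta -= eps * step
    return theta

# Points on y = 3x: the iterates approach theta = 3.
print(batch_gd([1.0, 2.0, 3.0], [3.0, 6.0, 9.0]))
```

As on the slide, the factor of 2 from the derivative is absorbed into the step size ε.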
STOCHASTIC GRADIENT DESCENT
The prior rule was a batch update: all examples are incorporated in each step, so all prior examples must be stored
Stochastic gradient descent: use a single example on each step
Update rule: pick an example i (either at random or in order) and a step size ε, then
θ ← θ + ε x(i) (y(i) − g(x(i),θ))
This reduces the error on the i'th example… but does it converge?
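A sketch of the stochastic update for the 1-D linear model; the decaying step schedule is an illustrative choice (the convergence condition is discussed later):

```python
import random

# Stochastic gradient descent: one randomly chosen example per step,
# theta <- theta + eps * x(i) * (y(i) - g(x(i), theta)).

def sgd(xs, ys, eps=0.1, steps=5000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for t in range(steps):
        i = rng.randrange(len(xs))               # pick a single example
        err = ys[i] - xs[i] * theta
        theta += (eps / (1 + t * 0.01)) * xs[i] * err   # decaying step size
    return theta

# Noiseless data on y = 3x: every step contracts toward theta = 3.
print(sgd([1.0, 2.0, 3.0], [3.0, 6.0, 9.0]))
```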
PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)
y = g(Σi=1,…,n wi xi)
[Figure: a neuron summing the weighted inputs x1,…,xn through weights wi and a threshold unit g to produce output y; a 2-D plot of + and − examples separated by the line w1 x1 + w2 x2 = 0]
PERCEPTRON (THE GOAL FUNCTION f IS A BOOLEAN ONE)
y = g(Σi=1,…,n wi xi)
[Figure: the same neuron; a 2-D plot of + and − examples for which a separating line is in question]
PERCEPTRON LEARNING RULE
θ ← θ + x(i) (y(i) − g(θᵀ x(i)))
(g outputs either 0 or 1; y is either 0 or 1)
If the output is correct, the weights are unchanged
If g is 0 but y is 1, the weights on the active attributes are increased
If g is 1 but y is 0, the weights on the active attributes are decreased
Converges if the data is linearly separable, but oscillates otherwise
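The rule can be sketched as follows. Augmenting each example with a constant 1 feature so the threshold (bias) is learned as an extra weight is an assumption of this sketch, not stated on the slide:

```python
# Perceptron learning rule: theta <- theta + x * (y - g(theta . x)).

def step(z):
    return 1 if z > 0 else 0

def train_perceptron(examples, epochs=20):
    n = len(examples[0][0])
    theta = [0.0] * (n + 1)                        # +1 for the bias weight
    for _ in range(epochs):
        for x, y in examples:
            xb = list(x) + [1.0]                   # bias feature (assumption)
            g = step(sum(t * v for t, v in zip(theta, xb)))
            # no change when the output is correct (y - g = 0)
            theta = [t + v * (y - g) for t, v in zip(theta, xb)]
    return theta

# Boolean OR is linearly separable, so the rule converges.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
theta = train_perceptron(data)
print([step(sum(t * v for t, v in zip(theta, list(x) + [1.0])))
       for x, _ in data])  # -> [0, 1, 1, 1]
```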
A SINGLE NEURON CAN LEARN
[Figure: a neuron with inputs x1,…,xn, weights wi, threshold unit g, and output y]
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
Majority function
XOR?
NEURAL NETWORK
Network of interconnected neurons
[Figure: two connected neurons, each summing weighted inputs through a threshold unit]
Acyclic (feed-forward) vs. recurrent networks
NETWORKS WITH HIDDEN LAYERS
Can learn XORs and other nonlinear functions
As the number of hidden units increases, so does the network's capacity to learn functions with more nonlinear features
Difficult to characterize which class of functions!
How to train hidden layers?
BACKPROPAGATION (PRINCIPLE)
New example: y(k) = f(x(k))
φ(k) = outcome of the NN with weights w(k−1) for inputs x(k)
Error function: E(k)(w(k−1)) = (φ(k) − y(k))²
wij(k) = wij(k−1) − ε ∂E(k)/∂wij   (i.e., w(k) = w(k−1) − ε ∇E(k))
Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
LEARNING ALGORITHM
Given: many examples (x(1),y(1)),…,(x(N),y(N)) and a learning rate ε
Init: set k = 1 (or k = rand(1,N))
Repeat:
Tweak the weights with a backpropagation update on example (x(k), y(k))
Set k = k+1 (or k = rand(1,N))
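The loop above can be sketched on a tiny 2-3-1 sigmoid network trained on XOR. The architecture, learning rate, epoch count, and all names are illustrative choices, not from the slides:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    def __init__(self, n_in=2, n_hid=3, seed=0):
        rng = random.Random(seed)
        # each row carries a bias as its last weight
        self.W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.W2 = [rng.uniform(-1, 1) for _ in range(n_hid + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.W1]
        out = sigmoid(sum(w * v for w, v in zip(self.W2, h + [1.0])))
        return h, out

    def update(self, x, y, eps=0.5):
        """One backpropagation step on example (x, y) for E = (out - y)^2."""
        xb = list(x) + [1.0]
        h, out = self.forward(x)
        hb = h + [1.0]
        # output layer: dE/dz2 = 2 (out - y) * sigmoid'(z2)
        d2 = 2 * (out - y) * out * (1 - out)
        # hidden layer: propagate d2 back through W2, then through the sigmoid
        d1 = [d2 * self.W2[j] * h[j] * (1 - h[j]) for j in range(len(h))]
        self.W2 = [w - eps * d2 * v for w, v in zip(self.W2, hb)]
        self.W1 = [[w - eps * d1[j] * v for w, v in zip(self.W1[j], xb)]
                   for j in range(len(self.W1))]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
net = TinyNet()
for _ in range(5000):
    for x, y in data:
        net.update(x, y)
print([round(net.forward(x)[1]) for x, _ in data])
```

Note the update visits the last layer's weights first, as the slide describes: d2 is computed before d1, and d1 reuses the pre-update W2.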
UNDERSTANDING BACKPROPAGATION
Backpropagation is an example of stochastic gradient descent
Decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x(k),θ) − y(k))²
On each iteration, take a step to reduce ek
[Figure: plot of E(θ) vs. θ; each step follows the gradient of a single ek (e1, e2, e3, …)]
STOCHASTIC GRADIENT DESCENT
The objective function value (measured over all examples) settles into a local minimum over time
The step size must be reduced over time, e.g., as O(1/t)
COMMENTS AND ISSUES
How to choose the size and structure of networks?
If the network is too large, there is a risk of over-fitting (data caching)
If the network is too small, the representation may not be rich enough
Role of representation: e.g., learning the concept of an odd number
Incremental learning
Low interpretability
PERFORMANCE OF FUNCTION LEARNING
Overfitting: too many parameters
Regularization: penalize large parameter values
Efficient optimization:
If E(θ) is nonconvex, we can only guarantee finding a local minimum
Batch updates are expensive; stochastic updates converge slowly