-
CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU
-
Instance-based Learning
-
3
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor (see the sketch below).
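As an illustration of these four choices (a minimal sketch, not code from the lecture), a 1-NN predictor with a Euclidean metric can be written as:

    import numpy as np

    def one_nn_predict(X_train, y_train, x_query):
        # Distance from the query to every stored training example (Euclidean metric)
        dists = np.linalg.norm(X_train - x_query, axis=1)
        # Predict the same output as the single nearest neighbor
        return y_train[np.argmin(dists)]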
-
4
Consistency of 1-NN
Consider an estimator f_n trained on n examples, e.g., 1-NN, regression, ...
The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, consistent if the error of f_n goes to zero as n goes to infinity.
Regression is not consistent! (Representation bias)
1-NN is consistent (under some mild fine print)
What about variance???
-
5
1-NN overfits?
-
6
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors (see the sketch below).
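Again as an illustrative sketch (not the lecture's code), the only change from 1-NN is averaging the outputs of the k closest points:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=9):
        # Unweighted k-NN with a Euclidean metric
        dists = np.linalg.norm(X_train - x_query, axis=1)
        nearest = np.argsort(dists)[:k]    # indices of the k closest training examples
        return y_train[nearest].mean()     # predict the plain average, as on the slide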
-
7
k-Nearest Neighbor (here k=9)
k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
-
8
Curse of dimensionality for instance-based learning
Must store and retrieve all data! Most real work is done during testing: for every test sample, we must search through the whole dataset, which is very slow! There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ... (see the sketch below)
Instance-based learning is often poor with noisy or irrelevant features
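As one concrete example of such a fast method (a sketch, not from the lecture), a k-d tree built once over the training set replaces the linear scan at test time; note that tree-based search itself degrades as the dimension grows, which is part of the curse of dimensionality:

    import numpy as np
    from scipy.spatial import cKDTree

    X_train = np.random.rand(100_000, 3)   # stored training inputs (low-dimensional here)
    tree = cKDTree(X_train)                # build the tree once, up front

    x_test = np.random.rand(3)
    dist, idx = tree.query(x_test, k=1)    # nearest neighbor without scanning every point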
-
Support Vector Machines
-
10
w.x = Σ_j w^(j) x^(j)
Linear classifiers: Which line is better?
Data: example i is (x_i, y_i), with label y_i in {-1, +1}
-
11
Pick the one with the largest margin!
w.x = Σ_j w^(j) x^(j)
w.x + b = 0
-
12
Maximize the margin
w.x + b = 0
-
13
But there are many such planes
w.x + b = 0
-
14
w.x + b = 0
Review: Normal to a plane
-
15
Normalized margin: canonical hyperplanes
[Figure: separating hyperplane w.x + b = 0 with canonical hyperplanes w.x + b = +1 and w.x + b = -1 through the closest points x+ and x-; the margin between them is 2γ]
-
16
Normalized margin: canonical hyperplanes
[Figure repeated: hyperplanes w.x + b = +1, 0, -1; closest points x+ and x-; margin 2γ]
-
17
Margin maximiza*on using canonical hyperplanes
[Figure: canonical hyperplanes w.x + b = +1, 0, -1; margin 2γ]
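With the canonical hyperplanes fixed at w.x + b = ±1, margin maximization is usually written as the following quadratic program (standard formulation, reconstructed here because the slide's equations did not survive extraction):

    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
    \quad \text{subject to} \quad
    y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n

Maximizing the margin 2γ = 2/||w|| is the same as minimizing w.w.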
-
18
Support vector machines (SVMs)
[Figure: canonical hyperplanes w.x + b = +1, 0, -1; margin 2γ]
Solve efficiently by quadratic programming (QP): well-studied solution algorithms
Hyperplane defined by the support vectors
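As a quick way to see the support vectors in practice (a scikit-learn sketch, not part of the lecture; the toy data is made up, and a large C approximates the hard-margin QP):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0., 0.], [1., 1.], [2., 2.5], [3., 3.]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # QP solved internally
    print(clf.support_vectors_)                  # the hyperplane is defined by these points
    print(clf.coef_, clf.intercept_)             # w and b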
-
19
What if the data is not linearly separable?
Use features of features of features of features.
-
20
What if the data is still not linearly separable?
Minimize w.w and the number of training mistakes. Trade off the two criteria?
Trade off #(mistakes) and w.w: 0/1 loss with slack penalty C. Not QP anymore! Also doesn't distinguish near misses from really bad mistakes
-
21
Slack variables: hinge loss
If margin ≥ 1, don't care. If margin < 1, pay a linear penalty.
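Written out (standard soft-margin SVM; the slide's own formulas were lost in extraction), the slack-variable program is

    \min_{\mathbf{w},\,b,\,\xi}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_i \xi_i
    \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

which is equivalent to minimizing \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_i \max\{0,\ 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\}. The max term is the hinge loss: zero when the margin is at least 1, linear in the violation otherwise.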
-
22
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss. Logistic regression: log loss.
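For reference (standard forms, not recovered from the slide), the two losses on an example (x, y) with y in {-1, +1} are

    \text{hinge (SVM):}\quad \max\{0,\ 1 - y\,(\mathbf{w}\cdot\mathbf{x} + b)\}
    \qquad
    \text{log loss (logistic regression):}\quad \log\bigl(1 + e^{-y\,(\mathbf{w}\cdot\mathbf{x} + b)}\bigr)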
-
23
Constrained optimization
-
24
Lagrange multipliers: dual variables
Moving the constraint into the objective function. Lagrangian:
Solve:
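In its general form (standard material, added because the slide's equations were lost): for min_x f(x) subject to g_i(x) ≤ 0, introduce a dual variable α_i ≥ 0 per constraint and move the constraints into the objective,

    L(x, \alpha) = f(x) + \sum_i \alpha_i\, g_i(x)

then solve \min_x \max_{\alpha \ge 0} L(x, \alpha); under suitable conditions this equals the dual \max_{\alpha \ge 0} \min_x L(x, \alpha).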
-
25
Lagrange multipliers: dual variables
Solving:
-
26
Dual SVM derivation (1): the linearly separable case
-
27
Dual SVM derivation (2): the linearly separable case
-
28
Dual SVM interpretation
w.x + b = 0
-
29
Dual SVM formulation: the linearly separable case
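The resulting dual for the separable case (standard formulation, reconstructed because the slide's math did not survive extraction):

    \max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
    \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0

with w = Σ_i α_i y_i x_i; only the examples with α_i > 0 (the support vectors) contribute.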
-
30
Dual SVM derivation: the non-separable case
-
31
Dual SVM formulation: the non-separable case
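With slack penalty C the dual keeps exactly the same objective; the only change (standard result, stated here for completeness) is a box constraint on the multipliers:

    0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0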
-
32
Why did we learn about the dual SVM?
There are some quadratic programming algorithms that can solve the dual faster than the primal
But, more importantly, the kernel trick!!! Another little detour
-
33
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features.
Feature space can get really large really quickly!
-
34
Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]
m = number of input features, d = degree of polynomial
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
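That figure can be checked directly: the number of monomials of degree exactly d in m variables is C(m + d - 1, d) (a quick check, not from the slides):

    from math import comb

    m, d = 100, 6
    print(comb(m + d - 1, d))   # 1,609,344,100 -- about 1.6 billion degree-6 monomials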
-
35
Dual formulation only depends on dot products, not on w!
-
36
Dot-product of polynomials
-
37
Finally: the kernel trick!
Never represent features explicitly; compute dot products in closed form
Constant-time high-dimensional dot products for many classes of features
Very interesting theory: Reproducing Kernel Hilbert Spaces
-
38
Polynomial kernels
All monomials of degree d in O(d) operations:
How about all monomials of degree up to d? Solution 0:
Better solution:
-
39
Common kernels
Polynomials of degree d
Polynomials of degree up to d
Gaussian kernels
Sigmoid
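Written out (standard definitions; the slide's formulas were lost in extraction, and the sigmoid parameters below are generic placeholders):

    \text{polynomial of degree } d:\qquad K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v})^d
    \text{polynomial of degree up to } d:\qquad K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v} + 1)^d
    \text{Gaussian (RBF):}\qquad K(\mathbf{u},\mathbf{v}) = \exp\!\bigl(-\|\mathbf{u}-\mathbf{v}\|^2 / 2\sigma^2\bigr)
    \text{sigmoid:}\qquad K(\mathbf{u},\mathbf{v}) = \tanh(\eta\,\mathbf{u}\cdot\mathbf{v} + \nu)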
-
40
Overfitting?
Huge feature space with kernels: what about overfitting??? Maximizing the margin leads to a sparse set of support vectors
Some interesting theory says that SVMs search for a simple hypothesis with a large margin
Often robust to overfitting
-
41
What about at classification time? For a new input x, if we need to represent φ(x), we are in trouble!
Recall the classifier: sign(w.φ(x) + b). Using kernels we are cool!
-
42
SVMs with kernels: Choose a set of features and a kernel function. Solve the dual problem to obtain the support vector weights α_i
At classification time, compute:
Classify as
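A sketch of that classification rule (illustrative code, not from the lecture; the dual weights alpha, labels y, support vectors, and bias b are assumed to come from the solved dual, and the Gaussian kernel is just one choice):

    import numpy as np

    def rbf_kernel(u, v, sigma=1.0):
        # K(u, v) = exp(-||u - v||^2 / (2 sigma^2)): a feature-space dot product in closed form
        return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

    def svm_predict(x, support_vectors, y_sv, alpha_sv, b, kernel=rbf_kernel):
        # sign( sum_i alpha_i y_i K(x_i, x) + b ) -- phi(x) is never formed explicitly
        score = sum(a * y * kernel(sv, x) for a, y, sv in zip(alpha_sv, y_sv, support_vectors))
        return np.sign(score + b)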
-
43
What's the difference between SVMs and Logistic Regression?
                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          No
-
44
Kernels in logistic regression
Define the weights in terms of the support vectors:
Derive a simple gradient descent rule on α_i
-
45
What's the difference between SVMs and Logistic Regression? (Revisited)
                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       Margin        Real probabilities