Data Mining & Machine Learning
TRANSCRIPT
CS37300, Purdue University
October 6, 2017
Extra Credit Competition Update
So far…
So far we have reviewed the Naive Bayes classifier and the decision tree…
We now embark on a quest to find other classifiers
Classifiers for today
• Nearest neighbors
• Linear Regression
• Support vector machines
• Logistic Regression (1-layer neural network)
Classification Task (with C classes)
! Data representation:
◦ Training set: paired attribute vectors and class labels (yi, xi), yi ∈ 𝕐, xi ∈ ℝd, d > 0, for some set 𝕐, or an n × p table with one class label (y) and p − 1 attributes (x)
! Knowledge representation: a function f(x; θ) = y, parameterized by θ
! Model space: all values of θ where f(x; θ) ∈ 𝕐
◦ Construct a model that approximates the mapping between x and y
● Classification: if y is categorical (e.g., {yes, no}, {dog, cat, elephant})
● Regression: if y is real-valued (e.g., stock prices)
Binary classification
! In its simplest form, a classification model defines a decision boundary (h) and labels for each side of the boundary
! Input: x={x1,x2,...,xn} is a set of attributes, function f assigns a label y to input x, where y is a discrete variable with a finite number of values
Nearest Neighbors
Nearest neighbor
• Instance-based method
• Learning
• Stores training data and delays processing until a new instance must be classified
• Assumes that all points are represented in p-dimensional space
• Prediction
• Nearest neighbors are calculated using Euclidean distance
• Classification is made based on class labels of neighbors
1NN
• Training set: (x1,y1), (x2,y2), ..., (xn,yn), where xi = [xi1, xi2, …, xip] is a feature vector of p continuous attributes and yi is a discrete class label
• 1NN algorithm. To predict a class label for a new instance j: find the training instance xi such that d(xi, xj) is minimized, and let f(xj) = yi
• Key idea: find instances that are “similar” to the new instance and use their class labels to make a prediction for the new instance
• 1NN generalizes to kNN when more neighbors are considered
kNN model: decision boundaries
Source: http://cs231n.github.io/classification/
kNN
• kNN algorithm. To predict a class label for a new instance j: find the k nearest neighbors of j, i.e., those that minimize d(xk, xj), and let f(xj) = g(yk), e.g., the majority label among the yk
• Algorithm choices
• How many neighbors to consider (i.e., choice of k)? Usually a small value is used, e.g., k < 10
• What distance measure d( ) to use? The Euclidean (L2) norm distance is often used
• What function g( ) to combine the neighbors’ labels into a prediction? Majority vote is often used
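The three choices above (k, d, and g) can be sketched in a few lines of Python; the function names and toy data here are illustrative, not from the lecture.

```python
# A minimal kNN sketch (pure Python); names and data are illustrative.
from collections import Counter
from math import sqrt

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x_new, k=3, d=euclidean):
    # train: list of (feature_vector, label) pairs.
    # Find the k nearest neighbors of x_new under distance d.
    neighbors = sorted(train, key=lambda pair: d(pair[0], x_new))[:k]
    # g(): combine the neighbors' labels by majority vote
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

train = [([0.0, 0.0], "o"), ([0.1, 0.2], "o"), ([1.0, 1.0], "x"),
         ([0.9, 1.1], "x"), ([1.2, 0.8], "x")]
print(knn_predict(train, [1.0, 0.9], k=3))  # majority of the 3 nearest
```

Note that the entire "model" is the stored training set: all the work happens at prediction time, which is why the lecture calls this an instance-based method.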
1NN decision boundary
• For each training example i, we can calculate its Voronoi cell, which corresponds to the space of points for which i is their nearest neighbor
• All points in such a Voronoi cell are labeled by the class of the training point, forming a Voronoi tessellation of the feature space
Source: http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551-Spring2008/slides/cs551_nonbayesian1.pdf
Nearest neighbor
• Strengths:
• Simple model, easy to implement
• Very efficient learning: O(1)
• Weaknesses:
• Inefficient inference: time and space O(n)
• Curse of dimensionality:
• As the number of features increases, you need an exponential increase in the size of the data to ensure that you have nearby examples for any given data point
- See python notebook on sphere volume
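The sphere-volume effect the notebook refers to can be reproduced with a short calculation: the fraction of the cube [−1, 1]^d occupied by the inscribed unit ball collapses toward zero as d grows, so uniformly scattered points are almost never "near" a query point.

```python
# Curse of dimensionality: the unit ball occupies a vanishing fraction
# of the enclosing cube [-1, 1]^d as the dimension d grows.
from math import pi, gamma

def unit_ball_volume(d):
    # Volume of the d-dimensional ball of radius 1: pi^(d/2) / Gamma(d/2 + 1)
    return pi ** (d / 2) / gamma(d / 2 + 1)

for d in (2, 5, 10, 20):
    fraction = unit_ball_volume(d) / 2 ** d  # ball volume / cube volume
    print(d, fraction)
```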
k-NN learning
• Parameters of the model:
• k (number of neighbors)
• any parameters of distance measure (e.g., weights on features)
• Model space
• Possible tessellations of the feature space
• Search algorithm
• Implicit search: choice of k, d, and g uniquely define a tessellation
• Score function
• Majority vote minimizes the misclassification rate
Least Squares Classifier
! Given x, the features of a car (length, width, mpg, maximum speed, …)
! Classify cars into categories based on x
Motivation
Least Squares Classifier. Two classes of cars:
! xi is a real-valued vector (the features of car i), for i = 1, …, n
! yi is the class of car i:
  yi = +1 if car i is economy, −1 if car i is luxury
! Find linear discriminant weights w: f(x) = wTx + b
! Score function: the least squares error
  score = ∑_{i=1}^{n} (yi − f(xi))²
! Search: find w, b that minimize the score
[Figure: cars plotted by length against maximum speed, with the linear decision boundary f(x) = 0 separating the regions f(x) > 0 and f(x) < 0; the signed distance from a point x to the boundary is f(x)/∥w∥, and the boundary’s offset from the origin is −b/∥w∥.]
Least Squares Solution
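A minimal sketch of the least squares solution in one feature dimension, with labels coded +1/−1 as on the previous slide; the closed-form simple-regression formulas below are standard, but the car data are made up.

```python
# Least squares classifier, one feature: fit f(x) = w*x + b by minimizing
# score = sum_i (y_i - f(x_i))^2, then classify by the sign of f(x).
def fit_least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Closed form: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

def classify(x, w, b):
    # decision boundary at f(x) = w*x + b = 0
    return 1 if w * x + b > 0 else -1

# e.g. car max speed: economy (+1) cars are slow, luxury (-1) cars are fast
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1]
w, b = fit_least_squares(xs, ys)
print(classify(2.0, w, b), classify(8.0, w, b))  # → 1 -1
```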
Issues with Least Squares Classification
[Figure: decision boundaries h and h + ε.]
• score = ∑_{i=1}^{n} (yi − f(xi))²
• The squared-error score cares too much about well-classified items: examples far on the correct side of the boundary still incur a large penalty and can pull the boundary toward them
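A quick numeric illustration of this issue: with labels coded ±1, squared error penalizes a confidently correct prediction more than a mild misclassification.

```python
# Squared error on a single example with target label y = +1.
def squared_error(y, f_x):
    return (y - f_x) ** 2

print(squared_error(1, 1.0))   # exactly on target: 0.0
print(squared_error(1, 5.0))   # confidently correct, yet penalty 16.0
print(squared_error(1, -0.5))  # misclassified, penalty only 2.25
```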
Neural networks
• Analogous to biological systems
• Massive parallelism is computationally efficient
• First learning algorithm in 1959 (Rosenblatt)
• Perceptron learning rule
• Provide target outputs with inputs for a single neuron
• Incrementally update weights to learn to produce outputs
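The perceptron learning rule described above can be sketched as follows; the toy data and parameter values are made up for illustration.

```python
# Perceptron learning rule: for each (x, target) pair, nudge the weights
# only when the current prediction is wrong.
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, epochs=100, lr=0.1):
    # data: list of (feature_vector, target in {-1, +1}); w[0] is the bias
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, t in data:
            y = sign(w[0] + w[1] * x[0] + w[2] * x[1])
            if y != t:  # incremental update on mistakes only
                w[0] += lr * t
                w[1] += lr * t * x[0]
                w[2] += lr * t * x[1]
    return w

# linearly separable toy data (AND-like)
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(data)
print([sign(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in data])  # → [-1, -1, -1, 1]
```

Because the toy data are linearly separable, the perceptron convergence theorem guarantees the loop stops making mistakes within a finite number of updates.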
Neuron
A Neuron
[Figure: an input vector x = (x0, x1, …, xn) is combined with a weight vector w = (w0, w1, …, wn) in a weighted sum, a bias µk is added, and an activation function f produces the output y.]
For example: y = sign(∑_{i=0}^{n} wi xi + µk)
Multi-Layer Perceptron
[Figure: input nodes (input vector xi) connect through weights wij to hidden nodes, which connect to output nodes producing the output vector.]
• Net input to unit j: Ij = ∑_i wij Oi + θj
• Output of unit j (sigmoid): Oj = 1 / (1 + e^{−Ij})
• Error at an output unit j: Errj = Oj (1 − Oj)(Tj − Oj)
• Error at a hidden unit j: Errj = Oj (1 − Oj) ∑_k Errk wjk
• Weight update (learning rate l): wij = wij + (l) Errj Oi
• Bias update: θj = θj + (l) Errj

Activation function: f(x) = wT x + b, with parameters w and b (bias)
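The update rules above can be sketched for a single forward/backward step with one hidden unit feeding one output unit; the weights, inputs, and learning rate below are made-up numbers, not from the lecture.

```python
# One step of the multi-layer perceptron updates: forward pass
# (I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)), then backward pass.
from math import exp

def sigmoid(I):
    return 1.0 / (1.0 + exp(-I))

# forward pass
x = [1.0, 0.5]                      # input vector (the O_i of the input layer)
w_hidden = [0.4, -0.2]; th_h = 0.1  # weights and bias into the hidden unit
O_h = sigmoid(w_hidden[0] * x[0] + w_hidden[1] * x[1] + th_h)
w_out = 0.3; th_o = -0.1            # weight and bias into the output unit
O_o = sigmoid(w_out * O_h + th_o)

# backward pass with target T and learning rate l
T, l = 1.0, 0.5
err_o = O_o * (1 - O_o) * (T - O_o)      # error at the output unit
err_h = O_h * (1 - O_h) * err_o * w_out  # error at the hidden unit (sum over k)
w_out += l * err_o * O_h                 # w_jk = w_jk + (l) Err_k O_j
th_o += l * err_o                        # theta_k = theta_k + (l) Err_k
w_hidden = [wi + l * err_h * xi for wi, xi in zip(w_hidden, x)]
print(O_o, err_o)
```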
Logistic regression
! Task: binary classification
! Data representation: n observations of attributes xi ∈ ℝp and label y ∈ {0,1}
! Knowledge representation: two classes (y = 0, y = 1), where
  P(Y = 1 | X = x) = σ(wT x + b)
! Score function is the negative log-likelihood:
  score = −log P({xi, yi}_{i=1}^{n} | w, b)
        = −∑_{i=1}^{n} [1(yi = 1) log σ(wTxi + b) + 1(yi = 0) log(1 − σ(wTxi + b))]
! Search: find w, b that minimize the score
Logistic function
Logistic (neuron) activation (non-linear filter)
• If the input is x, the output will look like a probability:
  p(y = 1 | x; w) = σ(wT x + b)
• We will represent the logistic function with the symbol σ
• 1(a = b) is one if a = b, zero otherwise
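One way to carry out the search step is plain gradient descent on the negative log-likelihood; the slides do not fix a particular optimizer, so this is just one standard choice, with made-up one-dimensional data.

```python
# Logistic regression by gradient descent on the negative log-likelihood.
from math import exp

def sig(z):
    return 1.0 / (1.0 + exp(-z))

# toy 1-D data: small x -> class 0, large x -> class 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    # gradient of the NLL: sum_i (sigma(w*x_i + b) - y_i) * x_i  (and * 1 for b)
    gw = sum((sig(w * x + b) - y) * x for x, y in zip(xs, ys))
    gb = sum((sig(w * x + b) - y) for x, y in zip(xs, ys))
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

print(sig(w * 1.0 + b), sig(w * 4.0 + b))  # low prob. at x=1, high at x=4
```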
How to Deal with Multiple Classes?
• How to classify objects into multiple types?
Naïve Approach: one vs. many Classification
y(1)c = +1 if car c is “small”, −1 if car c is “luxury”
y(2)c = +1 if car c is “small”, −1 if car c is “medium”
y(3)c = +1 if car c is “medium”, −1 if car c is “luxury”
Might work OK in some scenarios… but it is not clear in this case
Issue with using binary classifiers for K classes
[Figure 4.2: Attempting to construct a K class discriminant from a set of two class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class Ck from points not in class Ck. On the right is an example involving three discriminant functions, each of which is used to separate a pair of classes Ck and Cj.]
…example involving three classes where this approach leads to regions of input space that are ambiguously classified.
An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.
We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form
yk(x) = wkT x + wk0    (4.9)
and then assigning a point x to class Ck if yk(x) > yj(x) for all j ≠ k. The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by
(wk − wj)T x + (wk0 − wj0) = 0.    (4.10)
This has the same form as the decision boundary for the two-class case discussed in Section 4.1.1, and so analogous geometrical properties apply.
The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points xA and xB, both of which lie inside decision region Rk, as illustrated in Figure 4.3. Any point x̂ that lies on the line connecting xA and xB can be expressed in the form
x̂ = λxA + (1 − λ)xB    (4.11)
[Figure: cars labeled small, medium, and luxury, with an uncertain classification region. Figure: C. Bishop]
We will revisit multi-class classification when we see neural networks
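The single K-class discriminant described above amounts to an argmax over K linear scores yk(x) = wkT x + wk0; here is a sketch with made-up weights for the three car classes.

```python
# K-class linear discriminant: compute one linear score per class and
# assign x to the class with the largest score (weights are made up).
def linear_scores(x, W, w0):
    # y_k(x) = w_k . x + w_k0, one score per class k
    return [sum(wk_i * x_i for wk_i, x_i in zip(wk, x)) + b
            for wk, b in zip(W, w0)]

W = [[1.0, 0.0],    # class 0: "small"
     [0.0, 1.0],    # class 1: "medium"
     [-1.0, -1.0]]  # class 2: "luxury"
w0 = [0.0, 0.0, 0.5]

def predict_class(x):
    scores = linear_scores(x, W, w0)
    return scores.index(max(scores))  # argmax_k y_k(x)

print(predict_class([2.0, 0.0]), predict_class([0.0, 2.0]),
      predict_class([-2.0, -2.0]))  # → 0 1 2
```

Because the decision is a single argmax, no region of the input space is left ambiguous, unlike the one-vs-rest and one-vs-one constructions.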
Support vector machines (SVMs)
Support vector machines
• Discriminative model
• General idea:
• Find best boundary points (support vectors) and build classifier on top of them
• Linear and non-linear SVMs
Choosing hyperplanes to separate points
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
Among equivalent hyperplanes, choose one that maximizes “margin”
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
Linear SVMs
• Same functional form as perceptron
• Different learning procedure: search for the hyperplane with the largest margin
• Margin=d+ + d- where d+ is distance to closest positive example and d- is distance to closest negative example
Linear Support Vector Machines
• d+ is the distance to the closest positive example; d− is the distance to the closest negative example
• Define the margin of a separating hyperplane to be d+ + d−
• SVMs look for the hyperplane with the largest margin
[Figure: positive (x) and negative (o) examples separated by a hyperplane, with the distances d+ and d− marked and the support vectors highlighted.]
y = sign(∑_{i=1}^{m} wi xi + b)
Constrained optimization for SVMs
Eq1 : x(j) · w + b ⌅ +1 for y(j) = +1Eq2 : x(j) · w + b ⇤ �1 for y(j) = �1
Eq3 : y(j)(x(j) · w + b)� 1 ⌅ 0 ⇧y(j)
H1 : x(j) · w + b = +1H2 : x(j) · w + b = �1
d+ = d� =1
||w||
• Can maximize margin by minimizing ||w|| as it defines the hyperplanes
margin =2
||w||
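The margin identities above can be checked numerically: with the canonical scaling |x · w + b| = 1 at the closest examples, d+ = d− = 1/||w|| and the margin is 2/||w|| (the points and weights below are made up).

```python
# Numeric check of the SVM margin formula for a fixed hyperplane w.x + b = 0.
from math import sqrt

w = [2.0, 0.0]; b = -2.0          # boundary: x1 = 1
norm_w = sqrt(sum(wi * wi for wi in w))

def f(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x_pos = [1.5, 0.7]                 # closest positive example: f(x) = +1
x_neg = [0.5, -0.3]                # closest negative example: f(x) = -1
d_plus = abs(f(x_pos)) / norm_w    # geometric distance to the boundary
d_minus = abs(f(x_neg)) / norm_w
print(d_plus + d_minus, 2 / norm_w)  # both equal the margin
```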
[The Linear SVM figure, repeated with the hyperplane boundaries H1 and H2 (the prediction constraints) marked.]
SVM optimization
• Search: maximize the margin by minimizing 0.5||w||² subject to the constraints in Eq3
• Note: maximizing 2/||w|| is equivalent to minimizing 0.5||w||²
• Introduce Lagrange multipliers (α) for the constraints into the score function to minimize:
  LP = 0.5||w||² − ∑_{i=1}^{I} αi y(i)[x(i) · w + b] + ∑_{i=1}^{I} αi
• Minimize LP with respect to w and b, with αi ≥ 0
• Convex programming problem
• Quadratic programming problem with parameters w, b, α
Constrained optimization
• Linear programming (LP) is a technique for the optimization of a linear objective function, subject to linear constraints on the variables
• Quadratic programming (QP) is a technique for the optimization of a quadratic function of several variables, subject to linear constraints on these variables
SVM components
• Model space
• Set of weights w and b (hyperplane boundary)
• Search algorithm
• Quadratic programming to minimize Lp with constraints
• Score function
• Lp: maximizes margin subject to constraint that all training data is correctly classified
Limitations of linear SVMs
• Linear classifiers cannot deal with:
• Non-linear concepts
• Noisy data
• Solutions:
• Soft margin (e.g., allow mistakes in training data)
• Network of simple linear classifiers (e.g., neural networks)
• Map data into richer feature space (e.g., non-linear features) and then use linear classifier