Data Mining & Machine Learning
TRANSCRIPT
CS37300, Purdue University
October 6, 2017
Extra Credit Competition Update
So far…
So far we have reviewed the Naive Bayes classifier and the decision tree…
We now embark on a quest to find other classifiers
Classifiers for today
• Nearest neighbors
• Linear Regression
• Support vector machines
• Logistic Regression (1-layer neural network)
Classification Task (with C classes)
! Data representation:
◦ Training set: paired attribute vectors and class labels (yi, xi), yi ∈ 𝕐, xi ∈ ℝd, d > 0, for some set 𝕐, or an n × p table with one class label (y) and p − 1 attributes (x)
! Knowledge representation: a function f(x; θ) = y, parameterized by θ
! Model space: all values of θ where f(x; θ) ∈ 𝕐
◦ Construct a model that approximates the mapping between x and y
● Classification: if y is categorical (e.g., {yes, no}, {dog, cat, elephant})
● Regression: if y is real-valued (e.g., stock prices)
Binary classification
! In its simplest form, a classification model defines a decision boundary (h) and labels for each side of the boundary
! Input: x={x1,x2,...,xn} is a set of attributes, function f assigns a label y to input x, where y is a discrete variable with a finite number of values
Nearest Neighbors
Nearest neighbor
• Instance-based method
• Learning
• Stores training data and delays processing until a new instance must be classified
• Assumes that all points are represented in p-dimensional space
• Prediction
• Nearest neighbors are calculated using Euclidean distance
• Classification is made based on class labels of neighbors
1NN
• Training set: (x1,y1), (x2,y2), ..., (xn,yn), where xi = [xi1, xi2, …, xip] is a feature vector of p continuous attributes and yi is a discrete class label
• 1NN algorithm. To predict a class label for a new instance j: find the training instance xi such that d(xi, xj) is minimized, and let f(xj) = yi
• Key idea: find instances that are “similar” to the new instance and use their class labels to make a prediction for the new instance
• 1NN generalizes to kNN when more neighbors are considered
kNN model: decision boundaries
Source: http://cs231n.github.io/classification/
kNN
• kNN algorithm. To predict a class label for a new instance j: find the k nearest neighbors of j, i.e., those that minimize d(xk, xj), and let f(xj) = g(yk), e.g., the majority label among the yk
• Algorithm choices
• How many neighbors to consider (i.e., choice of k)? Usually a small value is used, e.g., k < 10
• What distance measure d( ) to use? The Euclidean (L2) norm distance is often used
• What function g( ) to combine the neighbors’ labels into a prediction? Majority vote is often used
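The three choices above (k, d, and g) can be sketched in a few lines of Python; the function names and toy data here are illustrative, not from the lecture.

```python
# A minimal kNN sketch (pure Python); names and data are illustrative.
from collections import Counter
from math import sqrt

def euclidean(a, b):
    # L2 (Euclidean) distance between two feature vectors
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x_new, k=3, d=euclidean):
    # train: list of (feature_vector, label) pairs.
    # Find the k nearest neighbors of x_new under distance d.
    neighbors = sorted(train, key=lambda pair: d(pair[0], x_new))[:k]
    # g(): combine the neighbors' labels by majority vote
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

train = [([0.0, 0.0], "o"), ([0.1, 0.2], "o"), ([1.0, 1.0], "x"),
         ([0.9, 1.1], "x"), ([1.2, 0.8], "x")]
print(knn_predict(train, [1.0, 0.9], k=3))  # majority of the 3 nearest
```

Note that the entire "model" is the stored training set: all the work happens at prediction time, which is why the lecture calls this an instance-based method.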
1NN decision boundary
• For each training example i, we can calculate its Voronoi cell, which corresponds to the space of points for which i is their nearest neighbor
• All points in such a Voronoi cell are labeled by the class of the training point, forming a Voronoi tessellation of the feature space
Source: http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551-Spring2008/slides/cs551_nonbayesian1.pdf
Nearest neighbor
• Strengths:
• Simple model, easy to implement
• Very efficient learning: O(1)
• Weaknesses:
• Inefficient inference: time and space O(n)
• Curse of dimensionality:
• As the number of features increases, you need an exponential increase in the size of the data to ensure that you have nearby examples for any given data point
- See python notebook on sphere volume
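The sphere-volume effect the notebook refers to can be reproduced with a short calculation: the fraction of the cube [−1, 1]^d occupied by the inscribed unit ball collapses toward zero as d grows, so uniformly scattered points are almost never "near" a query point.

```python
# Curse of dimensionality: the unit ball occupies a vanishing fraction
# of the enclosing cube [-1, 1]^d as the dimension d grows.
from math import pi, gamma

def unit_ball_volume(d):
    # Volume of the d-dimensional ball of radius 1: pi^(d/2) / Gamma(d/2 + 1)
    return pi ** (d / 2) / gamma(d / 2 + 1)

for d in (2, 5, 10, 20):
    fraction = unit_ball_volume(d) / 2 ** d  # ball volume / cube volume
    print(d, fraction)
```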
k-NN learning
• Parameters of the model:
• k (number of neighbors)
• any parameters of distance measure (e.g., weights on features)
• Model space
• Possible tessellations of the feature space
• Search algorithm
• Implicit search: choice of k, d, and g uniquely define a tessellation
• Score function
• Majority vote minimizes the misclassification rate
Least Squares Classifier
! Given x, the features of a car (length, width, mpg, maximum speed, …)
! Classify cars into categories based on x
Motivation
Least Squares Classifier. Two classes of cars:
! xi is a real-valued vector (the features of car i), for i = 1, …, n
! yi is the class of car i:
  yi = +1 if car i is economy, −1 if car i is luxury
! Find linear discriminant weights w: f(x) = wTx + b
! Score function: the least squares error
  score = ∑_{i=1}^{n} (yi − f(xi))²
! Search: find w, b that minimize the score
[Figure: cars plotted by length against maximum speed, with the linear decision boundary f(x) = 0 separating the regions f(x) > 0 and f(x) < 0; the signed distance from a point x to the boundary is f(x)/∥w∥, and the boundary’s offset from the origin is −b/∥w∥.]
Least Squares Solution
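A minimal sketch of the least squares solution in one feature dimension, with labels coded +1/−1 as on the previous slide; the closed-form simple-regression formulas below are standard, but the car data are made up.

```python
# Least squares classifier, one feature: fit f(x) = w*x + b by minimizing
# score = sum_i (y_i - f(x_i))^2, then classify by the sign of f(x).
def fit_least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Closed form: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

def classify(x, w, b):
    # decision boundary at f(x) = w*x + b = 0
    return 1 if w * x + b > 0 else -1

# e.g. car max speed: economy (+1) cars are slow, luxury (-1) cars are fast
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1]
w, b = fit_least_squares(xs, ys)
print(classify(2.0, w, b), classify(8.0, w, b))  # → 1 -1
```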
Issues with Least Squares Classification
[Figure: decision boundaries h and h + ε.]
• score = ∑_{i=1}^{n} (yi − f(xi))²
• The squared-error score cares too much about well-classified items: examples far on the correct side of the boundary still incur a large penalty and can pull the boundary toward them
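A quick numeric illustration of this issue: with labels coded ±1, squared error penalizes a confidently correct prediction more than a mild misclassification.

```python
# Squared error on a single example with target label y = +1.
def squared_error(y, f_x):
    return (y - f_x) ** 2

print(squared_error(1, 1.0))   # exactly on target: 0.0
print(squared_error(1, 5.0))   # confidently correct, yet penalty 16.0
print(squared_error(1, -0.5))  # misclassified, penalty only 2.25
```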
Neural networks
• Analogous to biological systems
• Massive parallelism is computationally efficient
• First learning algorithm in 1959 (Rosenblatt)
• Perceptron learning rule
• Provide target outputs with inputs for a single neuron
• Incrementally update weights to learn to produce outputs
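The perceptron learning rule described above can be sketched as follows; the toy data and parameter values are made up for illustration.

```python
# Perceptron learning rule: for each (x, target) pair, nudge the weights
# only when the current prediction is wrong.
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, epochs=100, lr=0.1):
    # data: list of (feature_vector, target in {-1, +1}); w[0] is the bias
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, t in data:
            y = sign(w[0] + w[1] * x[0] + w[2] * x[1])
            if y != t:  # incremental update on mistakes only
                w[0] += lr * t
                w[1] += lr * t * x[0]
                w[2] += lr * t * x[1]
    return w

# linearly separable toy data (AND-like)
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(data)
print([sign(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in data])  # → [-1, -1, -1, 1]
```

Because the toy data are linearly separable, the perceptron convergence theorem guarantees the loop stops making mistakes within a finite number of updates.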
Neuron
A Neuron
[Figure: an input vector x = (x0, x1, …, xn) is combined with a weight vector w = (w0, w1, …, wn) in a weighted sum, a bias µk is added, and an activation function f produces the output y.]
For example: y = sign(∑_{i=0}^{n} wi xi + µk)
Multi-Layer Perceptron
[Figure: input nodes (input vector xi) connect through weights wij to hidden nodes, which connect to output nodes producing the output vector.]
• Net input to unit j: Ij = ∑_i wij Oi + θj
• Output of unit j (sigmoid): Oj = 1 / (1 + e^{−Ij})
• Error at an output unit j: Errj = Oj (1 − Oj)(Tj − Oj)
• Error at a hidden unit j: Errj = Oj (1 − Oj) ∑_k Errk wjk
• Weight update (learning rate l): wij = wij + (l) Errj Oi
• Bias update: θj = θj + (l) Errj

Activation function: f(x) = wT x + b, with parameters w and b (bias)
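The update rules above can be sketched for a single forward/backward step with one hidden unit feeding one output unit; the weights, inputs, and learning rate below are made-up numbers, not from the lecture.

```python
# One step of the multi-layer perceptron updates: forward pass
# (I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)), then backward pass.
from math import exp

def sigmoid(I):
    return 1.0 / (1.0 + exp(-I))

# forward pass
x = [1.0, 0.5]                      # input vector (the O_i of the input layer)
w_hidden = [0.4, -0.2]; th_h = 0.1  # weights and bias into the hidden unit
O_h = sigmoid(w_hidden[0] * x[0] + w_hidden[1] * x[1] + th_h)
w_out = 0.3; th_o = -0.1            # weight and bias into the output unit
O_o = sigmoid(w_out * O_h + th_o)

# backward pass with target T and learning rate l
T, l = 1.0, 0.5
err_o = O_o * (1 - O_o) * (T - O_o)      # error at the output unit
err_h = O_h * (1 - O_h) * err_o * w_out  # error at the hidden unit (sum over k)
w_out += l * err_o * O_h                 # w_jk = w_jk + (l) Err_k O_j
th_o += l * err_o                        # theta_k = theta_k + (l) Err_k
w_hidden = [wi + l * err_h * xi for wi, xi in zip(w_hidden, x)]
print(O_o, err_o)
```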
Logistic regression
! Task: binary classification
! Data representation: n observations of attributes xi ∈ ℝp and label y ∈ {0,1}
! Knowledge representation: two classes (y = 0, y = 1), where
  P(Y = 1 | X = x) = σ(wT x + b)
! Score function is the negative log-likelihood:
  score = −log P({xi, yi}_{i=1}^{n} | w, b)
        = −∑_{i=1}^{n} [1(yi = 1) log σ(wTxi + b) + 1(yi = 0) log(1 − σ(wTxi + b))]
! Search: find w, b that minimize the score
Logistic function
Logistic (neuron) activation (non-linear filter)
• If the input is x, the output will look like a probability:
  p(y = 1 | x; w) = σ(wT x + b)
• We will represent the logistic function with the symbol σ
• 1(a = b) is one if a = b, zero otherwise
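One way to carry out the search step is plain gradient descent on the negative log-likelihood; the slides do not fix a particular optimizer, so this is just one standard choice, with made-up one-dimensional data.

```python
# Logistic regression by gradient descent on the negative log-likelihood.
from math import exp

def sig(z):
    return 1.0 / (1.0 + exp(-z))

# toy 1-D data: small x -> class 0, large x -> class 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    # gradient of the NLL: sum_i (sigma(w*x_i + b) - y_i) * x_i  (and * 1 for b)
    gw = sum((sig(w * x + b) - y) * x for x, y in zip(xs, ys))
    gb = sum((sig(w * x + b) - y) for x, y in zip(xs, ys))
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

print(sig(w * 1.0 + b), sig(w * 4.0 + b))  # low prob. at x=1, high at x=4
```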
How to Deal with Multiple Classes?
• How to classify objects into multiple types?
Naïve Approach: one vs. many Classification
y(1)c = +1 if car c is “small”, −1 if car c is “luxury”
y(2)c = +1 if car c is “small”, −1 if car c is “medium”
y(3)c = +1 if car c is “medium”, −1 if car c is “luxury”
Might work OK in some scenarios… but it is not clear in this case
Issue with using binary classifiers for K classes
[Figure 4.2: Attempting to construct a K class discriminant from a set of two class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class Ck from points not in class Ck. On the right is an example involving three discriminant functions, each of which is used to separate a pair of classes Ck and Cj.]
…example involving three classes where this approach leads to regions of input space that are ambiguously classified.
An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.
We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form
yk(x) = wkT x + wk0    (4.9)
and then assigning a point x to class Ck if yk(x) > yj(x) for all j ≠ k. The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by
(wk − wj)T x + (wk0 − wj0) = 0.    (4.10)
This has the same form as the decision boundary for the two-class case discussed in Section 4.1.1, and so analogous geometrical properties apply.
The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points xA and xB, both of which lie inside decision region Rk, as illustrated in Figure 4.3. Any point x̂ that lies on the line connecting xA and xB can be expressed in the form
x̂ = λxA + (1 − λ)xB    (4.11)
[Figure: cars labeled small, medium, and luxury, with an uncertain classification region. Figure: C. Bishop]
We will revisit multi-class classification when we see neural networks
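The single K-class discriminant described above amounts to an argmax over K linear scores yk(x) = wkT x + wk0; here is a sketch with made-up weights for the three car classes.

```python
# K-class linear discriminant: compute one linear score per class and
# assign x to the class with the largest score (weights are made up).
def linear_scores(x, W, w0):
    # y_k(x) = w_k . x + w_k0, one score per class k
    return [sum(wk_i * x_i for wk_i, x_i in zip(wk, x)) + b
            for wk, b in zip(W, w0)]

W = [[1.0, 0.0],    # class 0: "small"
     [0.0, 1.0],    # class 1: "medium"
     [-1.0, -1.0]]  # class 2: "luxury"
w0 = [0.0, 0.0, 0.5]

def predict_class(x):
    scores = linear_scores(x, W, w0)
    return scores.index(max(scores))  # argmax_k y_k(x)

print(predict_class([2.0, 0.0]), predict_class([0.0, 2.0]),
      predict_class([-2.0, -2.0]))  # → 0 1 2
```

Because the decision is a single argmax, no region of the input space is left ambiguous, unlike the one-vs-rest and one-vs-one constructions.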
Support vector machines (SVMs)
Support vector machines
• Discriminative model
• General idea:
• Find best boundary points (support vectors) and build classifier on top of them
• Linear and non-linear SVMs
Choosing hyperplanes to separate points
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
Among equivalent hyperplanes, choose one that maximizes “margin”
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
Linear SVMs
• Same functional form as perceptron
• Different learning procedure: search for the hyperplane with the largest margin
• Margin=d+ + d- where d+ is distance to closest positive example and d- is distance to closest negative example
Linear Support Vector Machines
• d+ is the distance to the closest positive example; d− is the distance to the closest negative example
• Define the margin of a separating hyperplane to be d+ + d−
• SVMs look for the hyperplane with the largest margin
[Figure: positive (x) and negative (o) examples separated by a hyperplane, with the distances d+ and d− marked and the support vectors highlighted.]
y = sign(∑_{i=1}^{m} wi xi + b)
Constrained optimization for SVMs
Eq1 : x(j) · w + b ⌅ +1 for y(j) = +1Eq2 : x(j) · w + b ⇤ �1 for y(j) = �1
Eq3 : y(j)(x(j) · w + b)� 1 ⌅ 0 ⇧y(j)
H1 : x(j) · w + b = +1H2 : x(j) · w + b = �1
d+ = d� =1
||w||
• Can maximize margin by minimizing ||w|| as it defines the hyperplanes
margin =2
||w||
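The margin identities above can be checked numerically: with the canonical scaling |x · w + b| = 1 at the closest examples, d+ = d− = 1/||w|| and the margin is 2/||w|| (the points and weights below are made up).

```python
# Numeric check of the SVM margin formula for a fixed hyperplane w.x + b = 0.
from math import sqrt

w = [2.0, 0.0]; b = -2.0          # boundary: x1 = 1
norm_w = sqrt(sum(wi * wi for wi in w))

def f(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x_pos = [1.5, 0.7]                 # closest positive example: f(x) = +1
x_neg = [0.5, -0.3]                # closest negative example: f(x) = -1
d_plus = abs(f(x_pos)) / norm_w    # geometric distance to the boundary
d_minus = abs(f(x_neg)) / norm_w
print(d_plus + d_minus, 2 / norm_w)  # both equal the margin
```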
[The Linear SVM figure, repeated with the hyperplane boundaries H1 and H2 (the prediction constraints) marked.]
SVM optimization
• Search: maximize the margin by minimizing 0.5||w||² subject to the constraints in Eq3
• Note: maximizing 2/||w|| is equivalent to minimizing 0.5||w||²
• Introduce Lagrange multipliers (α) for the constraints into the score function to minimize:
  LP = 0.5||w||² − ∑_{i=1}^{I} αi y(i)[x(i) · w + b] + ∑_{i=1}^{I} αi
• Minimize LP with respect to w and b, with αi ≥ 0
• Convex programming problem
• Quadratic programming problem with parameters w, b, α
Constrained optimization
• Linear programming (LP) is a technique for the optimization of a linear objective function, subject to linear constraints on the variables
• Quadratic programming (QP) is a technique for the optimization of a quadratic function of several variables, subject to linear constraints on these variables
SVM components
• Model space
• Set of weights w and b (hyperplane boundary)
• Search algorithm
• Quadratic programming to minimize Lp with constraints
• Score function
• Lp: maximizes margin subject to constraint that all training data is correctly classified
Limitations of linear SVMs
• Linear classifiers cannot deal with:
• Non-linear concepts
• Noisy data
• Solutions:
• Soft margin (e.g., allow mistakes in training data)
• Network of simple linear classifiers (e.g., neural networks)
• Map data into richer feature space (e.g., non-linear features) and then use linear classifier