
Page 1: Data Mining & Machine Learning

Data Mining & Machine Learning

CS37300Purdue University

October 6, 2017

Page 2: Data Mining & Machine Learning

Extra Credit Competition Update

Page 3: Data Mining & Machine Learning

So far…

So far we have reviewed the Naive Bayes classifier and the decision tree…

We now embark on a quest to find other classifiers

Page 4: Data Mining & Machine Learning

Classifiers for today

• Nearest neighbors

• Linear Regression

• Support vector machines

• Logistic Regression (1-layer neural network)

Page 5: Data Mining & Machine Learning

Classification Task (with C classes)

! Data representation:

◦ Training set: paired attribute vectors and class labels (y_i, x_i), with y_i ∈ 𝕐 and x_i ∈ ℝ^d, d > 0, for some label set 𝕐

◦ or an n × p table with one class label column (y) and p − 1 attribute columns (x)

! Knowledge representation: a function f(x; θ) = y, parameterized by θ

! Model space: all values θ where f(x; θ) ∈ 𝕐

◦ Construct a model that approximates the mapping between x and y

● Classification: if y is categorical (e.g., {yes, no}, {dog, cat, elephant})

● Regression: if y is real-valued (e.g., stock prices)

Page 6: Data Mining & Machine Learning

Binary classification

! In its simplest form, a classification model defines a decision boundary (h) and labels for each side of the boundary

! Input: x = {x_1, x_2, ..., x_p} is a set of attribute values; the function f assigns a label y to input x, where y is a discrete variable with a finite number of values (a tiny example follows)
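As a tiny illustration of such a boundary (the weights, bias, and points below are made-up values, not from the slides), a linear h: w·x + b = 0 splits the space and f labels each side:

```python
import numpy as np

# Hypothetical linear decision boundary h: w.x + b = 0 in two dimensions
w, b = np.array([1.0, -2.0]), 0.5

def f(x):
    """Assign a discrete label depending on which side of the boundary h the input falls."""
    return "yes" if w @ x + b > 0 else "no"

print(f(np.array([3.0, 0.0])), f(np.array([0.0, 3.0])))  # -> yes no
```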

Page 7: Data Mining & Machine Learning

Nearest Neighbors

Page 8: Data Mining & Machine Learning

Nearest neighbor

• Instance-based method

• Learning

• Stores training data and delays processing until a new instance must be classified

• Assumes that all points are represented in p-dimensional space

• Prediction

• Nearest neighbors are calculated using Euclidean distance

• Classification is made based on class labels of neighbors

Page 9: Data Mining & Machine Learning

1NN

• Training set: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i = [x_i1, x_i2, …, x_ip] is a feature vector of p continuous attributes and y_i is a discrete class label

• 1NN algorithm. To predict a class label for a new instance j: find the training instance x_i such that d(x_i, x_j) is minimized, and let f(x_j) = y_i

• Key idea: find instances that are “similar” to the new instance and use their class labels to make a prediction for the new instance

• 1NN generalizes to kNN when more neighbors are considered

Page 10: Data Mining & Machine Learning

kNN model: decision boundaries

Source: http://cs231n.github.io/classification/


Page 11: Data Mining & Machine Learning

kNN

• kNN algorithm. To predict a class label for a new instance j: find the k nearest neighbors of j, i.e., those that minimize d(x_k, x_j), and let f(x_j) = g(y_k), e.g., the majority label in y_k (see the sketch after this list)

• Algorithm choices

• How many neighbors to consider (i.e., choice of k)? Usually a small value is used, e.g., k < 10

• What distance measure d(·) to use? The Euclidean (L2) norm distance is often used

• What function g(·) to use to combine the neighbors’ labels into a prediction? A majority vote is often used
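A minimal sketch of this prediction rule, assuming numpy arrays and the common default choices named above (L2 distance for d, majority vote for g); the function name and shapes are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a class label for x_new from its k nearest training points,
    using the Euclidean (L2) norm for d() and a majority vote for g()."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # d(x_i, x_new) for every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

Setting k=1 recovers the 1NN rule from the previous slide.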

Page 12: Data Mining & Machine Learning

1NN decision boundary

• For each training example i, we can calculate its Voronoi cell, which corresponds to the space of points for which i is their nearest neighbor

• All points in such a Voronoi cell are labeled by the class of the training point, forming a Voronoi tessellation of the feature space

Source: http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551-Spring2008/slides/cs551_nonbayesian1.pdf

Page 13: Data Mining & Machine Learning

Nearest neighbor

• Strengths:

• Simple model, easy to implement

• Very efficient learning: O(1)

• Weaknesses:

• Inefficient inference: time and space O(n)

• Curse of dimensionality:

• As the number of features increases, you need an exponential increase in the size of the data to ensure that any given data point has nearby examples

- See the Python notebook on sphere volume; a minimal sketch of that kind of calculation follows
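A minimal sketch of the kind of calculation such a notebook might show (the actual notebook may differ): the unit d-ball occupies a vanishing fraction of the cube [−1, 1]^d as d grows, so uniformly spread data leaves very few points near any query.

```python
import math

def ball_to_cube_volume_ratio(d):
    """Volume of the unit d-ball divided by the volume of the enclosing cube [-1, 1]^d.
    The ratio collapses toward zero as d grows, one face of the curse of dimensionality."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) / 2 ** d

for d in (1, 2, 5, 10, 20):
    print(d, ball_to_cube_volume_ratio(d))
```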

Page 14: Data Mining & Machine Learning

k-NN learning

• Parameters of the model:

• k (number of neighbors)

• any parameters of distance measure (e.g., weights on features)

• Model space

• Possible tessellations of the feature space

• Search algorithm

• Implicit search: choice of k, d, and g uniquely define a tessellation

• Score function

• A majority vote over the neighbors minimizes the misclassification rate

Page 15: Data Mining & Machine Learning

Least Squares Classifier

Page 16: Data Mining & Machine Learning

Motivation

! Given x, the features of a car (length, width, mpg, maximum speed, …)

! Classify cars into categories based on x

Page 17: Data Mining & Machine Learning

Least Squares Classifier

Two classes of cars, i = 1, …, n:

! x_i is a real-valued vector (the features of car i)

! y_i is the class of car i:

y_i = +1 if car i is an economy car, −1 if it is a luxury car

! Find linear discriminant weights w: f(x) = w^T x + b

! Score function: the least squares error

score = ∑_{i=1}^n (y_i − f(x_i))²

! Search: find w, b that minimize the score

[Figure: car length vs. car max speed with the linear decision boundary f(x) = 0; points with f(x) > 0 fall on one side and f(x) < 0 on the other, and the boundary lies at distance −b/∥w∥ from the origin]
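A minimal numpy sketch of this classifier, solving for w and b with np.linalg.lstsq (a least squares solver); the ±1 labels follow the slide's economy/luxury coding, while the function names are illustrative:

```python
import numpy as np

def fit_least_squares_classifier(X, y):
    """Minimize sum_i (y_i - (w^T x_i + b))^2.
    X: n x p feature matrix; y: labels in {+1 (economy), -1 (luxury)}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant column for the bias b
    wb, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # least squares solution for [w; b]
    return wb[:-1], wb[-1]

def predict(X, w, b):
    """Classify by the sign of f(x) = w^T x + b."""
    return np.sign(X @ w + b)
```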

Page 18: Data Mining & Machine Learning

Least Squares Solution

[Figure: example least squares decision boundaries on 2-D data; adding extra, already well-classified points shifts the fitted boundary from h to h + ε]

Issues with Least Squares Classification

• The score

score = ∑_{i=1}^n (y_i − f(x_i))²

cares too much about well-classified items: points far on the correct side of the boundary still incur a large squared error, so they can drag the boundary away from a good separator.

Page 19: Data Mining & Machine Learning

Neural networks

• Analogous to biological systems

• Massive parallelism is computationally efficient

• First learning algorithm in 1959 (Rosenblatt)

• Perceptron learning rule

• Provide target outputs with inputs for a single neuron

• Incrementally update weights to learn to produce outputs
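A minimal sketch of that rule for a single thresholded neuron; the learning rate, epoch count, and {0, 1} targets are illustrative assumptions:

```python
import numpy as np

def perceptron_train(X, t, lr=0.1, epochs=50):
    """Perceptron learning rule: present inputs with target outputs and
    incrementally nudge the weights whenever the neuron's output is wrong."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = 1.0 if w @ x + b > 0 else 0.0  # thresholded output of the neuron
            w += lr * (target - y) * x         # no change when the output is already correct
            b += lr * (target - y)
    return w, b
```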

Page 20: Data Mining & Machine Learning

A Neuron

• A single neuron computes a weighted sum of the input vector x = (x_0, x_1, …, x_n) with the weight vector w = (w_0, w_1, …, w_n), adds a bias µ_k, and passes the result through an activation function (here, sign):

y = sign( ∑_{i=0}^{n} w_i x_i + µ_k )

• Before the non-linear activation, the neuron is just the linear function f(x) = w^T x + b, with parameters w and bias parameter b

Multi-Layer Perceptron

• An input vector x_i feeds a layer of hidden nodes, whose outputs feed the output nodes that produce the output vector; node i is connected to node j by weight w_ij

• Forward pass: each unit j computes a net input and an output

I_j = ∑_i w_ij O_i + θ_j,   O_j = 1 / (1 + e^{−I_j})

• Errors are propagated backwards from the output layer:

Err_j = O_j (1 − O_j) (T_j − O_j)   for output units (T_j is the target output)

Err_j = O_j (1 − O_j) ∑_k Err_k w_jk   for hidden units

• Weights and biases are then updated with learning rate (l):

w_ij = w_ij + (l) Err_j O_i,   θ_j = θ_j + (l) Err_j
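A minimal numpy sketch of these forward and backward equations for one hidden layer; the layer sizes, learning rate l, and toy data are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 1                      # assumed layer sizes
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))    # w_ij, input -> hidden
th1 = np.zeros(n_hid)                             # theta_j for hidden units
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))   # w_jk, hidden -> output
th2 = np.zeros(n_out)                             # theta_k for output units
l = 0.5                                           # learning rate

def train_step(x, t):
    """One stochastic update following the slide's equations."""
    global W1, th1, W2, th2
    # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
    O_hid = sigmoid(x @ W1 + th1)
    O_out = sigmoid(O_hid @ W2 + th2)
    # Output-layer error: Err_j = O_j (1 - O_j) (T_j - O_j)
    err_out = O_out * (1 - O_out) * (t - O_out)
    # Hidden-layer error: Err_j = O_j (1 - O_j) sum_k Err_k w_jk
    err_hid = O_hid * (1 - O_hid) * (W2 @ err_out)
    # Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
    W2 += l * np.outer(O_hid, err_out)
    th2 += l * err_out
    W1 += l * np.outer(x, err_hid)
    th1 += l * err_hid

# Toy XOR-style data (assumed); may need more epochs or another seed to converge.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
for _ in range(5000):
    for x, t in zip(X, T):
        train_step(x, t)
print(sigmoid(sigmoid(X @ W1 + th1) @ W2 + th2).round(2))
```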

Page 21: Data Mining & Machine Learning

Logistic regression

! Task: binary classification

! Data representation: n observations of attributes x_i ∈ ℝ^p and labels y ∈ {0, 1}

! Knowledge representation: two classes (y = 0, y = 1), with

P(Y = 1 | X = x) = σ(w^T x + b)

where σ is the logistic function (the logistic "neuron" activation, a non-linear filter): if the input is w^T x + b, the output looks like a probability, so p(y = 1 | x; w) = σ(w^T x + b). We represent the logistic function with the symbol σ.

! Score function is the negative log-likelihood:

score = − log P({x_i, y_i}_{i=1}^n | w, b)

= − ∑_{i=1}^n [ 1(y_i = 1) log σ(w^T x_i + b) + 1(y_i = 0) log(1 − σ(w^T x_i + b)) ]

where 1(a = b) is one if a = b, zero otherwise

! Search: find w, b that minimize the score
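A minimal sketch of this score and a plain gradient-descent search for w and b (the course may use a different optimizer; array shapes and step size are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(w, b, X, y):
    """Negative log-likelihood from the slide, with y_i in {0, 1}."""
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient descent on the score; the gradient w.r.t. w is X^T (sigma(Xw + b) - y)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / n
        b -= lr * np.mean(p - y)
    return w, b
```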

Page 22: Data Mining & Machine Learning


How to Deal with Multiple Classes?

Page 23: Data Mining & Machine Learning

• How to classify objects into multiple types?

Naïve Approach: one vs. many Classification

Train a binary classifier for each pair of car categories:

y_c^(1) = +1 if car c is "small",  −1 if car c is "luxury"

y_c^(2) = +1 if car c is "small",  −1 if car c is "medium"

y_c^(3) = +1 if car c is "medium", −1 if car c is "luxury"

Might work OK in some scenarios… but it is not clear that it does in this case

Page 24: Data Mining & Machine Learning

Issue with using binary classifiers for K classes

[Figure 4.2 (C. Bishop): Attempting to construct a K class discriminant from a set of two class discriminants leads to ambiguous regions, shown in green. On the left, two discriminants each distinguish points in a class C_k from points not in C_k (e.g., "small", "medium", "luxury" cars), leaving an uncertain classification region. On the right, three discriminant functions each separate a pair of classes C_k and C_j, which again leaves an ambiguous region.]

Using several one-versus-the-rest discriminants leads to regions of input space that are ambiguously classified.

An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form

y_k(x) = w_k^T x + w_k0    (4.9)

and then assigning a point x to class C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between class C_k and class C_j is therefore given by y_k(x) = y_j(x) and hence corresponds to a (D − 1)-dimensional hyperplane defined by

(w_k − w_j)^T x + (w_k0 − w_j0) = 0.    (4.10)

This has the same form as the decision boundary for the two-class case, and so analogous geometrical properties apply.

The decision regions of such a discriminant are always singly connected and convex: any point x̂ on the line connecting two points x_A and x_B that both lie inside a decision region R_k can be expressed in the form

x̂ = λx_A + (1 − λ)x_B    (4.11)

Source: C. Bishop, Pattern Recognition and Machine Learning (Figure 4.2, Sec. 4.1 Discriminant Functions)
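To make the single K-class discriminant of Eq. (4.9) concrete, here is a minimal sketch (array names and shapes are illustrative assumptions): each class k gets its own linear function, and a point goes to the class whose function is largest, which is what yields the convex, unambiguous regions described above.

```python
import numpy as np

def k_class_predict(X, W, w0):
    """Single K-class linear discriminant: y_k(x) = w_k^T x + w_k0,
    assign x to argmax_k y_k(x).
    X: (n, d) points, W: (K, d) weight vectors, w0: (K,) biases."""
    scores = X @ W.T + w0          # (n, K): one discriminant value per class
    return np.argmax(scores, axis=1)
```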

Page 25: Data Mining & Machine Learning

We will revisit multi-class classification when we see neural networks

Page 26: Data Mining & Machine Learning

Support vector machines (SVMs)

Page 27: Data Mining & Machine Learning

Support vector machines

• Discriminative model

• General idea:

• Find best boundary points (support vectors) and build classifier on top of them

• Linear and non-linear SVMs

Page 28: Data Mining & Machine Learning

Choosing hyperplanes to separate points

Source: Introduction to Data Mining, Tan, Steinbach, and Kumar

Page 29: Data Mining & Machine Learning

Among equivalent hyperplanes, choose one that maximizes “margin”

Source: Introduction to Data Mining, Tan, Steinbach, and Kumar

Page 30: Data Mining & Machine Learning

Linear SVMs

• Same functional form as perceptron

• Different learning procedure: search for the hyperplane with the largest margin

• Margin = d+ + d−, where d+ is the distance from the hyperplane to the closest positive example and d− is the distance to the closest negative example

• SVMs look for the separating hyperplane with the largest margin

• Prediction has the usual linear form:

y = sign( ∑_{i=1}^m w_i x_i + b )

[Figure: a separating hyperplane between positive (x) and negative (o) examples, with distances d+ and d− to the closest examples; the closest examples are the support vectors]

Page 31: Data Mining & Machine Learning

Constrained optimization for SVMs

• Prediction constraints on the training data:

Eq1: x^(j) · w + b ≥ +1  for y^(j) = +1

Eq2: x^(j) · w + b ≤ −1  for y^(j) = −1

Eq3: y^(j) (x^(j) · w + b) − 1 ≥ 0  for all j

• Hyperplane boundaries:

H1: x^(j) · w + b = +1

H2: x^(j) · w + b = −1

• The distance from the separating hyperplane to H1 and to H2 is

d+ = d− = 1 / ||w||,  so the margin = 2 / ||w||

• Can maximize the margin by minimizing ||w||, as it defines the hyperplanes

[Figure: the separating hyperplane between positive (x) and negative (o) examples, with the boundary hyperplanes H1 and H2 at distances d+ and d−; the support vectors lie on H1 and H2]
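As a small numeric illustration of these quantities (the weights, bias, and points are made-up values), the margin is 2/||w|| and the Eq3 constraints can be checked directly:

```python
import numpy as np

def margin_and_feasibility(w, b, X, y):
    """Margin of the separating hyperplane, 2 / ||w||, plus a check of
    Eq3: y_j (x_j . w + b) - 1 >= 0 for every training point."""
    margin = 2.0 / np.linalg.norm(w)
    feasible = bool(np.all(y * (X @ w + b) - 1 >= 0))
    return margin, feasible

# Hypothetical separable data in 2D
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(margin_and_feasibility(np.array([0.5, 0.5]), 0.0, X, y))
```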

Page 32: Data Mining & Machine Learning

SVM optimization

• Search: maximize the margin by minimizing 0.5 ||w||² subject to the constraints in Eq3

• Note: maximizing 2/||w|| is equivalent to minimizing 0.5 ||w||²

• Introduce Lagrange multipliers (α_i) for the constraints into the score function to minimize:

L_P = (1/2) ||w||² − ∑_{i=1}^N α_i y^(i) [x^(i) · w + b] + ∑_{i=1}^N α_i

• Minimize L_P with respect to w and b, subject to α_i ≥ 0

• Convex programming problem

• Quadratic programming problem with parameters w, b, α
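The slides pose training as a quadratic program over w, b, and α. As a hedged illustration only (not the course's reference implementation), the sketch below solves the standard dual form of that QP, which optimizes over the multipliers α alone, using the cvxopt package (assumed installed) and then recovers w and b from the support vectors:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is available

def hard_margin_svm(X, y):
    """Dual of the hard-margin SVM QP:
    maximize sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j (x_i . x_j)
    subject to a_i >= 0 and sum_i a_i y_i = 0 (assumes separable data)."""
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                       # encodes alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.astype(float).reshape(1, -1))   # encodes sum_i alpha_i y_i = 0
    b0 = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b0)["x"]).ravel()
    sv = alpha > 1e-6                            # support vectors have alpha_i > 0
    w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
    b = float(np.mean(y[sv] - X[sv] @ w))        # b from points on the margin
    return w, b, sv
```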

Page 33: Data Mining & Machine Learning

Constrained optimization

• Linear programming (LP) is a technique for the optimization of a linear objective function, subject to linear constraints on the variables

• Quadratic programming (QP) is a technique for the optimization of a quadratic function of several variables, subject to linear constraints on these variables

Page 34: Data Mining & Machine Learning

SVM components

• Model space

• Set of weights w and b (hyperplane boundary)

• Search algorithm

• Quadratic programming to minimize L_P subject to the constraints

• Score function

• L_P: maximizes the margin subject to the constraint that all training data are correctly classified

Page 35: Data Mining & Machine Learning

Limitations of linear SVMs

• Linear classifiers cannot deal with:

• Non-linear concepts

• Noisy data

• Solutions:

• Soft margin (e.g., allow mistakes in training data)

• Network of simple linear classifiers (e.g., neural networks)

• Map data into richer feature space (e.g., non-linear features) and then use linear classifier
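As an illustration of two of these fixes (the soft margin and the richer feature space), the sketch below, assuming scikit-learn is installed and with hypothetical data arrays, uses a soft-margin SVM with a non-linear RBF kernel:

```python
from sklearn.svm import SVC  # assumption: scikit-learn is available

# Soft margin: C trades margin width against training mistakes
# (smaller C tolerates more violations; larger C fits the training data harder).
# Non-linear features: the RBF kernel implicitly maps x into a richer feature
# space, where the usual maximum-margin linear classifier is learned.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# Hypothetical usage with arrays X_train, y_train, X_test:
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```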