classification nearest neighbor

17
Jeff Howbert Introduction to Machine Learning Winter 2012 1 Classification Nearest Neighbor

Upload: neorah

Post on 15-Jan-2016

67 views

Category:

Documents


0 download

DESCRIPTION

Classification Nearest Neighbor. Instance based classifiers. Store the training samples Use training samples to predict the class label of unseen samples. Instance based classifiers. Examples: Rote learner memorize entire training data - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 1

Classification

Nearest Neighbor

Page 2: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 2

Instance based classifiers

Atr1 ……... AtrN ClassA

B

B

C

A

C

B

Set of Stored Cases

Atr1 ……... AtrN

Unseen Case

• Store the training samples

• Use training samples to predict the class label of unseen samples

Page 3: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 3

Examples:

– Rote learner memorize entire training data perform classification only if attributes of test sample match one of the training samples exactly

– Nearest neighbor use k “closest” samples (nearest neighbors) to perform classification

Instance based classifiers

Page 4: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 4

Basic idea:

– If it walks like a duck, quacks like a duck, then it’s probably a duck

Nearest neighbor classifiers

training samples

test sample

compute distance

choose k of the “nearest” samples

Page 5: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 5

Unknown record

Nearest neighbor classifiers

Requires three inputs:

1. The set of stored samples

2. Distance metric to compute distance between samples

3. The value of k, the number of nearest neighbors to retrieve

Page 6: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 6

Nearest neighbor classifiers

To classify unknown record:

1. Compute distance to other training records

2. Identify k nearest neighbors

3. Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)

Unknown record

Page 7: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 7

Definition of nearest neighbor

X X X

(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor

k-nearest neighbors of a sample x are datapoints that have the k smallest distances to x

Page 8: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 8

1-nearest neighbor

Voronoi diagram

Page 9: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 9

Compute distance between two points:

– Euclidean distance

Options for determining the class from nearest neighbor list

– Take majority vote of class labels among the k-nearest neighbors

– Weight the votes according to distance example: weight factor w = 1 / d2

Nearest neighbor classification

i

ii yxd 2)(),( yx

Page 10: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 10

Choosing the value of k:– If k is too small, sensitive to noise points

– If k is too large, neighborhood may include points from other classes

Nearest neighbor classification

X

Page 11: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 11

Scaling issues

– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

– Example: height of a person may vary from 1.5 m to 1.8 m weight of a person may vary from 90 lb to 300 lb income of a person may vary from $10K to $1M

Nearest neighbor classification

Page 12: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 12

Problem with Euclidean measure:

– High dimensional data curse of dimensionality

– Can produce counter-intuitive results

Nearest neighbor classification…

1 1 1 1 1 1 1 1 1 1 1 0

0 1 1 1 1 1 1 1 1 1 1 1

1 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 1vs

d = 1.4142 d = 1.4142

one solution: normalize the vectors to unit length

Page 13: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 13

k-Nearest neighbor classifier is a lazy learner

– Does not build model explicitly.

– Unlike eager learners such as decision tree induction and rule-based systems.

– Classifying unknown samples is relatively expensive.

k-Nearest neighbor classifier is a local model, vs. global model of linear classifiers.

Nearest neighbor classification

Page 14: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 14

PEBLS: Parallel Examplar-Based Learning System (Cost & Salzberg)

– Works with both continuous and nominal featuresFor nominal features, distance between two nominal values is computed using modified value difference metric (MVDM)

– Each sample is assigned a weight factor

– Number of nearest neighbor, k = 1

Example: PEBLS

Page 15: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 15

Example: PEBLS

ClassRefund

Yes No

Yes 0 3

No 3 4

ClassMarital Status

Single Married Divorced

Yes 2 0 1

No 2 4 1

i

ii

n

n

n

nVVd

2

2

1

121 ),(

Distance between nominal attribute values:

d(Single,Married)

= | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1

d(Single,Divorced)

= | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0

d(Married,Divorced)

= | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1

d(Refund=Yes,Refund=No)

= | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

Page 16: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 16

Tid Refund Marital Status

Taxable Income Cheat

X Yes Single 125K No

Y No Married 100K No 10

Example: PEBLS

d

iiiYX YXdwwYX

1

2),(),(

Distance between record X and record Y:

where:

correctly predicts X timesofNumber

predictionfor used is X timesofNumber Xw

wX 1 if X makes accurate prediction most of the time

wX > 1 if X is not reliable for making predictions

Page 17: Classification Nearest Neighbor

Jeff Howbert Introduction to Machine Learning Winter 2012 17

Decision boundaries in global vs. local models

linear regression

• global• stable• can be inaccurate

15-nearest neighbor 1-nearest neighbor

• local• accurate• unstable

What ultimately matters: GENERALIZATION