lecture7 - ibk

Introduction to Machine Learning

Lecture 7: Instance Based Learning

Albert Orriols i Puig

Artificial Intelligence – Machine Learning

Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Recap of Lecture 6

LET’S START WITH DATA CLASSIFICATION


Recap of Lecture 6

Data Set → Classification Model. How?

We are going to deal with:

• Data described by nominal and continuous attributes

• Data that may have instances with missing values


Recap of Lecture 6

We want to build decision trees

How can I automatically generate these types of trees?

Decide which attribute we should put in each node

Decide a split point

Rely on information theory

We also saw many other improvements


Today’s Agenda

• Classification without building a model
• K-Nearest Neighbor (kNN)
• Effect of K
• Distance functions
• Variants of K-NN
• Strengths and weaknesses


Classification without Building a Model

Forget about a global model!

Simply store all the training examples

Build a local model for each new test instance

Referred to as lazy learners

Some approaches to IBL

Nearest neighbors

Locally weighted regression

Case-based reasoning


k-Nearest Neighbors

Algorithm

Store all the training data

Given a new test instance

Recover the k neighbors of the test instance

Predict the majority class among the neighbors

Voronoi Cells: The feature space is decomposed into several cells.

E.g. for k=1
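
The algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `knn_predict`, the use of Euclidean distance, and the toy data are not from the slides.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance is
    an assumption here; other distance functions could be plugged in.
    """
    # "Learning" is just storing `train`; all the work happens at query time.
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    # Majority vote over the labels of the k closest examples.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (5.0, 5.0), k=1))
```

With k=1 each stored example claims the region of space closest to it, which is what the Voronoi decomposition depicts.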


k-Nearest Neighbors

But, where is the learning process?

Is selecting the k neighbors and returning the majority class learning?

No, that’s just retrieving

But still, some important issues

Which k should I use?

Which distance functions should I use?

Should I maintain all instances of the training data set?


Which k Should I Use?

The effect of k

[Figure: 15-NN vs. 1-NN]

Do you remember the discussion about overfitting in C4.5?


Apply the same concepts here!


Which k Should I Use?

Some experimental results on the use of different k

[Figure: test error vs. number of neighbors, with 7-NN marked]

Notice that the test error decreases as k increases, but at k ≈ 5-7, it starts increasing again

Rule of thumb: k=3, k=5, and k=7 seem to work OK in the majority of problems
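
One way to pick k in practice is to estimate the error for several candidate values and keep the best. The sketch below uses leave-one-out error, which is an assumption on my part (the slides do not say how their test error was measured); `knn_predict`, `loo_error`, and the data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest examples (Euclidean distance)."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error: classify each example using all the others."""
    errors = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) != label:
            errors += 1
    return errors / len(data)

# Hypothetical 1-D data: two classes, plus one noisy "b" at 1.5 inside class "a".
data = [((1.0,), "a"), ((1.2,), "a"), ((1.4,), "a"), ((1.6,), "a"),
        ((1.5,), "b"),
        ((4.0,), "b"), ((4.2,), "b"), ((4.4,), "b"), ((4.6,), "b")]
for k in (1, 3, 5, 7):
    print(k, loo_error(data, k))
```

On this toy set the estimated error drops from k=1 to k=3 (the noisy point is outvoted) and rises again at k=7 (the neighborhood starts reaching into the other class), the same shape as the curve described above.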


Distance Functions

Distance functions must be able to handle

Nominal attributes

Continuous attributes

Missing values

The key

They must return a low value for similar objects and a high value for different objects

Seems obvious, right? But still, it is domain dependent

There are many of them. Let’s see some of the most used


Distance Functions

Distance between two points in the same space

d(x, y)

Some properties expected to be satisfied in general

d(x, y) ≥ 0 and d(x, x) = 0

d(x, y) = d(y, x)

d(x, y) + d(y, z) ≥ d(x, z)


Distances for Continuous Variables

Given x = (x_1, …, x_n)’ and y = (y_1, …, y_n)’

Euclidean

$$d_E(x, y) = \left[ \sum_{i=1}^{n} (x_i - y_i)^2 \right]^{1/2}$$

Minkowski

$$d_M(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^q \right]^{1/q}$$

Distance absolute value

$$d_{ABS}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
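
The three distances are closely related: Euclidean is the Minkowski distance with q = 2, and the absolute value distance is the case q = 1. A minimal sketch (function names are illustrative, not from the slides):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)

def absolute(x, y):
    return minkowski(x, y, 1)

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))  # 5.0
print(absolute(x, y))   # 7.0
```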



What if attributes are measured over different scales?

Attribute 1 ranging in [0, 1]

Attribute 2 ranging in [0, 1000]

Can you detect any potential problem in the aforementioned distance functions?


[Figure: two plots, one with x in [0,1] and y in [0,1000], the other with x in [0,1000] and y in [0,1000]]


The larger the scale, the larger the influence of the attribute in the distance function

Solution: Normalize each attribute

How:

Normalization by means of the range

$$d_{norm}^{a}(ex_1^a, ex_2^a) = \frac{d(ex_1^a, ex_2^a)}{\max_a - \min_a}$$

Normalization by means of the standard deviation

$$d_{norm}^{a}(ex_1^a, ex_2^a) = \frac{d(ex_1^a, ex_2^a)}{4\sigma_a}$$
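
Both normalizations can be sketched per attribute as follows. The function names are hypothetical, the per-attribute distance is taken to be the absolute difference, and the population standard deviation (`statistics.pstdev`) is an assumption, since the slides do not say which estimator is meant.

```python
import statistics

def range_normalized_diff(v1, v2, values):
    """|v1 - v2| divided by the attribute's range over the training values.

    Assumes the attribute is not constant (range > 0).
    """
    return abs(v1 - v2) / (max(values) - min(values))

def sigma_normalized_diff(v1, v2, values):
    """|v1 - v2| divided by four standard deviations of the attribute."""
    return abs(v1 - v2) / (4 * statistics.pstdev(values))

# Hypothetical training values for an attribute ranging in [0, 1000].
attr2 = [0.0, 250.0, 500.0, 750.0, 1000.0]
print(range_normalized_diff(250.0, 750.0, attr2))  # 0.5
print(sigma_normalized_diff(250.0, 750.0, attr2))
```

After either normalization, a difference of 500 on this attribute contributes about as much as a difference of 0.5 on an attribute ranging in [0, 1].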

Distances for Nominal Attributes

Several metrics to deal with nominal attributes

Overlap distance function

Idea: Two nominal attributes are equal only if they have the same value
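
The slide's formula did not survive extraction; the usual definition of the overlap function is 0 when the two values are the same and 1 otherwise. A minimal sketch under that assumption (function names are illustrative):

```python
def overlap(v1, v2):
    """Overlap distance for one nominal attribute: 0 if equal, else 1."""
    return 0 if v1 == v2 else 1

def overlap_distance(x, y):
    """Number of nominal attributes on which x and y disagree."""
    return sum(overlap(a, b) for a, b in zip(x, y))

print(overlap_distance(("red", "round", "small"),
                       ("red", "square", "small")))  # 1
```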


Distances for Nominal Attributes

Several metrics to deal with nominal attributes

Value difference metric (VDM)

C = number of classes

P(a, ex_i^a, c) = conditional probability that the output class is c given that the attribute a has the value ex_i^a.

Idea: Two nominal values are similar if they have more similar correlations with the output classes


See (Wilson & Martinez) for more distance functions
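
The VDM formula itself is missing from the extracted slide. A common formulation, assumed here, sums over the classes the q-th power of the difference between the two values' class-conditional probabilities: vdm_a(x, y) = Σ_c |P(a, x, c) − P(a, y, c)|^q. The sketch below estimates those probabilities from counts; the helper names, q = 2, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def class_probabilities(values, labels):
    """Estimate P(class = c | attribute value = v) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    return {v: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for v, cnt in counts.items()}

def vdm(v1, v2, probs, classes, q=2):
    """Value difference metric between two values of one nominal attribute."""
    return sum(abs(probs[v1].get(c, 0.0) - probs[v2].get(c, 0.0)) ** q
               for c in classes)

# Hypothetical training column for one nominal attribute and its class labels.
colors = ["red", "red", "red", "orange", "orange", "orange", "blue", "blue"]
labels = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
probs = class_probabilities(colors, labels)
classes = sorted(set(labels))
print(vdm("red", "orange", probs, classes))  # 0.0
print(vdm("red", "blue", probs, classes))
```

Here "red" and "orange" have the same class distribution, so their distance is 0 even though the values differ, which the overlap function cannot express.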


Distances for Heterogeneous Attributes

What if my data set is described by both nominal and continuous attributes?

Apply the same distance function

Use nominal distance functions for nominal attributes

Use continuous distance function for continuous attributes
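
One possible way to combine the two kinds of attributes, in the spirit of the heterogeneous Euclidean-overlap metric discussed by Wilson & Martinez, is sketched below. The details are assumptions rather than the slides' definition: overlap for nominal attributes, range-normalized absolute difference for continuous ones, distance 1 when a value is missing, and a Euclidean combination of the per-attribute terms.

```python
import math

def heterogeneous_distance(x, y, ranges):
    """Combine per-attribute distances over mixed attribute types.

    `ranges[i]` is max - min for continuous attribute i, or None if
    attribute i is nominal. A missing value is represented by None.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if a is None or b is None:
            d = 1.0                      # missing value: assume maximally different
        elif r is None:
            d = 0.0 if a == b else 1.0   # nominal: overlap distance
        else:
            d = abs(a - b) / r           # continuous: range-normalized difference
        total += d * d
    return math.sqrt(total)

# Hypothetical instances: (color, weight in [0, 100], size in [0, 10]).
ranges = [None, 100.0, 10.0]
x = ("red", 20.0, 5.0)
y = ("blue", 80.0, None)
print(heterogeneous_distance(x, y, ranges))
```

Every per-attribute term lies in [0, 1], so no single attribute dominates the total.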


Variants of kNN

Different variants of kNN

Distance-weighted kNN

Attribute-weighted kNN


Distance-Weighted kNN

Inference of original kNN

The k nearest neighbors vote for the class

Shouldn’t the closest examples have a higher influence in the decision process?

Weight the contribution of each of the k neighbors wrt their distance

E.g.,

$$\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)) \quad \text{where} \quad w_i = \frac{1}{d(x_q, x_i)^2}$$

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$$

More robust to noisy instances and outliers

E.g.: Shepard’s method (Shepard,1968)
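
A minimal sketch of the weighted vote for classification, with each neighbor weighted by the inverse of its squared distance to the query. The function name, the Euclidean distance, and the handling of an exact match (returning its label directly, since the weight would be infinite) are assumptions.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3):
    """Distance-weighted kNN vote with weights 1 / d(query, x_i)^2."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        d = math.dist(x, query)
        if d == 0.0:
            return label  # exact match: its weight would be infinite
        scores[label] += 1.0 / d ** 2
    # The class with the largest total weight wins.
    return max(scores, key=scores.get)

# Hypothetical 1-D data: one close "a" and two farther "b" examples.
train = [((0.0,), "a"), ((3.0,), "b"), ((3.5,), "b")]
print(weighted_knn_predict(train, (1.0,), k=3))
```

With k=3 a plain majority vote would answer "b" (two votes against one); the weighted vote answers "a" because the single close neighbor outweighs the two distant ones.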


Attribute-weighted kNN

What if some attributes are irrelevant or misleading?

If irrelevant: cost increases, but accuracy is not affected

If misleading: cost increases and accuracy may decrease

Weight attributes:

$$d_w(x, y) = \sum_{i=1}^{n} w_i (x_i - y_i)^2$$
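
A short sketch of the attribute-weighted distance, computed as the sum over attributes of w_i (x_i − y_i)^2. The function name and the example weights are hypothetical; setting a weight to 0 removes that attribute from the comparison.

```python
def weighted_distance(x, y, w):
    """Attribute-weighted squared distance: sum_i w_i * (x_i - y_i)^2."""
    return sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w))

x, y = (1.0, 10.0), (2.0, 50.0)
print(weighted_distance(x, y, (1.0, 1.0)))  # 1601.0
print(weighted_distance(x, y, (1.0, 0.0)))  # 1.0
```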

How to determine the weights?

Option 1: The expert provides us with the weights

Option 2: Use a machine learning approach

More will be said in the next lecture!



Strengths and Weaknesses

Strengths of kNN

Building of a new local model for each test instance

Learning has no cost

Empirical results show that the method is highly accurate w.r.t. other machine learning techniques

Weaknesses

Retrieving approach, but does not learn

No global model. The knowledge is not legible

Test cost increases linearly with the input instances

No generalization

Curse of dimensionality: What happens if we have many attributes?


Noise and outliers may have a very negative effect


Next Class

From instance-based to case-based reasoning

A little bit more on learning

Distance functions

Prototype selection

