
Page 1: Nearest Neighbors and Naive Bayes

Mario Martin

Nearest Neighbors and Naive Bayes

Page 2: Nearest Neighbors and Naive Bayes


Simple algorithms, but effective. Two different methods:

• Nearest Neighbor. Non-parametric: a lazy, instance-based learning method that does not build any model.
• Naïve Bayes. Parametric: it builds a probabilistic model of your data following some assumptions.

Baseline algorithms

Page 3: Nearest Neighbors and Naive Bayes

Nearest Neighbor classifier

Instance Based Learning / Lazy Methods

Page 4: Nearest Neighbors and Naive Bayes


Lazy learning methods: they don’t build a model of the data

Assign the label to an observation depending on the labels of “closest” examples

Only requirements:
• A training set
• A similarity measure

Instance Based Learning

Page 5: Nearest Neighbors and Naive Bayes

• K-NN
• Distance Weighted kNN
• How to select K?
• How to solve some problems

Instance Based Learning Algorithms

Page 6: Nearest Neighbors and Naive Bayes

K-Nearest Neighbor algorithm: it interprets each example as a point in a space defined by the features describing the data. In that space, a similarity measure allows us to classify new examples: the class is assigned depending on the K closest examples.

K‐NN


Page 7: Nearest Neighbors and Naive Bayes

1‐NN example

• Two real features (x1, x2) define the space.
• Each red point is a positive example. Black points are negative examples.

Equivalent to drawing the Voronoi diagram of your data.

Page 8: Nearest Neighbors and Naive Bayes

1‐NN example

• Two real features (x1, x2) define the space.
• Each red point is a positive example. Black points are negative examples.

[Figure: query point x and its nearest neighbor]

Equivalent to drawing the Voronoi diagram of your data.

The new data point x is classified as positive.

Page 9: Nearest Neighbors and Naive Bayes


Distance is a parameter of the algorithm. When the dataset is numeric, it is usually the Euclidean distance:

$d(x, y) = \sqrt{\sum_{j} (x_j - y_j)^2}$

In mixed data sets, use Gower or any other appropriate distance measure.

CAVEAT: data should be normalized or standardized in order to give the same relevance to each feature in the computation of the distance.

Distance measures
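A minimal sketch of the caveat above in Python (assuming NumPy; the toy feature values are illustrative): features are standardized before the Euclidean distance is computed.

import numpy as np

# Toy training data: two features on very different scales (illustrative values)
X = np.array([[1.70, 70000.0],
              [1.60, 30000.0],
              [1.80, 55000.0]])

# Standardize each feature: zero mean, unit variance
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# A query point must be standardized with the *training* statistics
q = (np.array([1.75, 40000.0]) - mu) / sigma
distances = [euclidean(q, x) for x in X_std]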

Page 10: Nearest Neighbors and Naive Bayes

Advantages:
• Fast training
• Ability to learn very complex functions

Problems:
• Very slow at testing time; some smart tree-based representation of the data is needed
• Fooled by noise
• Fooled by irrelevant features

Some comments

Page 11: Nearest Neighbors and Naive Bayes

Building more robust classifiers: results do not depend on the closest example but on the k closest examples (hence the name k-nearest neighbours, kNN).

Some comments:

Page 12: Nearest Neighbors and Naive Bayes

3‐Nearest Neighbors

[Figure: query point and its 3 nearest neighbors — 2 x, 1 o]

Page 13: Nearest Neighbors and Naive Bayes

7‐Nearest Neighbors

[Figure: query point and its 7 nearest neighbors — 3 x, 4 o]

Page 14: Nearest Neighbors and Naive Bayes

Parameters:
• A natural number k (odd)
• A training set
• A distance measure

Algorithm:
1. Store all the training set <xi, label(xi)>.
2. Given a new observation xq, compute its k nearest neighbors.
3. Let the k nearest neighbors vote to assign the label to the new data (a code sketch follows below).

K‐NN algorithm
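A minimal sketch of this voting procedure in Python (assuming NumPy; the function name knn_predict is illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training examples."""
    # 1. 'Store' the training set: here it is simply kept in memory as arrays.
    # 2. Compute distances from the query to every training example.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest examples
    # 3. Majority vote among the k nearest neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]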

Page 15: Nearest Neighbors and Naive Bayes

A high value of k has two advantages:
• Smoother decision boundaries
• Reduced sensitivity to noise

But too large values are bad because:
• We lose locality in the decision, since very distant points can interfere in assigning labels
• Computation time increases

The value of k is usually chosen by cross-validation.

How to select k?
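A minimal sketch of choosing k by cross-validation (this assumes scikit-learn is available, which the slides do not mention; the helper select_k and the candidate list are illustrative):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def select_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 11)):
    """Return the odd k with the best mean cross-validated accuracy."""
    scores = {}
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()
    return max(scores, key=scores.get)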

Page 16: Nearest Neighbors and Naive Bayes

A smart variation of kNN. When voting, all k neighbors have the same influence, but some of them are more distant than the others (so they should influence the decision less).

[Figure: k = 5 example, 2 votes vs. 3 votes]

Solution: give more weight to the closest examples.

Distance Weighted kNN

Page 17: Nearest Neighbors and Naive Bayes

Let's define a weight for each of the k closest examples:

$w_i = K\big(d(x_q, x_i)\big)$

where xq is the query point, xi is the i-th closest example, d is the distance function and K is the kernel (a decreasing function of the distance).

The predicted label for xq is computed according to:

$\hat{l}(x_q) = \operatorname{sign}\Big(\sum_{i=1}^{k} w_i\, l(x_i)\Big)$

where l(xi) ∈ {-1, 1} is the label of example xi, and wi is the weight of example xi.

In the previous example, it could be something like:

[Figure: example weights for the k = 5 case]

Distance Weighted kNN
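A minimal sketch of distance-weighted kNN in Python (assuming NumPy, labels stored as an array of -1/+1, and an inverse-distance kernel as one possible choice of K):

import numpy as np

def kernel(d, eps=1e-9):
    """A decreasing function of the distance: inverse distance (one common choice)."""
    return 1.0 / (d + eps)

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    """Distance-weighted kNN; y_train is an np.ndarray of -1/+1 labels."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    weights = kernel(dists[nearest])          # w_i = K(d(x_q, x_i))
    # Weighted vote: sign of the weighted sum of labels
    return int(np.sign(np.sum(weights * y_train[nearest])))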

Page 18: Nearest Neighbors and Naive Bayes

Kernel functions

Examples of kernel functions

Page 19: Nearest Neighbors and Naive Bayes

K-NN is fooled when irrelevant features are widely present in the data set.

For instance, examples are described using 20 attributes, but only 2 of them are relevant to the classification…

The solution consists in feature selection. For instance, use a weighted distance:

$d(x, y) = \sqrt{\sum_{j} z_j\, (x_j - y_j)^2}$

• Limit the weights to 0 and 1. Notice that setting zj = 0 means removing the feature.
• Find the weights z1, …, zn (one for each feature) that minimize the error on a validation data set, using cross-validation.

Problems with irrelevant features
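A minimal sketch of such a weighted distance (assuming NumPy; the 0/1 weight vector z below is illustrative):

import numpy as np

def weighted_distance(x, y, z):
    """Euclidean distance where feature j contributes only if z[j] > 0."""
    return np.sqrt(np.sum(z * (x - y) ** 2))

# Example: 4 features, but only the first two are considered relevant
z = np.array([1.0, 1.0, 0.0, 0.0])
d = weighted_distance(np.array([1.0, 2.0, 9.0, 4.0]),
                      np.array([1.5, 1.0, 0.0, 8.0]), z)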

Page 20: Nearest Neighbors and Naive Bayes

Naïve Bayes

Probabilistic model

Page 21: Nearest Neighbors and Naive Bayes


From the examples in the dataset, we can estimate the likelihood of our data:

$P(x_1, x_2, \ldots, x_n \mid c_i)$

read as the probability of observing an example with features (x1, x2, ..., xn) [xi represents feature i of observation x] in class ci.

But, for classifying an observation (x1, x2, ..., xn), we should look for the class that maximizes the probability of the observation belonging to the class:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$

Naive Bayes basics

Page 22: Nearest Neighbors and Naive Bayes

We will use Bayes' theorem:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}$

Naïve Bayes classifiers

Page 23: Nearest Neighbors and Naive Bayes

P(cj): simply the proportion of elements in class j.

P(x1, x2, …, xn | cj): problem: $|X|^n \cdot |C|$ parameters! It can only be estimated from a very large dataset. Impractical.

Solution: independence assumption (very naïve): attribute values are independent, so in this case we can easily compute

$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$

Computing probabilities

Page 24: Nearest Neighbors and Naive Bayes

P(xk | cj): now we only need $n \cdot |C|$ probability estimates.

Very easy: the number of cases with value xk in class cj over the total number of cases in class cj.

$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$

Solving now, the class assigned to a new observation is:

$c_{NB} = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)$

Equation to be used

Computing probabilities

Page 25: Nearest Neighbors and Naive Bayes

Since probabilities are in the range 0..1, products quickly lead to floating-point underflow errors. Knowing that log(xy) = log(x) + log(y), it is better to work with log(p) than with probabilities.

Now:

$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$

Practical issues
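A minimal sketch of this log-space decision rule in plain Python (the dictionary layout for priors and conditional probabilities is an assumption for illustration; probabilities are assumed already smoothed and non-zero):

import math

def naive_bayes_predict(x, priors, cond_probs):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c).

    priors:     {class: P(c)}
    cond_probs: {class: {feature_index: {value: P(x_i = value | c)}}}
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond_probs[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class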

Page 26: Nearest Neighbors and Naive Bayes

Training set: X, a document corpus. Each document is labeled with f(x) = like/dislike.

Goal: learn a function that predicts, given a new document, whether you will like it or not.

Questions:
• How do we represent documents?
• How do we compute the probabilities?

Example: Learning to classify texts

Page 27: Nearest Neighbors and Naive Bayes

How do we represent documents? Each document is represented as a Bag of Words.

Attributes: all the words that appear in the documents. So each document is represented as a boolean vector of length N: 0 – the word does not appear; 1 – the word appears.

Practical problem: a very large table. Solution: use a sparse representation of the matrix.

Example: Learning to classify texts
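A minimal sketch of a boolean bag of words stored sparsely, in plain Python (only the indices of the words that appear are kept; the helper names are illustrative):

def build_vocabulary(documents):
    """Map each word appearing in the corpus to a column index."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_bow(doc, vocab):
    """Boolean bag of words as a set of indices (absent words are not stored)."""
    return {vocab[w] for w in doc.lower().split() if w in vocab}

docs = ["I like this movie", "I dislike this film"]
vocab = build_vocabulary(docs)
sparse_rows = [to_sparse_bow(d, vocab) for d in docs]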

Page 28: Nearest Neighbors and Naive Bayes

Some numbers:
• 10,000 documents
• 500 words per document
• Maximum theoretical number of words: 50,000 (much less in practice because of word repetitions)

Reducing the number of attributes:
• Remove number (singular/plural) and verbal forms (stemming)
• Remove conjunctions, prepositions and articles (stop words)

Now we have about 10,000 attributes.

Example: Learning to classify texts

Page 29: Nearest Neighbors and Naive Bayes

How to compute the probabilities? First, compute P(v) for each class v ["a priori" probability of the like and dislike classes]:

Example: Learning to classify texts


$v_{NB} = \arg\max_{v \in \{\text{like},\,\text{dislike}\}} P(v) \prod_i P(x_i = \text{word}_i \mid v)$

$P(v_{\text{like}}) = \frac{\#\,\text{documents labeled like}}{\text{total number of documents}} \qquad P(v_{\text{dislike}}) = \frac{\#\,\text{documents labeled dislike}}{\text{total number of documents}}$

Page 30: Nearest Neighbors and Naive Bayes

How to compute the probabilities? Second, compute for each word:

The number of parameters to estimate is not too large: 10,000 words and two classes (so about 20,000).

Example: Learning to classify texts


$v_{NB} = \arg\max_{v \in \{\text{like},\,\text{dislike}\}} P(v) \prod_i P(x_i = \text{word}_i \mid v)$

$P(\text{word}_k \mid v) = \frac{n_k}{n}$

where $n_k$ is the number of training documents of class v in which word k appears, and n is the number of documents of class v.

Page 31: Nearest Neighbors and Naive Bayes

• Problem:

– When nk is low, the estimate is not an accurate probability.
– When nk is 0 for wordk in one class v, then any document containing that word will never be assigned to v (independently of the other words that appear).

$P(\text{word}_k \mid v) = \frac{n_k}{n} = \frac{\#(\text{training documents of class } v \text{ where word } k \text{ appears})}{\#(\text{documents of class } v)}$

Example: Learning to classify texts

Page 32: Nearest Neighbors and Naive Bayes

• Solution: a more robust computation of the probabilities (Laplace smoothing):

$P(\text{word}_k \mid v) = \frac{n_k + m\,p}{n + m}$

• Where:
– nk is the # of documents of class v in which word k appears
– n is the # of documents with label v
– p is an "a priori" estimate of P(xk | v) (for instance, a uniform distribution)
– m is the number of labels

Example: Learning to classify texts
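A minimal sketch of this smoothed estimate in plain Python (the uniform prior p = 1/m follows the choice on the next slide; the helper name is illustrative):

def smoothed_word_prob(n_k, n, m, p=None):
    """Laplace-smoothed estimate P(word_k | v) = (n_k + m*p) / (n + m)."""
    if p is None:
        p = 1.0 / m          # uniform prior over the m labels
    return (n_k + m * p) / (n + m)

# Two classes (Laplace rule): p = 1/2, m = 2  ->  (n_k + 1) / (n + 2)
prob = smoothed_word_prob(n_k=0, n=40, m=2)   # no longer exactly zero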

Page 33: Nearest Neighbors and Naive Bayes

Example: Learning to classify texts

Smoothing:

$P(x_k \mid v) = \frac{n_k + m\,p}{n + m}$

The most common "a priori" choice is a uniform distribution:

1. With two classes: p = 1/2, m = 2 (Laplace rule):

$P(x_k \mid v) = \frac{n_k + 1}{n + 2}$

2. Generic case (c classes): p = 1/c, m = c:

$P(x_k \mid v) = \frac{n_k + 1}{n + c}$

Page 34: Nearest Neighbors and Naive Bayes


Naïve Bayes returns good accuracy results even when the independence assumption is not fulfilled. In fact, the spam / not-spam filter in Thunderbird works this way. It is applied to document filtering (e.g., newsgroups or incoming mail).

Learning and testing times are linear in the number of attributes!

Example: Learning to classify texts

Page 35: Nearest Neighbors and Naive Bayes

Assume that, within each class, each variable follows a normal distribution.

For instance, if 73 is the average of the feature temp for class x, and the standard deviation is 6.2, we compute the conditional probability in the following way:

$P(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

Extension to continuous attributes
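A minimal sketch of this computation in plain Python (μ = 73 and σ = 6.2 come from the example above; the query value 66.0 is illustrative, not from the slide):

import math

def gaussian_cond_prob(x, mu, sigma):
    """P(x_i | c) under a normal distribution with the class's mean and std."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Class x: feature 'temp' has mean 73 and std 6.2 (from the slide's example)
p = gaussian_cond_prob(66.0, mu=73.0, sigma=6.2)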
