
Page 1: Nearest Neighbors and Naive Bayes

Mario Martin

Nearest Neighbors and Naive Bayes

Page 2: Nearest Neighbors and Naive Bayes


Simple algorithms, but effective. Two different methods:

• Nearest Neighbor. Non-parametric: a lazy, instance-based learning method that does not build any model.
• Naïve Bayes. Parametric: it builds a probabilistic model of your data following some assumptions.

Baseline algorithms

Page 3: Nearest Neighbors and Naive Bayes

Nearest Neighbor classifier

Instance Based Learning / Lazy Methods

Page 4: Nearest Neighbors and Naive Bayes


Lazy learning methods: they don’t build a model of the data

Assign the label to an observation depending on the labels of “closest” examples

Only requirements:
• A training set
• A similarity measure

Instance Based Learning

Page 5: Nearest Neighbors and Naive Bayes

• K-NN
• Distance Weighted kNN
• How to select K?
• How to solve some problems

Instance Based Learning Algorithms

Page 6: Nearest Neighbors and Naive Bayes

K-Nearest Neighbor algorithm: it interprets each example as a point in a space defined by the features describing the data. In that space, a similarity measure allows us to classify new examples: the class is assigned depending on the K closest examples.

K‐NN


Page 7: Nearest Neighbors and Naive Bayes

1‐NN example

• Two real features (x1, x2) define the space.
• Each red point is a positive example. Black points are negative examples.

Equivalent to drawing the Voronoi diagram of your data.

Page 8: Nearest Neighbors and Naive Bayes

1‐NN example

• Two real features (x1, x2) define the space.
• Each red point is a positive example. Black points are negative examples.

[Figure: query point x and its nearest neighbor]

Equivalent to drawing the Voronoi diagram of your data.

The new data point x is classified as positive.

Page 9: Nearest Neighbors and Naive Bayes


Distance is a parameter of the algorithm. When the dataset is numeric, it is usually the Euclidean distance:

$d(x, y) = \sqrt{\sum_{j} (x_j - y_j)^2}$

In mixed data sets, use Gower or any other appropriate distance measure.

CAVEAT: data should be normalized or standardized in order to give the same relevance to each feature in the computation of the distance.

Distance measures
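A minimal sketch of the caveat above in Python (assuming NumPy; the toy feature values are illustrative): features are standardized before the Euclidean distance is computed.

import numpy as np

# Toy training data: two features on very different scales (illustrative values)
X = np.array([[1.70, 70000.0],
              [1.60, 30000.0],
              [1.80, 55000.0]])

# Standardize each feature: zero mean, unit variance
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# A query point must be standardized with the *training* statistics
q = (np.array([1.75, 40000.0]) - mu) / sigma
distances = [euclidean(q, x) for x in X_std]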

Page 10: Nearest Neighbors and Naive Bayes

Advantages:
• Fast training
• Ability to learn very complex functions

Problems:
• Very slow at testing time; some smart tree-based representation of the data is needed
• Fooled by noise
• Fooled by irrelevant features

Some comments

Page 11: Nearest Neighbors and Naive Bayes

Building more robust classifiers: results do not depend on the closest example but on the k closest examples (hence the name k-nearest neighbours, kNN).

Some comments:

Page 12: Nearest Neighbors and Naive Bayes

3‐Nearest Neighbors

[Figure: query point and its 3 nearest neighbors — 2 x, 1 o]

Page 13: Nearest Neighbors and Naive Bayes

7‐Nearest Neighbors

[Figure: query point and its 7 nearest neighbors — 3 x, 4 o]

Page 14: Nearest Neighbors and Naive Bayes

Parameters:
• A natural number k (odd)
• A training set
• A distance measure

Algorithm:
1. Store all the training set <xi, label(xi)>.
2. Given a new observation xq, compute its k nearest neighbors.
3. Let the k nearest neighbors vote to assign the label to the new data (a code sketch follows below).

K‐NN algorithm
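A minimal sketch of this voting procedure in Python (assuming NumPy; the function name knn_predict is illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training examples."""
    # 1. 'Store' the training set: here it is simply kept in memory as arrays.
    # 2. Compute distances from the query to every training example.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest examples
    # 3. Majority vote among the k nearest neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]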

Page 15: Nearest Neighbors and Naive Bayes

A high value of k has two advantages:
• Smoother decision boundaries
• Reduced sensitivity to noise

But too large values are bad because:
• We lose locality in the decision, since very distant points can interfere in assigning labels
• Computation time increases

The value of k is usually chosen by cross-validation.

How to select k?
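A minimal sketch of choosing k by cross-validation (this assumes scikit-learn is available, which the slides do not mention; the helper select_k and the candidate list are illustrative):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def select_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 11)):
    """Return the odd k with the best mean cross-validated accuracy."""
    scores = {}
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()
    return max(scores, key=scores.get)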

Page 16: Nearest Neighbors and Naive Bayes

A smart variation of kNN. When voting, all k neighbors have the same influence, but some of them are more distant than the others (so they should influence the decision less).

[Figure: k = 5 example, 2 votes vs. 3 votes]

Solution: give more weight to the closest examples.

Distance Weighted kNN

Page 17: Nearest Neighbors and Naive Bayes

Let's define a weight for each of the k closest examples:

$w_i = K\big(d(x_q, x_i)\big)$

where xq is the query point, xi is the i-th closest example, d is the distance function and K is the kernel (a decreasing function of the distance).

The predicted label for xq is computed according to:

$\hat{l}(x_q) = \operatorname{sign}\Big(\sum_{i=1}^{k} w_i\, l(x_i)\Big)$

where l(xi) ∈ {-1, 1} is the label of example xi, and wi is the weight of example xi.

In the previous example, it could be something like:

[Figure: example weights for the k = 5 case]

Distance Weighted kNN
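A minimal sketch of distance-weighted kNN in Python (assuming NumPy, labels stored as an array of -1/+1, and an inverse-distance kernel as one possible choice of K):

import numpy as np

def kernel(d, eps=1e-9):
    """A decreasing function of the distance: inverse distance (one common choice)."""
    return 1.0 / (d + eps)

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    """Distance-weighted kNN; y_train is an np.ndarray of -1/+1 labels."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    weights = kernel(dists[nearest])          # w_i = K(d(x_q, x_i))
    # Weighted vote: sign of the weighted sum of labels
    return int(np.sign(np.sum(weights * y_train[nearest])))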

Page 18: Nearest Neighbors and Naive Bayes

Kernel functions

Examples of kernel functions

Page 19: Nearest Neighbors and Naive Bayes

K-NN is fooled when irrelevant features are widely present in the data set.

For instance, examples are described using 20 attributes, but only 2 of them are relevant to the classification…

The solution consists in feature selection. For instance, use a weighted distance:

$d(x, y) = \sqrt{\sum_{j} z_j\, (x_j - y_j)^2}$

• Limit the weights to 0 and 1. Notice that setting zj = 0 means removing the feature.
• Find the weights z1, …, zn (one for each feature) that minimize the error on a validation data set, using cross-validation.

Problems with irrelevant features
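A minimal sketch of such a weighted distance (assuming NumPy; the 0/1 weight vector z below is illustrative):

import numpy as np

def weighted_distance(x, y, z):
    """Euclidean distance where feature j contributes only if z[j] > 0."""
    return np.sqrt(np.sum(z * (x - y) ** 2))

# Example: 4 features, but only the first two are considered relevant
z = np.array([1.0, 1.0, 0.0, 0.0])
d = weighted_distance(np.array([1.0, 2.0, 9.0, 4.0]),
                      np.array([1.5, 1.0, 0.0, 8.0]), z)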

Page 20: Nearest Neighbors and Naive Bayes

Naïve Bayes

Probabilistic model

Page 21: Nearest Neighbors and Naive Bayes


From the examples in the dataset, we can estimate the likelihood of our data:

$P(x_1, x_2, \ldots, x_n \mid c_i)$

read as the probability of observing an example with features (x1, x2, ..., xn) [xi represents feature i of observation x] in class ci.

But, for classifying an observation (x1, x2, ..., xn), we should look for the class that maximizes the probability of the observation belonging to the class:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$

Naive Bayes basics

Page 22: Nearest Neighbors and Naive Bayes

We will use Bayes' theorem:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}$

Naïve Bayes classifiers

Page 23: Nearest Neighbors and Naive Bayes

P(cj): simply the proportion of elements in class j.

P(x1, x2, …, xn | cj): problem: $|X|^n \cdot |C|$ parameters! It can only be estimated from a very large dataset. Impractical.

Solution: independence assumption (very naïve): attribute values are independent, so in this case we can easily compute

$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$

Computing probabilities

Page 24: Nearest Neighbors and Naive Bayes

P(xk | cj): now we only need $n \cdot |C|$ probability estimates.

Very easy: the number of cases with value xk in class cj over the total number of cases in class cj.

$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$

Solving now, the class assigned to a new observation is:

$c_{NB} = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)$

Equation to be used

Computing probabilities

Page 25: Nearest Neighbors and Naive Bayes

Since probabilities are in the range 0..1, products quickly lead to floating-point underflow errors. Knowing that log(xy) = log(x) + log(y), it is better to work with log(p) than with probabilities.

Now:

$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$

Practical issues
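A minimal sketch of this log-space decision rule in plain Python (the dictionary layout for priors and conditional probabilities is an assumption for illustration; probabilities are assumed already smoothed and non-zero):

import math

def naive_bayes_predict(x, priors, cond_probs):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c).

    priors:     {class: P(c)}
    cond_probs: {class: {feature_index: {value: P(x_i = value | c)}}}
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond_probs[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class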

Page 26: Nearest Neighbors and Naive Bayes

Training set: X, a document corpus. Each document is labeled with f(x) = like/dislike.

Goal: learn a function that predicts, given a new document, whether you will like it or not.

Questions:
• How do we represent documents?
• How do we compute the probabilities?

Example: Learning to classify texts

Page 27: Nearest Neighbors and Naive Bayes

How do we represent documents? Each document is represented as a Bag of Words.

Attributes: all the words that appear in the documents. So each document is represented as a boolean vector of length N: 0 – the word does not appear; 1 – the word appears.

Practical problem: a very large table. Solution: use a sparse representation of the matrix.

Example: Learning to classify texts
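A minimal sketch of a boolean bag of words stored sparsely, in plain Python (only the indices of the words that appear are kept; the helper names are illustrative):

def build_vocabulary(documents):
    """Map each word appearing in the corpus to a column index."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_bow(doc, vocab):
    """Boolean bag of words as a set of indices (absent words are not stored)."""
    return {vocab[w] for w in doc.lower().split() if w in vocab}

docs = ["I like this movie", "I dislike this film"]
vocab = build_vocabulary(docs)
sparse_rows = [to_sparse_bow(d, vocab) for d in docs]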

Page 28: Nearest Neighbors and Naive Bayes

Some numbers:
• 10,000 documents
• 500 words per document
• Maximum theoretical number of words: 50,000 (much less in practice because of word repetitions)

Reducing the number of attributes:
• Remove number (singular/plural) and verbal forms (stemming)
• Remove conjunctions, prepositions and articles (stop words)

Now we have about 10,000 attributes.

Example: Learning to classify texts

Page 29: Nearest Neighbors and Naive Bayes

How to compute the probabilities? First, compute P(v) for each class v ["a priori" probability of the like and dislike classes]:

Example: Learning to classify texts


$v_{NB} = \arg\max_{v \in \{\text{like},\,\text{dislike}\}} P(v) \prod_i P(x_i = \text{word}_i \mid v)$

$P(v_{\text{like}}) = \frac{\#\,\text{documents labeled like}}{\text{total number of documents}} \qquad P(v_{\text{dislike}}) = \frac{\#\,\text{documents labeled dislike}}{\text{total number of documents}}$

Page 30: Nearest Neighbors and Naive Bayes

How to compute the probabilities? Second, compute for each word:

The number of parameters to estimate is not too large: 10,000 words and two classes (so about 20,000).

Example: Learning to classify texts


$v_{NB} = \arg\max_{v \in \{\text{like},\,\text{dislike}\}} P(v) \prod_i P(x_i = \text{word}_i \mid v)$

$P(\text{word}_k \mid v) = \frac{n_k}{n}$

where $n_k$ is the number of training documents of class v in which word k appears, and n is the number of documents of class v.

Page 31: Nearest Neighbors and Naive Bayes

• Problem:

– When nk is low, the estimate is not an accurate probability.
– When nk is 0 for wordk in one class v, then any document containing that word will never be assigned to v (independently of the other words that appear).

$P(\text{word}_k \mid v) = \frac{n_k}{n} = \frac{\#(\text{training documents of class } v \text{ where word } k \text{ appears})}{\#(\text{documents of class } v)}$

Example: Learning to classify texts

Page 32: Nearest Neighbors and Naive Bayes

• Solution: a more robust computation of the probabilities (Laplace smoothing):

$P(\text{word}_k \mid v) = \frac{n_k + m\,p}{n + m}$

• Where:
– nk is the # of documents of class v in which word k appears
– n is the # of documents with label v
– p is an "a priori" estimate of P(xk | v) (for instance, a uniform distribution)
– m is the number of labels

Example: Learning to classify texts
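A minimal sketch of this smoothed estimate in plain Python (the uniform prior p = 1/m follows the choice on the next slide; the helper name is illustrative):

def smoothed_word_prob(n_k, n, m, p=None):
    """Laplace-smoothed estimate P(word_k | v) = (n_k + m*p) / (n + m)."""
    if p is None:
        p = 1.0 / m          # uniform prior over the m labels
    return (n_k + m * p) / (n + m)

# Two classes (Laplace rule): p = 1/2, m = 2  ->  (n_k + 1) / (n + 2)
prob = smoothed_word_prob(n_k=0, n=40, m=2)   # no longer exactly zero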

Page 33: Nearest Neighbors and Naive Bayes

Example: Learning to classify texts

Smoothing:

$P(x_k \mid v) = \frac{n_k + m\,p}{n + m}$

The most common "a priori" choice is a uniform distribution:

1. With two classes: p = 1/2, m = 2 (Laplace rule):

$P(x_k \mid v) = \frac{n_k + 1}{n + 2}$

2. Generic case (c classes): p = 1/c, m = c:

$P(x_k \mid v) = \frac{n_k + 1}{n + c}$

Page 34: Nearest Neighbors and Naive Bayes


Naïve Bayes returns good accuracy results even when the independence assumption is not fulfilled. In fact, the spam / not-spam filter in Thunderbird works this way. It is applied to document filtering (e.g., newsgroups or incoming mail).

Learning and testing times are linear in the number of attributes!

Example: Learning to classify texts

Page 35: Nearest Neighbors and Naive Bayes

Assume that, within each class, each variable follows a normal distribution.

For instance, if 73 is the average of the feature temp for class x, and the standard deviation is 6.2, we compute the conditional probability in the following way:

$P(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

Extension to continuous attributes
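A minimal sketch of this computation in plain Python (μ = 73 and σ = 6.2 come from the example above; the query value 66.0 is illustrative, not from the slide):

import math

def gaussian_cond_prob(x, mu, sigma):
    """P(x_i | c) under a normal distribution with the class's mean and std."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Class x: feature 'temp' has mean 73 and std 6.2 (from the slide's example)
p = gaussian_cond_prob(66.0, mu=73.0, sigma=6.2)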
