Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference

Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 7a, March 3, 2014, SAGE 3101

Contents
Weighted kNN

require(kknn)
data(iris)
m <- dim(iris)[1]
val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
iris.learn <- iris[-val,] # train
iris.valid <- iris[val,] # test
iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "triangular")
# Possible kernel choices are "rectangular" (which is standard unweighted kNN), "triangular",
# "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)),
# "cos", "inv", "gaussian", "rank" and "optimal".
names(iris.kknn)

• fitted.values – Vector of predictions.
• CL – Matrix of classes of the k nearest neighbors.
• W – Matrix of weights of the k nearest neighbors.
• D – Matrix of distances of the k nearest neighbors.
• C – Matrix of indices of the k nearest neighbors.
• prob – Matrix of predicted class probabilities.
• response – Type of response variable, one of continuous, nominal or ordinal.
• distance – Parameter of Minkowski distance.
• call – The matched call.
• terms – The 'terms' object used.
Look at the output

> head(iris.kknn$W)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.4493696 0.2306555 0.1261857 0.1230131 0.07914805 0.07610159 0.014184110
[2,] 0.7567298 0.7385966 0.5663245 0.3593925 0.35652546 0.24159191 0.004312408
[3,] 0.5958406 0.2700476 0.2594478 0.2558161 0.09317996 0.09317996 0.042096849
[4,] 0.6022069 0.5193145 0.4229427 0.1607861 0.10804205 0.09637177 0.055297983
[5,] 0.7011985 0.6224216 0.5183945 0.2937705 0.16230921 0.13964231 0.053888244
[6,] 0.5898731 0.5270226 0.3273701 0.1791715 0.15297478 0.08446215 0.010180454
Look at the output

> head(iris.kknn$D)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7259100 1.0142464 1.1519716 1.1561541 1.2139825 1.2179988 1.2996261
[2,] 0.2508639 0.2695631 0.4472127 0.6606040 0.6635606 0.7820818 1.0267680
[3,] 0.6498131 1.1736274 1.1906700 1.1965092 1.4579977 1.4579977 1.5401298
[4,] 0.2695631 0.3257349 0.3910409 0.5686904 0.6044323 0.6123406 0.6401741
[5,] 0.7338183 0.9272845 1.1827617 1.7344095 2.0572618 2.1129288 2.3235298
[6,] 0.5674645 0.6544263 0.9306719 1.1357241 1.1719707 1.2667669 1.3695454
Look at the output

> head(iris.kknn$C)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 86 38 43 73 92 85 60
[2,] 31 20 16 21 24 15 7
[3,] 48 80 44 36 50 63 98
[4,] 4 21 25 6 20 26 1
[5,] 68 79 70 65 87 84 75
[6,] 91 97 100 96 83 93 81
> head(iris.kknn$prob)
setosa versicolor virginica
[1,] 0 0.3377079 0.6622921
[2,] 1 0.0000000 0.0000000
[3,] 0 0.8060743 0.1939257
[4,] 1 0.0000000 0.0000000
[5,] 0 0.0000000 1.0000000
[6,] 0 0.0000000 1.0000000
Look at the output

> head(iris.kknn$fitted.values)
[1] virginica setosa versicolor setosa virginica virginica
Levels: setosa versicolor virginica
Contingency tables
fitiris <- fitted(iris.kknn)
table(iris.valid$Species, fitiris)
fitiris
setosa versicolor virginica
setosa 17 0 0
versicolor 0 18 2
virginica 0 1 12
# rectangular – no weighting
fitiris2
setosa versicolor virginica
setosa 17 0 0
versicolor 0 18 2
virginica 0 2 11
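The code that produced fitiris2 is not shown on the slide; a minimal sketch, assuming it is simply the same fit repeated with the rectangular (unweighted) kernel (the object name iris.kknn2 is assumed):

iris.kknn2 <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "rectangular")  # unweighted kNN
fitiris2 <- fitted(iris.kknn2)
table(iris.valid$Species, fitiris2)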
The plot

pcol <- as.character(as.numeric(iris.valid$Species))
pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fitiris) + 1])  # red marks the misclassified points
New dataset – ionosphere

require(kknn)
data(ionosphere)
ionosphere.learn <- ionosphere[1:200,]
ionosphere.valid <- ionosphere[-c(1:200),]
fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
table(ionosphere.valid$class, fit.kknn$fit)
b g
b 19 8
g 2 122
Vary the parameters – ionosphere

> (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
Call:
train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
b g
b 25 4
g 2 120

For comparison, the table from the original kknn fit above:
b g
b 19 8
g 2 122
Alter distance – ionosphere

> (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
Type of response variable: nominal
Minimal misclassification: 0.12
Best kernel: rectangular
Best k: 2
> table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
b g
b 20 5
g 7 119
#1 (the distance = 1 fit above):
b g
b 25 4
g 2 120

#0 (the original kknn fit):
b g
b 19 8
g 2 122
(Weighted) kNN

• Advantages
  – Robust to noisy training data (especially if we use the inverse square of the weighted distance as the "distance")
  – Effective if the training data is large
• Disadvantages
  – Need to determine the value of the parameter k (the number of nearest neighbors)
  – Distance-based learning: it is not clear which type of distance to use, or which attributes to use, to produce the best results. Should we use all attributes or only certain attributes?
Additional factors

• Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close
• Overfitting – does closeness mean the right classification (e.g. noise or incorrect data, like a wrong street address -> wrong lat/lon)? Beware of k = 1! (see the sketch after this list)
• Correlated features – double weighting
• Relative importance – including/excluding features
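A quick way to look at the effect of k (a sketch, not on the slides; fit.k is an assumed name): train.kknn does leave-one-out cross-validation over a range of k and reports the misclassification for each value, so you can check whether k = 1 really is a good choice.

fit.k <- train.kknn(Species ~ ., iris.learn, kmax = 15, kernel = "rectangular")
fit.k$MISCLASS  # leave-one-out misclassification for k = 1..15; k = 1 is rarely the minimum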
More factors

• Sparseness – the standard distance measure (Jaccard) loses meaning when there is no overlap
• Errors – unintentional and intentional
• Computational complexity
• Sensitivity to distance metrics – especially due to different scales (recall ages versus impressions versus clicks, and especially binary values: gender, logged in/not); see the scaling sketch below
• Does not account for changes over time
• Model updating as new data comes in
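One common mitigation for scale sensitivity (a sketch, not from the slides; the object names are illustrative) is to standardize the attributes before computing distances, e.g. with scale(), reusing the earlier train/test split:

iris.scaled <- data.frame(scale(iris[, 1:4]), Species = iris$Species)  # zero mean, unit sd per column
learn.s <- iris.scaled[-val, ]
valid.s <- iris.scaled[val, ]
fit.s <- kknn(Species ~ ., learn.s, valid.s, distance = 1, kernel = "triangular")
table(valid.s$Species, fitted(fit.s))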
Lots of clustering options

• http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis
• Clustergram – this graph is useful in exploratory analysis for non-hierarchical clustering algorithms such as k-means, and for hierarchical clustering algorithms when the number of observations is large enough to make dendrograms impractical.
• (Remember our attempt at a dendrogram for mapmeans?)
Cluster plotting

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # defines source_https() for sourcing code from GitHub
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> head(Data)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] -0.8976739 1.01560199 -1.335752 -1.311052
[2,] -1.1392005 -0.13153881 -1.335752 -1.311052
[3,] -1.3807271 0.32731751 -1.392399 -1.311052
[4,] -1.5014904 0.09788935 -1.279104 -1.311052
[5,] -1.0184372 1.24503015 -1.335752 -1.311052
[6,] -0.5353840 1.93331463 -1.165809 -1.048667
• Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?)
• Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might (this needs more research and thought) tend to move together – hinting at the real number of clusters
• Run the plot multiple times to observe the stability of the cluster formation (and location)
http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale
Any good?

set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data2, k.range = 2:8 , line.width = .004, add.center.points = T)
# why does this produce different plots?
# what defaults are used (kmeans)?
# PCA?? Remember your linear algebra
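Part of the answer is that kmeans() itself starts from random centers, so repeated runs can land in different local optima. A small illustration (a sketch, not on the slides; k1 and k2 are assumed names):

set.seed(1)
k1 <- kmeans(Data2, centers = 3)               # a single random start
k2 <- kmeans(Data2, centers = 3, nstart = 25)  # keep the best of 25 starts
c(k1$tot.withinss, k2$tot.withinss)            # lower total within-cluster SS is better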
How can you tell it is good?

set.seed(250)
Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),
cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)))
clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)
More complex…

set.seed(250)
Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),
cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)))
clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)
Exercise – swiss

par(mfrow = c(2,3))
swiss.x <- scale(as.matrix(swiss[, -1]))
set.seed(1)
for(i in 1:6) clustergram(swiss.x, k.range = 2:6, line.width = 0.01)
clusplot
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
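To go from the dendrogram to actual cluster labels (a small follow-on sketch, not on the slide; memb is an assumed name):

> memb <- cutree(hs, k = 3)  # cut the tree into 3 clusters
> table(memb)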
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
plot(swiss_ctree)
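A quick sanity check (a sketch, not on the slide) is to compare the tree's fitted values with the observed Fertility; as.vector() is used because predict() on a party tree may return a one-column matrix:

plot(swiss$Fertility, as.vector(predict(swiss_ctree)),
     xlab = "observed Fertility", ylab = "tree prediction")
abline(0, 1)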
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
splom extra!

require(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
panel = panel.superpose,
key = list(title = "Three Varieties of Iris",
columns = 3,
points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]),
text = list(c("Setosa", "Versicolor", "Virginica"))))
splom(~iris[1:3]|Species, data = iris,
layout=c(2,2), pscales = 0,
varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
page = function(...) {
ltext(x = seq(.6, .8, length.out = 4),
y = seq(.9, .6, length.out = 4),
labels = c("Three", "Varieties", "of", "Iris"),
cex = 2)
})
parallelplot(~iris[1:4] | Species, iris)
parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))
hclust for iris
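The slide shows only the resulting dendrogram; a minimal sketch of how it could be produced (by analogy with the swiss example above; the object names diris and hiris are assumed):

diris <- dist(as.matrix(iris[, 1:4]))  # distances on the four measurements
hiris <- hclust(diris)
plot(hiris, labels = FALSE)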
plot(iris_ctree)
Ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
2)* weights = 50
1) Petal.Length > 1.9
3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
5)* weights = 46
4) Petal.Length > 4.8
6)* weights = 8
3) Petal.Width > 1.7
7)* weights = 46
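A quick check of how well the tree separates the species (a sketch, not shown on the slide):

> table(predict(iris_ctree), iris$Species)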
> plot(iris_ctree, type = "simple")
New dataset to work with trees

require(rpart)  # rpart() and the kyphosis data
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
printcp(fitK) # display the results
plotcp(fitK) # visualize cross-validation results
summary(fitK) # detailed summary of splits
# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill)
> pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]), "CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")
> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")
randomForest

> require(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF) # view results
Call:
randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 20.99%
Confusion matrix:
absent present class.error
absent 59 5 0.0781250
present 12 5 0.7058824
> importance(fitKF) # importance of each predictor
MeanDecreaseGini
Age 8.654112
Number 5.584019
Start 10.168591
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
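For the kyphosis forest above, two quick follow-ups (a sketch, not on the slides): predict() on a randomForest object without newdata returns the out-of-bag predictions behind the confusion matrix shown earlier, and varImpPlot() plots the importance values.

> table(predict(fitKF), kyphosis$Kyphosis)  # out-of-bag predictions
> varImpPlot(fitKF)                         # graphical view of importance(fitKF)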
More on another dataset

# Regression Tree Example
library(rpart)
# build the tree
fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)
printcp(fitM) # display the results….
Root node error: 1354.6/60 = 22.576
n=60 (57 observations deleted due to missingness)
CP nsplit rel error xerror xstd
1 0.622885 0 1.00000 1.03165 0.176920
2 0.132061 1 0.37711 0.51693 0.102454
3 0.025441 2 0.24505 0.36063 0.079819
4 0.011604 3 0.21961 0.34878 0.080273
5 0.010000 4 0.20801 0.36392 0.075650
Mileage…

plotcp(fitM) # visualize cross-validation results
summary(fitM) # detailed summary of splits
<we will leave this for Friday to look at>
par(mfrow=c(1,2))
rsq.rpart(fitM) # visualize cross-validation results
# plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage ")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfitM<- prune(fitM, cp=0.01160389) # from cptable
# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")
# Conditional Inference Tree for Mileage
fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary))
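The slide then shows the resulting tree; a minimal sketch of the plotting call (assumed, matching the kyphosis ctree example above):

plot(fit2M, main = "Conditional Inference Tree for Mileage")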
Enough of trees!
Bayes

> cl <- kmeans(iris[,1:4], 3)
> table(cl$cluster, iris[,5])
setosa versicolor virginica
2 0 2 36
1 0 48 14
3 50 0 0
# naive Bayes (needs the e1071 package)
> require(e1071)
> m <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(m, iris[,1:4]), iris[,5])
setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
Digging into iris

classifier <- naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
actual
predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
Digging into iris

> classifier$apriori
iris[, 5]
setosa versicolor virginica
50 50 50
> classifier$tables$Petal.Length
Petal.Length
iris[, 5] [,1] [,2]
setosa 1.462 0.1736640
versicolor 4.260 0.4699110
virginica 5.552 0.5518947
Digging into iris

plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")
curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")
http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html

> require(mlbench)
> data(HouseVotes84)
> model <- naiveBayes(Class ~ ., data = HouseVotes84)
> predict(model, HouseVotes84[1:10,-1])
[1] republican republican republican democrat democrat democrat republican republican republican
[10] democrat
Levels: democrat republican
House Votes 1984

> predict(model, HouseVotes84[1:10,-1], type = "raw")
democrat republican
[1,] 1.029209e-07 9.999999e-01
[2,] 5.820415e-08 9.999999e-01
[3,] 5.684937e-03 9.943151e-01
[4,] 9.985798e-01 1.420152e-03
[5,] 9.666720e-01 3.332802e-02
[6,] 8.121430e-01 1.878570e-01
[7,] 1.751512e-04 9.998248e-01
[8,] 8.300100e-06 9.999917e-01
[9,] 8.277705e-08 9.999999e-01
[10,] 1.000000e+00 5.029425e-11
House Votes 1984
> pred <- predict(model, HouseVotes84[,-1])
> table(pred, HouseVotes84$Class)
pred democrat republican
democrat 238 13
republican 29 155
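naiveBayes() also takes a laplace argument for Laplace smoothing; a sketch (the value 3 is arbitrary, for illustration only; model2 and pred2 are assumed names):

> model2 <- naiveBayes(Class ~ ., data = HouseVotes84, laplace = 3)
> pred2 <- predict(model2, HouseVotes84[,-1])
> table(pred2, HouseVotes84$Class)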
So now you could complete this:

> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
Male Female
279 313
> margin.table(HairEyeColor,c(1,3))
Sex
Hair Male Female
Black 56 52
Brown 143 143
Red 34 37
Blond 46 81
Construct a naïve Bayes classifier and test it (one possible approach is sketched below).
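One possible approach (a sketch, not a provided solution; hec, cases and nb are assumed names): expand the contingency table into individual cases, then fit naiveBayes() (from e1071) to predict Sex from Hair and Eye.

> hec <- as.data.frame(HairEyeColor)                    # columns Hair, Eye, Sex, Freq
> cases <- hec[rep(seq_len(nrow(hec)), hec$Freq), 1:3]  # one row per person
> nb <- naiveBayes(Sex ~ Hair + Eye, data = cases)
> table(predict(nb, cases), cases$Sex)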
Assignments to come…
• Term project (A6). Due ~ week 13. 30% (25% written, 5% oral; individual).
• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual);
Coming weeks

• I will be out of town Friday March 21 and 28.
• On March 21 you will have a lab – attendance will be taken – to work on assignments (term project (A6) and Assignment 7). Your project proposals (Assignment 5) are on March 18.
• On March 28 you will have a lecture on SVM, so Tuesday March 25 will be a lab.
• Back to the regular schedule in April (except the 18th).
Admin info (keep/print this slide)

• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.