Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference


Page 1: Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 7a, March 3, 2014, SAGE 3101

Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference

Page 2

Contents

Page 3

Weighted kNN

require(kknn)

data(iris)

m <- dim(iris)[1]

val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))

iris.learn <- iris[-val,] # train

iris.valid <- iris[val,] # test

iris.kknn <- kknn(Species~., iris.learn, iris.valid, distance = 1, kernel = "triangular")
# Possible kernel choices are "rectangular" (which is standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian", "rank" and "optimal".
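The kernel turns each neighbor's distance into a vote weight. The slides use R's kknn; below is a minimal Python sketch of the idea, on made-up 1-D data. The function names and the normalization by the (k+1)-th neighbor's distance are simplifications for illustration, not kknn's exact internals.

```python
# Minimal sketch of weighted kNN with a triangular kernel.
from collections import defaultdict

def manhattan(a, b):
    # distance = 1 in the kknn call above means Minkowski p=1 (Manhattan)
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_knn(train, labels, query, k=7):
    d = sorted((manhattan(x, query), y) for x, y in zip(train, labels))
    # normalize by the distance to the (k+1)-th neighbour (a simplification)
    dk1 = d[k][0] if len(d) > k and d[k][0] > 0 else 1.0
    votes = defaultdict(float)
    for dist_i, label in d[:k]:
        votes[label] += max(1.0 - dist_i / dk1, 0.0)  # triangular kernel
    total = sum(votes.values()) or 1.0
    return {lab: w / total for lab, w in votes.items()}  # class probabilities

train = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,), (1.3,), (5.0,)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
probs = weighted_knn(train, labels, (0.05,), k=3)
print(max(probs, key=probs.get))  # -> a
```

With a rectangular kernel every neighbor would count equally; here closer neighbors count more.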

Page 4

names(iris.kknn)

• fitted.values – vector of predictions.
• CL – matrix of classes of the k nearest neighbors.
• W – matrix of weights of the k nearest neighbors.
• D – matrix of distances of the k nearest neighbors.
• C – matrix of indices of the k nearest neighbors.
• prob – matrix of predicted class probabilities.
• response – type of response variable: continuous, nominal or ordinal.
• distance – parameter of the Minkowski distance.
• call – the matched call.
• terms – the 'terms' object used.

Page 5

Look at the output

> head(iris.kknn$W)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 0.4493696 0.2306555 0.1261857 0.1230131 0.07914805 0.07610159 0.014184110

[2,] 0.7567298 0.7385966 0.5663245 0.3593925 0.35652546 0.24159191 0.004312408

[3,] 0.5958406 0.2700476 0.2594478 0.2558161 0.09317996 0.09317996 0.042096849

[4,] 0.6022069 0.5193145 0.4229427 0.1607861 0.10804205 0.09637177 0.055297983

[5,] 0.7011985 0.6224216 0.5183945 0.2937705 0.16230921 0.13964231 0.053888244

[6,] 0.5898731 0.5270226 0.3273701 0.1791715 0.15297478 0.08446215 0.010180454

Page 6

Look at the output

> head(iris.kknn$D)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 0.7259100 1.0142464 1.1519716 1.1561541 1.2139825 1.2179988 1.2996261

[2,] 0.2508639 0.2695631 0.4472127 0.6606040 0.6635606 0.7820818 1.0267680

[3,] 0.6498131 1.1736274 1.1906700 1.1965092 1.4579977 1.4579977 1.5401298

[4,] 0.2695631 0.3257349 0.3910409 0.5686904 0.6044323 0.6123406 0.6401741

[5,] 0.7338183 0.9272845 1.1827617 1.7344095 2.0572618 2.1129288 2.3235298

[6,] 0.5674645 0.6544263 0.9306719 1.1357241 1.1719707 1.2667669 1.3695454

Page 7

Look at the output

> head(iris.kknn$C)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 86 38 43 73 92 85 60

[2,] 31 20 16 21 24 15 7

[3,] 48 80 44 36 50 63 98

[4,] 4 21 25 6 20 26 1

[5,] 68 79 70 65 87 84 75

[6,] 91 97 100 96 83 93 81

> head(iris.kknn$prob)

setosa versicolor virginica

[1,] 0 0.3377079 0.6622921

[2,] 1 0.0000000 0.0000000

[3,] 0 0.8060743 0.1939257

[4,] 1 0.0000000 0.0000000

[5,] 0 0.0000000 1.0000000

[6,] 0 0.0000000 1.0000000

Page 8

Look at the output

> head(iris.kknn$fitted.values)

[1] virginica setosa versicolor setosa virginica virginica

Levels: setosa versicolor virginica

Page 9

Contingency tables

fitiris <- fitted(iris.kknn)

table(iris.valid$Species, fitiris)

fitiris

setosa versicolor virginica

setosa 17 0 0

versicolor 0 18 2

virginica 0 1 12

# rectangular – no weight

fitiris2

setosa versicolor virginica

setosa 17 0 0

versicolor 0 18 2

virginica 0 2 11
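A contingency table like the ones above just counts (actual, fitted) pairs. A small Python sketch; the counts below mirror the weighted-kNN table on this slide.

```python
# Contingency (confusion) table, like R's table(actual, fitted).
from collections import Counter

def contingency(actual, fitted):
    counts = Counter(zip(actual, fitted))
    rows, cols = sorted(set(actual)), sorted(set(fitted))
    return {(r, c): counts.get((r, c), 0) for r in rows for c in cols}

# label sequences arranged to reproduce the triangular-kernel table above
actual = ["setosa"] * 17 + ["versicolor"] * 20 + ["virginica"] * 13
fitted = (["setosa"] * 17 + ["versicolor"] * 18 + ["virginica"] * 2
          + ["versicolor"] * 1 + ["virginica"] * 12)
tab = contingency(actual, fitted)
for r in sorted(set(actual)):
    print(r, [tab[(r, c)] for c in sorted(set(fitted))])
```

Off-diagonal entries are the misclassifications; the diagonal is where the fit agrees with the truth.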

Page 10

The plot

pcol <- as.character(as.numeric(iris.valid$Species))

pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fitiris)+1]) # fitiris <- fitted(iris.kknn), from the contingency-table slide

Page 11

New dataset - ionosphere

require(kknn)

data(ionosphere)

ionosphere.learn <- ionosphere[1:200,]

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

b g

b 19 8

g 2 122

Page 12

Vary the parameters - ionosphere

> (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,

kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

Call:

train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

> table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

b g

b 25 4

g 2 120

# compare: the initial kknn fit (previous slide) gave
 b g
 b 19 8
 g 2 122

Page 13

Alter distance - ionosphere

> (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,

kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

> table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

b g

b 20 5

g 7 119

#1 (fit.train1, distance = 1):
 b g
 b 25 4
 g 2 120

#0 (initial kknn fit):
 b g
 b 19 8
 g 2 122

Page 14

(Weighted) kNN

• Advantages

– Robust to noisy training data (especially if we use the inverse square of weighted distance as the "distance")

– Effective if the training data is large

• Disadvantages

– Need to determine the value of the parameter k (number of nearest neighbors)
– With distance-based learning it is not clear which type of distance, and which attributes, produce the best results. Should we use all attributes or only certain attributes?

Page 15

Additional factors

• Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close
• Overfitting – does closeness mean right classification? (e.g. noise or incorrect data, like a wrong street address -> wrong lat/lon) – beware of k=1!
• Correlated features – double weighting
• Relative importance – including/excluding features

Page 16

More factors

• Sparseness – the standard distance measure (Jaccard) loses meaning due to no overlap
• Errors – unintentional and intentional
• Computational complexity
• Sensitivity to distance metrics – especially due to different scales (recall ages, versus impressions, versus clicks, and especially binary values: gender, logged in/not)
• Does not account for changes over time
• Model updating as new data comes in
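For illustration, the Jaccard distance mentioned above, sketched in Python with hypothetical ad-feature sets: with no overlap at all the distance saturates at 1, so sparse data makes every pair look equally far apart.

```python
# Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|.
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets: define as identical
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({"ad1", "ad2"}, {"ad2", "ad3"}))  # some overlap: 2/3
print(jaccard_distance({"ad1"}, {"ad9"}))                # no overlap: 1.0
```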

Page 17

Lots of clustering options

• http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis
• Clustergram - This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.
• (remember our attempt at a dendrogram for mapmeans?)

Page 18

Cluster plotting

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source code from github

require(RCurl)

require(colorspace)

source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")

data(iris)

set.seed(250)

par(cex.lab = 1.5, cex.main = 1.2)

Data <- scale(iris[,-5]) # scaling

Page 19

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

> head(Data)

Sepal.Length Sepal.Width Petal.Length Petal.Width

[1,] -0.8976739 1.01560199 -1.335752 -1.311052

[2,] -1.1392005 -0.13153881 -1.335752 -1.311052

[3,] -1.3807271 0.32731751 -1.392399 -1.311052

[4,] -1.5014904 0.09788935 -1.279104 -1.311052

[5,] -1.0184372 1.24503015 -1.335752 -1.311052

[6,] -0.5353840 1.93331463 -1.165809 -1.048667

Page 20

• Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?)

• Observe the strands of the datapoints. Even if the cluster centers are not ordered, the lines for each item might tend to move together (this needs more research and thinking) – hinting at the real number of clusters

• Run the plot multiple times to observe the stability of the cluster formation (and location)

http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
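The clustergram idea in miniature: run k-means for each k in a range and record which center every observation lands on; plotting those centers against k, with a line per observation, is the clustergram. Below is a toy 1-D Lloyd's-algorithm version in Python, not the R clustergram() code.

```python
# Bare-bones 1-D k-means; returns the centre each point is assigned to.
def kmeans_1d(xs, k, iters=20):
    # spread initial centres across the sorted data
    centres = [xs[i * (len(xs) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centres[j]))].append(x)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return [centres[min(range(k), key=lambda j: abs(x - centres[j]))] for x in xs]

xs = [0.0, 0.1, 0.2, 9.9, 10.0, 10.1]  # two obvious groups
for k in (2, 3):
    assigned = kmeans_1d(sorted(xs), k)
    # a clustergram would plot `assigned` against k, one line per point
    print(k, len(set(assigned)), round(min(assigned)), round(max(assigned)))
```

Because k-means starts from (usually random) initial centers, repeated runs can give different plots, which is exactly why the slide suggests running the plot multiple times.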

Page 21

clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale

Page 22

Any good?

set.seed(500)

Data2 <- scale(iris[,-5])

par(cex.lab = 1.2, cex.main = .7)

par(mfrow = c(3,2))

for(i in 1:6) clustergram(Data2, k.range = 2:8 , line.width = .004, add.center.points = T)

# why does this produce different plots?

# what defaults are used (kmeans)

# PCA?? Remember your linear algebra

Page 23

Page 24

How can you tell it is good?

set.seed(250)

Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),

cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)))

clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)

Page 25

More complex…

set.seed(250)

Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),

cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)))

clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)

Page 26

Exercise - swiss

par(mfrow = c(2,3))

swiss.x <- scale(as.matrix(swiss[, -1]))
set.seed(1)

for(i in 1:6) clustergram(swiss.x, k.range = 2:6, line.width = 0.01)

Page 27

clusplot

Page 28

Hierarchical clustering

> dswiss <- dist(as.matrix(swiss))

> hs <- hclust(dswiss)

> plot(hs)
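hclust() works bottom-up: start with every observation in its own cluster and repeatedly merge the closest pair; the merge heights become the dendrogram. A minimal Python sketch using single linkage on 1-D data (hclust's default linkage is actually complete, so this shows the idea, not the default):

```python
# Agglomerative clustering, single linkage, 1-D points.
def agglomerate(points, stop_at=1):
    clusters = [[p] for p in sorted(points)]
    merges = []  # the heights a dendrogram would draw
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(round(d, 6))
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters, merges

clusters, merges = agglomerate([1.0, 1.2, 5.0, 5.1], stop_at=2)
print(sorted(len(c) for c in clusters))  # two groups of two
print(merges)                            # merge heights, smallest first
```

Cutting the dendrogram at a chosen height (between merge heights) gives the flat clusters.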

Page 29

ctree

require(party)

swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)

plot(swiss_ctree)

Page 30

Page 31

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Page 32

splom extra!

require(lattice)

super.sym <- trellis.par.get("superpose.symbol")

splom(~iris[1:4], groups = Species, data = iris,

panel = panel.superpose,

key = list(title = "Three Varieties of Iris",

columns = 3,

points = list(pch = super.sym$pch[1:3],

col = super.sym$col[1:3]),

text = list(c("Setosa", "Versicolor", "Virginica"))))

splom(~iris[1:3]|Species, data = iris,

layout=c(2,2), pscales = 0,

varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),

page = function(...) {

ltext(x = seq(.6, .8, length.out = 4),

y = seq(.9, .6, length.out = 4),

labels = c("Three", "Varieties", "of", "Iris"),

cex = 2)

})

Page 33

parallelplot(~iris[1:4] | Species, iris)

Page 34

parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

Page 35

hclust for iris

Page 36

plot(iris_ctree)

Page 37

Ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264

2)* weights = 50

1) Petal.Length > 1.9

3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894

4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865

5)* weights = 46

4) Petal.Length > 4.8

6)* weights = 8

3) Petal.Width > 1.7

7)* weights = 46
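Splits like "Petal.Length <= 1.9" come from scanning candidate thresholds and keeping the one that best purifies the two sides. ctree uses conditional-inference tests to pick them; the sketch below uses the simpler Gini criterion (as in rpart) on a made-up 10-flower sample.

```python
# Find the threshold on one feature that minimises weighted Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    return (1.0 - sum((c / n) ** 2 for c in Counter(labels).values())) if n else 0.0

def best_split(values, labels):
    best = None
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    return best[1]

# a hypothetical tiny sample in the spirit of the iris data
petal_len = [1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 5.8, 6.1, 5.5]
species = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 3
print(best_split(petal_len, species))  # setosa separates cleanly, as above
```

A tree then recurses on each side of the split until a stopping rule (like ctree's significance criterion) says the node is done, giving the terminal "weights = …" nodes above.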

Page 38

> plot(iris_ctree, type="simple")

Page 39

New dataset to work with trees

require(rpart) # rpart() and the kyphosis data come from the rpart package
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)

printcp(fitK) # display the results

plotcp(fitK) # visualize cross-validation results

summary(fitK) # detailed summary of splits

# plot tree

plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")

text(fitK, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree

post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis") # might need to convert to PDF (distill)

Page 40

Page 41

> pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

Page 42

> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")

Page 43

> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

Page 44

randomForest

> require(randomForest)

> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)

> print(fitKF) # view results

Call:

randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)

Type of random forest: classification

Number of trees: 500

No. of variables tried at each split: 1

OOB estimate of error rate: 20.99%

Confusion matrix:

absent present class.error

absent 59 5 0.0781250

present 12 5 0.7058824

> importance(fitKF) # importance of each predictor

MeanDecreaseGini

Age 8.654112

Number 5.584019

Start 10.168591

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
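The bootstrap-and-vote loop described above, in miniature: the "trees" here are just one-feature threshold stumps on toy data, not real CART trees, so this is a sketch of the ensemble idea rather than the randomForest algorithm.

```python
# Toy random forest: bootstrap samples + weak learners + majority vote.
import random
from collections import Counter

def stump(sample):
    # a depth-1 "tree": split at the midpoint between the two class means
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        return lambda x: int(bool(ones))  # degenerate bootstrap: one class
    t = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > t)

def forest_predict(data, query, n_trees=25, rng=random.Random(42)):
    votes = Counter()
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]  # bootstrap sample
        votes[stump(boot)(query)] += 1
    return votes.most_common(1)[0][0]  # majority vote across the forest

data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.9, 1), (1.1, 1), (1.0, 1)]
print(forest_predict(data, 0.15), forest_predict(data, 0.95))
```

The real algorithm also samples a random subset of variables at each split (the "No. of variables tried at each split" line above) and uses the out-of-bag cases for the OOB error estimate.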

Page 45

More on another dataset

# Regression Tree Example

library(rpart)

# build the tree

fitM <- rpart(Mileage~Price + Country + Reliability + Type, method="anova", data=cu.summary)

printcp(fitM) # display the results …

Root node error: 1354.6/60 = 22.576

n=60 (57 observations deleted due to missingness)

CP nsplit rel error xerror xstd

1 0.622885 0 1.00000 1.03165 0.176920

2 0.132061 1 0.37711 0.51693 0.102454

3 0.025441 2 0.24505 0.36063 0.079819

4 0.011604 3 0.21961 0.34878 0.080273

5 0.010000 4 0.20801 0.36392 0.075650

Page 46

Mileage…

plotcp(fitM) # visualize cross-validation results

summary(fitM) # detailed summary of splits

<we will leave this for Friday to look at>

Page 47

par(mfrow=c(1,2))
rsq.rpart(fitM) # visualize cross-validation results

Page 48

# plot tree

plot(fitM, uniform=TRUE, main="Regression Tree for Mileage ")

text(fitM, use.n=TRUE, all=TRUE, cex=.8)

# prune the tree

pfitM<- prune(fitM, cp=0.01160389) # from cptable

# plot the pruned tree

plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")

text(pfitM, use.n=TRUE, all=TRUE, cex=.8)

post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")

Page 49

Page 50

# Conditional Inference Tree for Mileage

fit2M <- ctree(Mileage~Price + Country + Reliability + Type, data=na.omit(cu.summary))

Page 51

Enough of trees!

Page 52

Bayes

> cl <- kmeans(iris[,1:4], 3)

> table(cl$cluster, iris[,5])

setosa versicolor virginica

2 0 2 36

1 0 48 14

3 50 0 0

#

> require(e1071) # naiveBayes() comes from the e1071 package
> m <- naiveBayes(iris[,1:4], iris[,5])

> table(predict(m, iris[,1:4]), iris[,5])

setosa versicolor virginica

setosa 50 0 0

versicolor 0 47 3

virginica 0 3 47

pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Page 53

Digging into iris

classifier <- naiveBayes(iris[,1:4], iris[,5])

table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))

actual

predicted setosa versicolor virginica

setosa 50 0 0

versicolor 0 47 3

virginica 0 3 47

Page 54

Digging into iris

> classifier$apriori

iris[, 5]

setosa versicolor virginica

50 50 50

> classifier$tables$Petal.Length

Petal.Length

iris[, 5] [,1] [,2]

setosa 1.462 0.1736640

versicolor 4.260 0.4699110

virginica 5.552 0.5518947

Page 55

Digging into iris

plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")

curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")

curve(dnorm(x, 5.552, 0.5518947 ), add=TRUE, col = "green")

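Those per-class means and standard deviations are all naiveBayes() needs at prediction time: multiply the class prior by a normal density per feature and take the argmax. A one-feature Python sketch using the Petal.Length numbers from the slides above:

```python
# Gaussian naive Bayes on a single feature (Petal.Length).
from math import exp, pi, sqrt

def dnorm(x, mean, sd):
    # normal density, same as R's dnorm(x, mean, sd)
    return exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

params = {  # (mean, sd) per class, from classifier$tables$Petal.Length
    "setosa": (1.462, 0.1736640),
    "versicolor": (4.260, 0.4699110),
    "virginica": (5.552, 0.5518947),
}
prior = 1 / 3  # classifier$apriori: 50/50/50

def classify(petal_length):
    scores = {cls: prior * dnorm(petal_length, m, s)
              for cls, (m, s) in params.items()}
    return max(scores, key=scores.get)

print(classify(1.5), classify(4.3), classify(6.0))
```

With all four features, naiveBayes multiplies one such density per feature: the "naive" part is treating the features as independent given the class.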

Page 56

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html

> require(mlbench)

> data(HouseVotes84)

> model <- naiveBayes(Class ~ ., data = HouseVotes84)

> predict(model, HouseVotes84[1:10,-1])

[1] republican republican republican democrat democrat democrat republican republican republican

[10] democrat

Levels: democrat republican

Page 57

House Votes 1984

> predict(model, HouseVotes84[1:10,-1], type = "raw")

democrat republican

[1,] 1.029209e-07 9.999999e-01

[2,] 5.820415e-08 9.999999e-01

[3,] 5.684937e-03 9.943151e-01

[4,] 9.985798e-01 1.420152e-03

[5,] 9.666720e-01 3.332802e-02

[6,] 8.121430e-01 1.878570e-01

[7,] 1.751512e-04 9.998248e-01

[8,] 8.300100e-06 9.999917e-01

[9,] 8.277705e-08 9.999999e-01

[10,] 1.000000e+00 5.029425e-11

Page 58

House Votes 1984

> pred <- predict(model, HouseVotes84[,-1])

> table(pred, HouseVotes84$Class)

pred democrat republican

democrat 238 13

republican 29 155

Page 59

So now you could complete this:

> data(HairEyeColor)

> mosaicplot(HairEyeColor)

> margin.table(HairEyeColor,3)

Sex

Male Female

279 313

> margin.table(HairEyeColor,c(1,3))

Sex

Hair Male Female

Black 56 52

Brown 143 143

Red 34 37

Blond 46 81

Construct a naïve Bayes classifier and test.

Page 60

Assignments to come…

• Term project (A6). Due ~ week 13. 30% (25% written, 5% oral; individual).

• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual);

Page 61

Coming weeks

• I will be out of town Friday March 21 and 28
• On March 21 you will have a lab – attendance will be taken – to work on assignments (term project (A6) and Assignment 7). Your project proposals (Assignment 5) are on March 18.
• On March 28 you will have a lecture on SVM, thus Tuesday March 25 will be a lab.
• Back to regular schedule in April (except the 18th)

Page 62

Admin info (keep/print this slide)

• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.