Lecture: Data Mining in R
732A44 Programming in R

Page 1: Lecture: Data Mining in R (732A44 Programming in R)

Page 2: Logistic regression: two classes

• Consider a logistic model with one predictor: X = price of the car, Y = equipment
• Logistic model (see below)
• Use function glm(formula, family, data)
  – Formula: Response ~ Model
    • A Model consists of a+b (addition), a:b (interaction terms), a*b (addition and interaction), . (all predictors)
  – Family: specify binomial

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \log\frac{P(Y=1\mid X=x)}{1-P(Y=1\mid X=x)} = \beta_0 + \beta_1 x$$

$$P(Y=1\mid X=x) = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)}$$

Page 3: Logistic regression: two classes

reg<-glm(X3...Equipment~Price.in.SEK., family=binomial, data=mydata);
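Once fitted, the model can be inspected and used for prediction; a minimal sketch, reusing reg and mydata from the call above (the 0.5 cutoff is an illustrative choice, not from the slide):

summary(reg);                            # coefficients and their significance
p <- predict(reg, type = "response");    # predicted probabilities P(Y=1|X=x)
table(mydata$X3...Equipment, p > 0.5);   # confusion table at a 0.5 cutoff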

Page 4: Logistic regression: several predictors

Data about contraceptive use

– Response: a matrix of successes/failures
– Several diagnostic plots can be obtained by plot(lrfit)

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \beta_0 + \beta_1^T x$$

$$P(Y=1\mid X=x) = \frac{\exp(\beta_0 + \beta_1^T x)}{1+\exp(\beta_0 + \beta_1^T x)}$$
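A sketch of the matrix-response form: the two-column response counts successes and failures for each covariate pattern. The data frame cuse and its column names are assumed here for illustration; only the matrix-response idea is taken from the slide:

lrfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial, data = cuse);
summary(lrfit);
plot(lrfit);    # several diagnostic plots, as noted above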

Page 5: Logistic regression

Further comments:
• Nominal logistic regression: library mlogit, function mlogit()
• Stepwise model selection: step() function
• Prediction: predict() function (see the sketch below)
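For instance, the two functions can be combined as follows (a sketch reusing reg from the earlier slide; newcars is a hypothetical data frame of new observations):

reg2 <- step(reg);                                    # stepwise selection by AIC
predict(reg2, newdata = newcars, type = "response");  # predicted probabilities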

Page 6: Smoothing splines

Minimize a penalized sum of squared residuals

$$RSS(f,\lambda) = \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 dt$$

where λ is the smoothing parameter:
• λ = 0: any function that interpolates the data
• λ = +∞: the least-squares line fit

Page 7: Smoothing splines

• smooth.spline(x, y, df, spar, cv, …)
  – df: degrees of freedom
  – spar: penalty parameter
  – cv:
    • TRUE = ordinary leave-one-out cross-validation
    • FALSE = generalized cross-validation (GCV)
    • NA = no cross-validation

plot(m2$Kilometer, m2$Price, main = "df=40");
res <- smooth.spline(m2$Kilometer, m2$Price, df = 40);
lines(res, col = "blue");
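Instead of fixing df, the smoothing parameter can be chosen automatically; a minimal sketch using generalized cross-validation (cv = FALSE):

res2 <- smooth.spline(m2$Kilometer, m2$Price, cv = FALSE);  # GCV chooses the penalty
res2$df;                    # effective degrees of freedom selected
lines(res2, col = "red");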

Page 8: Generalized additive models

A function g of the expected response is additive in the set of inputs, i.e.,

$$g(E(Y\mid X_1,\ldots,X_p)) = s_1(X_1) + \ldots + s_p(X_p)$$

Example: nonlinear logistic regression of a binary response

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \log\frac{E(Y\mid X=x)}{1-E(Y\mid X=x)} = s_0 + s(x)$$

Page 9: GAM

Library: mgcv
• gam(formula, family = gaussian, data, method = "GCV.Cp", select = FALSE, sp)
  – method: method for selection of the smoothing parameters
  – select: TRUE means variable selection is performed
  – sp: smoothing parameters (maximal df)
  – formula: usual terms and spline terms s(…)

• Car properties

• predict.gam() can be used for predictions (see the sketch below)

bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1),data=m3)

vis.gam(bp, theta=10, phi=30);
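Prediction works through the usual predict() interface; a sketch where newcars, a data frame with columns WT and SP, is a hypothetical set of new observations:

pred <- predict(bp, newdata = newcars, se.fit = TRUE);
pred$fit;      # predicted MPG
pred$se.fit;   # standard errors of the predictions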

Page 10: GAM

Smoothing components:
plot(bp, pages = 1)

Page 11: Principal components analysis

Idea: introduce a new coordinate system (PC1, PC2, …) where
• The first principal component (PC1) is the direction that maximizes the variance of the projected data
• The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
• …

In the new coordinate system, the coefficients corresponding to the last principal components are very small, so those columns can be dropped.

[Figure: data points in the (X1, X2) plane with the PC1 and PC2 directions overlaid]

Page 12: Principal components analysis

• princomp(x, ...)

m4 <- m3;
m4$MODEL <- c();     # drop the non-numeric MODEL column
res <- princomp(m4);

loadings(res);       # component loadings
plot(res);           # scree plot
biplot(res);
summary(res);
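To work in the reduced coordinate system, keep the scores of the leading components; a minimal sketch using the res object from above:

res$sdev^2 / sum(res$sdev^2);   # proportion of variance per component
scores <- res$scores[, 1:2];    # data expressed in the first two PCs
head(scores);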

Page 13: Decision trees

[Figure: partition of the (X1, X2) input space and the corresponding decision tree, with splits such as X1 < 9, X2 < 16, X2 < 7 and X1 < 15 leading to the class labels 0 and 1]

Page 14: Regression tree example

Page 15: Training-validation-test

• Training-validation split (60/40):

sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
training <- m2[sub, ]
validation <- m2[-sub, ]

• If a training-validation-test split is required, use a similar strategy (see the sketch below)
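A sketch of the analogous three-way split (the 50/25/25 proportions are an illustrative choice):

n <- nrow(m2)
idx <- sample(n)                                        # random permutation of the rows
training   <- m2[idx[1:floor(n * 0.5)], ]
validation <- m2[idx[(floor(n * 0.5) + 1):floor(n * 0.75)], ]
test       <- m2[idx[(floor(n * 0.75) + 1):n], ]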

Page 16: Decision trees by CART

Growing a full tree. Library: tree
• Create tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
  – subset: if a subset of cases is to be used for training
  – split: splitting criterion
  – More parameters via the control argument
• Prune tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
• Prune tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...)
  – K is the number of folds in cross-validation

Page 17: Classification trees: CART

library(tree)
sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree <- tree(Area ~ . - Region - X, data = training);
summary(mytree)
plot(mytree, type = "uniform");
text(mytree, cex = 0.5);

Example: Olive oils in Italy

Page 18: Classification trees: CART

• Dependence of the misclassification rate on the size of the tree:

treeseq1 <- prune.tree(mytree, newdata = validation, method = "misclass")
plot(treeseq1); title("Validation");
treeseq2 <- cv.tree(mytree, method = "misclass")
plot(treeseq2); title("CV");

Page 19: Regression trees: CART

mytree2 <- tree(eicosenoic ~ linoleic + linolenic + palmitic + palmitoleic, data = training);
mytree3 <- prune.tree(mytree2, best = 4)   # 4 leaves in total
print(mytree3)
summary(mytree3)
plot(mytree3)
text(mytree3)

Page 20: Decision trees: other techniques

• Conditional inference trees. Library: party
• CART is also available in another library, rpart

library(party)
training$X <- c(); training$Area <- c();   # drop columns not used as predictors
mytree4 <- ctree(Region ~ ., data = training);
print(mytree4)
plot(mytree4, type = "simple");   # gives nice plots

Page 21: Neural network

• Input nodes, input layer
• [Hidden nodes, hidden layer(s)]
• Output nodes, output layer
• Weights
• Activation functions
• Combination functions

[Figure: feed-forward network with inputs x1 … xp, hidden units z1 … zM and outputs f1 … fK]

Page 22: Neural networks

• Feed-forward NNs. Library: neuralnet
• neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
  – hidden: vector giving the number of hidden neurons in each layer
  – rep: number of training runs of the network
  – startweights: starting weights
  – algorithm: "backprop", "rprop+", "sag", "slr"
  – err.fct: any function, or the built-ins "sse" and "ce" (cross-entropy)
  – act.fct: any function, or the built-ins "logistic" and "tanh"
  – linear.output: TRUE if there is no activation at the output
• confidence.interval(x, alpha = 0.05): confidence intervals for the weights
• compute(x, covariate): prediction
• plot(x, …): plot the given neural network

Page 23: Neural networks

• Example:

library(neuralnet)
mynet <- neuralnet(Region ~ eicosenoic + linoleic + linolenic + palmitic, data = training, rep = 5, hidden = c(2, 2), act.fct = "tanh")
plot(mynet);
mynet$result.matrix

Page 24: Neural networks

• Prediction with compute() (see the sketch below)
• Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks
• Another package, ready for a qualitative response (the classical nnet):

library(nnet)
mynet1 <- nnet(Region ~ eicosenoic + linoleic, data = training, size = 3);
coef(mynet1)
predict(mynet1, newdata = validation);
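A sketch of the compute() route for neuralnet models, assuming a binary 0/1 response y (an illustrative name; the multi-class Region would need one output node per class):

comp <- compute(mynet, validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")]);  # covariates only, in formula order
predicted <- as.numeric(comp$net.result > 0.5);   # threshold the numeric output
tab <- table(validation$y, predicted);            # misclassification table
1 - sum(diag(tab)) / sum(tab);                    # misclassification rate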

Page 25: Clustering

• The purpose is to identify groups of observations that are separated in the input space
  – K-means
  – Hierarchical
  – Density-based

Page 26: K-means

• The number of clusters K must be given
• Starting seed positions are needed
• kmeans(x, centers, iter.max = 10, nstart = 1)
  – x: data frame
  – centers: either the value of K or a set of initial cluster centers
  – iter.max: maximum number of iterations

res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2);
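Because the result depends on the random starting seeds, k-means is often restarted several times; a minimal sketch (set.seed and nstart = 20 are illustrative additions):

set.seed(12345);   # reproducible starting seeds
res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), centers = 2, nstart = 20);  # keep the best of 20 starts
res$centers;       # final cluster centers
res$size;          # cluster sizes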

Page 27: K-means

• One way to visualize:

plot(m5$linoleic, m5$eicosenoic, col = res$cluster);
points(res$centers[, 1], res$centers[, 2], col = 1:2, pch = 8, cex = 2)

Page 28: Hierarchical clustering

• Agglomerative

  – Place each point into its own cluster
  – Merge the nearest clusters until one cluster remains
• What does it mean that two objects are close?
  – A measure of proximity (e.g., for quantitative variables, Euclidean distance)
• Similarity measure s_rs (= 1 if same object, < 1 otherwise)
  – Example: correlation
• Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise)
  – Example: Euclidean distance

Page 29: Hierarchical clustering

• hclust(d, method = "complete", members = NULL)
  – d: dissimilarity structure (e.g., from dist())
  – method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
  Returns: a tree showing the merging sequence
• cutree(tree, k = NULL, h = NULL)
  – k: number of clusters to make
  – h: the level at which to cut
  Returns: cluster indices

Page 30: Hierarchical clustering

• Example

x <- data.frame(m5$linolenic, m5$eicosenoic);
m5_dist <- dist(x);
m5_dend <- hclust(m5_dist, method = "complete")
plot(m5_dend);

Page 31: Hierarchical clustering

• Example

DO NOT forget to standardize!

clust <- cutree(m5_dend, k = 2);
plot(m5$linoleic, m5$eicosenoic, col = clust);
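A sketch of the standardization step: rescale each variable to zero mean and unit variance before computing distances, so that no variable dominates the dissimilarities.

x_std <- scale(x);   # zero mean, unit variance per column
m5_dend2 <- hclust(dist(x_std), method = "complete");
clust2 <- cutree(m5_dend2, k = 2);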

Page 32: Density-based clustering

• Kernel-based density estimation. Library: pdfCluster
• pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
  – x: data to be partitioned
  – h: a vector of smoothing parameters
  – hmult: shrinkage factor

library(pdfCluster)
x <- data.frame(m5$linolenic, m5$eicosenoic);
res <- pdfCluster(x);
plot(res)