Lecture: Data Mining in R
732A44 Programming in R

Page 1: Lecture: Data Mining in R (732A44 Programming in R)

Page 2: Logistic regression: two classes

• Consider a logistic model with one predictor: X = price of the car, Y = equipment
• Logistic model (see below)
• Use function glm(formula, family, data)
  – Formula: Response ~ Model
    • A Model consists of a+b (addition), a:b (interaction terms), a*b (addition and interaction), . (all predictors)
  – Family: specify binomial

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \log\frac{P(Y=1\mid X=x)}{1-P(Y=1\mid X=x)} = \beta_0 + \beta_1 x$$

$$P(Y=1\mid X=x) = \frac{\exp(\beta_0 + \beta_1 x)}{1+\exp(\beta_0 + \beta_1 x)}$$

Page 3: Logistic regression: two classes

reg<-glm(X3...Equipment~Price.in.SEK., family=binomial, data=mydata);
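Once fitted, the model can be inspected and used for prediction; a minimal sketch, reusing reg and mydata from the call above (the 0.5 cutoff is an illustrative choice, not from the slide):

summary(reg);                            # coefficients and their significance
p <- predict(reg, type = "response");    # predicted probabilities P(Y=1|X=x)
table(mydata$X3...Equipment, p > 0.5);   # confusion table at a 0.5 cutoff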

Page 4: Logistic regression: several predictors

Data about contraceptive use

– Response: a matrix of successes/failures
– Several diagnostic plots can be obtained by plot(lrfit)

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \beta_0 + \beta_1^T x$$

$$P(Y=1\mid X=x) = \frac{\exp(\beta_0 + \beta_1^T x)}{1+\exp(\beta_0 + \beta_1^T x)}$$
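A sketch of the matrix-response form: the two-column response counts successes and failures for each covariate pattern. The data frame cuse and its column names are assumed here for illustration; only the matrix-response idea is taken from the slide:

lrfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial, data = cuse);
summary(lrfit);
plot(lrfit);    # several diagnostic plots, as noted above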

Page 5: Logistic regression

Further comments:
• Nominal logistic regression: library mlogit, function mlogit()
• Stepwise model selection: step() function
• Prediction: predict() function (see the sketch below)
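For instance, the two functions can be combined as follows (a sketch reusing reg from the earlier slide; newcars is a hypothetical data frame of new observations):

reg2 <- step(reg);                                    # stepwise selection by AIC
predict(reg2, newdata = newcars, type = "response");  # predicted probabilities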

Page 6: Smoothing splines

Minimize a penalized sum of squared residuals

$$RSS(f,\lambda) = \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 dt$$

where λ is the smoothing parameter:
• λ = 0: any function that interpolates the data
• λ = +∞: the least-squares line fit

Page 7: Smoothing splines

• smooth.spline(x, y, df, spar, cv, …)
  – df: degrees of freedom
  – spar: penalty parameter
  – cv:
    • TRUE = ordinary leave-one-out cross-validation
    • FALSE = generalized cross-validation (GCV)
    • NA = no cross-validation

plot(m2$Kilometer, m2$Price, main = "df=40");
res <- smooth.spline(m2$Kilometer, m2$Price, df = 40);
lines(res, col = "blue");
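Instead of fixing df, the smoothing parameter can be chosen automatically; a minimal sketch using generalized cross-validation (cv = FALSE):

res2 <- smooth.spline(m2$Kilometer, m2$Price, cv = FALSE);  # GCV chooses the penalty
res2$df;                    # effective degrees of freedom selected
lines(res2, col = "red");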

Page 8: Generalized additive models

A function g of the expected response is additive in the set of inputs, i.e.,

$$g(E(Y\mid X_1,\ldots,X_p)) = s_1(X_1) + \ldots + s_p(X_p)$$

Example: nonlinear logistic regression of a binary response

$$\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \log\frac{E(Y\mid X=x)}{1-E(Y\mid X=x)} = s_0 + s(x)$$

Page 9: GAM

Library: mgcv
• gam(formula, family = gaussian, data, method = "GCV.Cp", select = FALSE, sp)
  – method: method for selection of the smoothing parameters
  – select: TRUE means variable selection is performed
  – sp: smoothing parameters (maximal df)
  – formula: usual terms and spline terms s(…)

• Car properties

• predict.gam() can be used for predictions (see the sketch below)

bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1),data=m3)

vis.gam(bp, theta=10, phi=30);
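Prediction works through the usual predict() interface; a sketch where newcars, a data frame with columns WT and SP, is a hypothetical set of new observations:

pred <- predict(bp, newdata = newcars, se.fit = TRUE);
pred$fit;      # predicted MPG
pred$se.fit;   # standard errors of the predictions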

Page 10: GAM

Smoothing components:
plot(bp, pages = 1)

Page 11: Principal components analysis

Idea: introduce a new coordinate system (PC1, PC2, …) where
• The first principal component (PC1) is the direction that maximizes the variance of the projected data
• The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
• …

In the new coordinate system, the coefficients corresponding to the last principal components are very small, so those columns can be dropped.

[Figure: data points in the (X1, X2) plane with the PC1 and PC2 directions overlaid]

Page 12: Principal components analysis

• princomp(x, ...)

m4 <- m3;
m4$MODEL <- c();     # drop the non-numeric MODEL column
res <- princomp(m4);

loadings(res);       # component loadings
plot(res);           # scree plot
biplot(res);
summary(res);
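To work in the reduced coordinate system, keep the scores of the leading components; a minimal sketch using the res object from above:

res$sdev^2 / sum(res$sdev^2);   # proportion of variance per component
scores <- res$scores[, 1:2];    # data expressed in the first two PCs
head(scores);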

Page 13: Decision trees

[Figure: partition of the (X1, X2) input space and the corresponding decision tree, with splits such as X1 < 9, X2 < 16, X2 < 7 and X1 < 15 leading to the class labels 0 and 1]

Page 14: Regression tree example

Page 15: Training-validation-test

• Training-validation split (60/40):

sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
training <- m2[sub, ]
validation <- m2[-sub, ]

• If a training-validation-test split is required, use a similar strategy (see the sketch below)
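A sketch of the analogous three-way split (the 50/25/25 proportions are an illustrative choice):

n <- nrow(m2)
idx <- sample(n)                                        # random permutation of the rows
training   <- m2[idx[1:floor(n * 0.5)], ]
validation <- m2[idx[(floor(n * 0.5) + 1):floor(n * 0.75)], ]
test       <- m2[idx[(floor(n * 0.75) + 1):n], ]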

Page 16: Decision trees by CART

Growing a full tree. Library: tree
• Create tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
  – subset: if a subset of cases is to be used for training
  – split: splitting criterion
  – More parameters via the control argument
• Prune tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
• Prune tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...)
  – K is the number of folds in cross-validation

Page 17: Classification trees: CART

library(tree)
sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree <- tree(Area ~ . - Region - X, data = training);
summary(mytree)
plot(mytree, type = "uniform");
text(mytree, cex = 0.5);

Example: Olive oils in Italy

Page 18: Classification trees: CART

• Dependence of the misclassification rate on the size of the tree:

treeseq1 <- prune.tree(mytree, newdata = validation, method = "misclass")
plot(treeseq1); title("Validation");
treeseq2 <- cv.tree(mytree, method = "misclass")
plot(treeseq2); title("CV");

Page 19: Regression trees: CART

mytree2 <- tree(eicosenoic ~ linoleic + linolenic + palmitic + palmitoleic, data = training);
mytree3 <- prune.tree(mytree2, best = 4)   # 4 leaves in total
print(mytree3)
summary(mytree3)
plot(mytree3)
text(mytree3)

Page 20: Decision trees: other techniques

• Conditional inference trees. Library: party
• CART is also available in another library, rpart

library(party)
training$X <- c(); training$Area <- c();   # drop columns not used as predictors
mytree4 <- ctree(Region ~ ., data = training);
print(mytree4)
plot(mytree4, type = "simple");   # gives nice plots

Page 21: Neural network

• Input nodes, input layer
• [Hidden nodes, hidden layer(s)]
• Output nodes, output layer
• Weights
• Activation functions
• Combination functions

[Figure: feed-forward network with inputs x1 … xp, hidden units z1 … zM and outputs f1 … fK]

Page 22: Neural networks

• Feed-forward NNs. Library: neuralnet
• neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
  – hidden: vector giving the number of hidden neurons in each layer
  – rep: number of training runs of the network
  – startweights: starting weights
  – algorithm: "backprop", "rprop+", "sag", "slr"
  – err.fct: any function, or the built-ins "sse" and "ce" (cross-entropy)
  – act.fct: any function, or the built-ins "logistic" and "tanh"
  – linear.output: TRUE if there is no activation at the output
• confidence.interval(x, alpha = 0.05): confidence intervals for the weights
• compute(x, covariate): prediction
• plot(x, …): plot the given neural network

Page 23: Neural networks

• Example:

library(neuralnet)
mynet <- neuralnet(Region ~ eicosenoic + linoleic + linolenic + palmitic, data = training, rep = 5, hidden = c(2, 2), act.fct = "tanh")
plot(mynet);
mynet$result.matrix

Page 24: Neural networks

• Prediction with compute() (see the sketch below)
• Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks
• Another package, ready for a qualitative response (the classical nnet):

library(nnet)
mynet1 <- nnet(Region ~ eicosenoic + linoleic, data = training, size = 3);
coef(mynet1)
predict(mynet1, newdata = validation);
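A sketch of the compute() route for neuralnet models, assuming a binary 0/1 response y (an illustrative name; the multi-class Region would need one output node per class):

comp <- compute(mynet, validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")]);  # covariates only, in formula order
predicted <- as.numeric(comp$net.result > 0.5);   # threshold the numeric output
tab <- table(validation$y, predicted);            # misclassification table
1 - sum(diag(tab)) / sum(tab);                    # misclassification rate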

Page 25: Clustering

• The purpose is to identify groups of observations that are separated in the input space
  – K-means
  – Hierarchical
  – Density-based

Page 26: K-means

• The number of clusters K must be given
• Starting seed positions are needed
• kmeans(x, centers, iter.max = 10, nstart = 1)
  – x: data frame
  – centers: either the value of K or a set of initial cluster centers
  – iter.max: maximum number of iterations

res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2);
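Because the result depends on the random starting seeds, k-means is often restarted several times; a minimal sketch (set.seed and nstart = 20 are illustrative additions):

set.seed(12345);   # reproducible starting seeds
res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), centers = 2, nstart = 20);  # keep the best of 20 starts
res$centers;       # final cluster centers
res$size;          # cluster sizes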

Page 27: K-means

• One way to visualize:

plot(m5$linoleic, m5$eicosenoic, col = res$cluster);
points(res$centers[, 1], res$centers[, 2], col = 1:2, pch = 8, cex = 2)

Page 28: Hierarchical clustering

• Agglomerative

  – Place each point into its own cluster
  – Merge the nearest clusters until one cluster remains
• What does it mean that two objects are close?
  – A measure of proximity (e.g., for quantitative variables, Euclidean distance)
• Similarity measure s_rs (= 1 if same object, < 1 otherwise)
  – Example: correlation
• Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise)
  – Example: Euclidean distance

Page 29: Hierarchical clustering

• hclust(d, method = "complete", members = NULL)
  – d: dissimilarity structure (e.g., from dist())
  – method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
  Returns: a tree showing the merging sequence
• cutree(tree, k = NULL, h = NULL)
  – k: number of clusters to make
  – h: the level at which to cut
  Returns: cluster indices

Page 30: Hierarchical clustering

• Example

x <- data.frame(m5$linolenic, m5$eicosenoic);
m5_dist <- dist(x);
m5_dend <- hclust(m5_dist, method = "complete")
plot(m5_dend);

Page 31: Hierarchical clustering

• Example

DO NOT forget to standardize!

clust <- cutree(m5_dend, k = 2);
plot(m5$linoleic, m5$eicosenoic, col = clust);
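A sketch of the standardization step: rescale each variable to zero mean and unit variance before computing distances, so that no variable dominates the dissimilarities.

x_std <- scale(x);   # zero mean, unit variance per column
m5_dend2 <- hclust(dist(x_std), method = "complete");
clust2 <- cutree(m5_dend2, k = 2);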

Page 32: Density-based clustering

• Kernel-based density estimation. Library: pdfCluster
• pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
  – x: data to be partitioned
  – h: a vector of smoothing parameters
  – hmult: shrinkage factor

library(pdfCluster)
x <- data.frame(m5$linolenic, m5$eicosenoic);
res <- pdfCluster(x);
plot(res)