An Introduction to Data Mining with R

Yanchang Zhao

http://www.RDataMining.com

6 September 2013

Questions

- Are you familiar with data mining and its techniques?
- Have you used R before?
- Have you used R in your data mining research or projects?

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

What is R?

- R [1] is a free software environment for statistical computing and graphics.
- R can be easily extended with 4,728 packages available on CRAN [2] (as of 6 September 2013).
- Many other packages are provided on Bioconductor [3], R-Forge [4], GitHub [5], etc.
- R manuals on CRAN [6]:
  - An Introduction to R
  - The R Language Definition
  - R Data Import/Export
  - ...

[1] http://www.r-project.org/
[2] http://cran.r-project.org/
[3] http://www.bioconductor.org/
[4] http://r-forge.r-project.org/
[5] https://github.com/
[6] http://cran.r-project.org/manuals.html

Why R?

- R is widely used in both academia and industry.
- R was again ranked no. 1 in the KDnuggets 2013 poll on top languages for analytics, data mining and data science [7].
- The CRAN Task Views [8] provide collections of packages for different tasks:
  - Machine learning & statistical learning
  - Cluster analysis & finite mixture models
  - Time series analysis
  - Multivariate statistics
  - Analysis of spatial data
  - ...

[7] http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html
[8] http://cran.r-project.org/web/views/

Classification with R

- Decision trees: rpart, party
- Random forest: randomForest, party
- SVM: e1071, kernlab
- Neural networks: nnet, neuralnet, RSNNS
- Performance evaluation: ROCR
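
As a hedged illustration of the first entry in the list above, the sketch below fits a decision tree with rpart, which normally ships with R as a recommended package. Fitting on the full iris data and the object names (fit, pred, conf.mat, acc) are assumptions made for illustration; they do not come from the slides.

```r
# Minimal sketch: a classification tree with rpart on the built-in iris data.
# Object names are illustrative assumptions, not taken from the slides.
library(rpart)

# Species is the class label; "." uses all remaining columns as predictors
fit <- rpart(Species ~ ., data = iris, method = "class")

# predicted class labels for the same data the tree was fitted on
pred <- predict(fit, newdata = iris, type = "class")

# confusion matrix and training accuracy (optimistic: no hold-out set is used)
conf.mat <- table(predicted = pred, actual = iris$Species)
acc <- sum(diag(conf.mat)) / sum(conf.mat)
print(conf.mat)
print(acc)
```

The training accuracy printed here is not a fair estimate of performance on new data, which is why the following slides split iris into training and test sets.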

The Iris Dataset

# iris data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]
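
Because sample(2, ...) assigns each row independently, the 70/30 proportions above hold only approximately. A minimal sketch of an alternative that fixes the training set size exactly is given below; the seed and the names train.idx, train and test are assumptions made for illustration.

```r
# Sketch: an exact 70/30 split by sampling row indices without replacement.
set.seed(1234)                       # any seed works; used only for reproducibility
n <- nrow(iris)
train.idx <- sample(seq_len(n), size = round(0.7 * n))

train <- iris[train.idx, ]           # 70% of the rows
test  <- iris[-train.idx, ]          # the remaining 30%

# number of rows in each part
c(train = nrow(train), test = nrow(test))
```

With 150 rows this gives 105 training rows and 45 test rows, and no row appears in both parts.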

Build a Decision Tree

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width +
                Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)

plot(iris.ctree)

[Figure: conditional inference tree for the iris training data. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes: node 2 (n = 40), node 5 (n = 21), node 6 (n = 19) and node 7 (n = 32), each drawn as a bar plot of class proportions.]

Prediction

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         2
##   virginica       0          0        14
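
The table above can be summarised in a single accuracy figure. The sketch below re-enters the printed counts by hand (rows are predictions, columns are true classes) and computes accuracy and per-class recall; the names conf.mat, accuracy and recall are illustrative assumptions.

```r
# Sketch: accuracy and recall from the confusion matrix printed above.
# The counts are copied from the slide output, not recomputed.
classes <- c("setosa", "versicolor", "virginica")
conf.mat <- matrix(c(10,  0,  0,
                      0, 12,  2,
                      0,  0, 14),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pred = classes, actual = classes))

# correct predictions lie on the diagonal: 36 of 38 test cases
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
round(accuracy, 3)

# per-class recall: correct predictions divided by the true class totals
recall <- diag(conf.mat) / colSums(conf.mat)
recall
```

The only errors are two virginica flowers predicted as versicolor, so virginica recall is 14/16 while the other two classes are recovered completely.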

Clustering with R

- k-means: kmeans(), kmeansruns() [9]
- k-medoids: pam(), pamk()
- Hierarchical clustering: hclust(), agnes(), diana()
- DBSCAN: fpc
- BIRCH: birch

[9] Names followed by "()" are functions; the others are packages.
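
Hierarchical clustering is listed above but not demonstrated in the slides. A minimal sketch with base R's hclust() follows; the choice of average linkage, of three clusters, and the names d, hc and groups are assumptions made for illustration.

```r
# Sketch: agglomerative hierarchical clustering with base R (stats package).
iris2 <- iris[, -5]                      # drop the Species column
d <- dist(iris2)                         # Euclidean distances between rows
hc <- hclust(d, method = "average")      # average linkage: an illustrative choice
groups <- cutree(hc, k = 3)              # cut the dendrogram into 3 clusters

# compare the clusters with the known species
table(iris$Species, groups)
# plot(hc, labels = FALSE, hang = -1)    # uncomment to draw the dendrogram
```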

k-means Clustering

set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
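
k-means starts from randomly chosen centres, so the cluster numbering, and occasionally the partition itself, can change between runs. A hedged sketch of one common remedy, the nstart argument of kmeans(), follows; the value 25 and the name km are arbitrary assumptions.

```r
# Sketch: run k-means from several random starts and keep the best solution.
set.seed(8953)
iris2 <- iris[, -5]                      # remove class IDs

# nstart = 25 tries 25 random initialisations and returns the partition
# with the smallest total within-cluster sum of squares
km <- kmeans(iris2, centers = 3, nstart = 25)

km$size                                  # number of points in each cluster
km$tot.withinss                          # objective value of the chosen solution
table(iris$Species, km$cluster)
```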

# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
       col=1:3, pch="*", cex=5)

[Figure: scatter plot of Sepal.Width against Sepal.Length, with points coloured by cluster and the three cluster centres marked by asterisks.]

Density-based Clustering

library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33

# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)

[Figure: the clustered points plotted on two discriminant coordinates (axes "dc 1" and "dc 2"), each point drawn as its cluster number, with 0 marking outliers or noise.]

Association Rule Mining with R

- Association rules: apriori(), eclat() in package arules
- Sequential patterns: arulesSequence
- Visualisation of associations: arulesViz

The Titanic Dataset

load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201 4
idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##      Class    Sex   Age Survived
## 501    3rd   Male Adult       No

## 766   Crew   Male Adult       No
## 1485   3rd Female Adult       No
## 1388   2nd Female Adult       No
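
The file titanic.raw.rdata is not part of these slides. As an assumption, a data frame of the same shape (2201 rows, 4 columns) can be rebuilt from the Titanic contingency table that ships with R, by repeating each combination of factor levels according to its frequency. Whether the result matches the author's file exactly, including row order and row names, is not confirmed by the source.

```r
# Sketch: expand the built-in Titanic contingency table into one row per person.
df <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq

# repeat each row index Freq times, then keep only the four factor columns
titanic.raw <- df[rep(seq_len(nrow(df)), times = df$Freq),
                  c("Class", "Sex", "Age", "Survived")]
rownames(titanic.raw) <- NULL  # reset row names to 1..n

dim(titanic.raw)
summary(titanic.raw)
```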



Association Rule Mining

# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw, control=list(verbose=FALSE),
                 parameter=list(minlen=2, supp=0.005, conf=0.8),
                 appearance=list(rhs=c("Survived=No", "Survived=Yes"),
                                 default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)

#    lhs                                      rhs             support confidence lift
# 1  {Class=2nd, Age=Child}                => {Survived=Yes}  0.011   1.000      3.096
# 2  {Class=2nd, Sex=Female, Age=Child}    => {Survived=Yes}  0.006   1.000      3.096
# 3  {Class=1st, Sex=Female}               => {Survived=Yes}  0.064   0.972      3.010
# 4  {Class=1st, Sex=Female, Age=Adult}    => {Survived=Yes}  0.064   0.972      3.010
# 5  {Class=2nd, Sex=Male, Age=Adult}      => {Survived=No}   0.070   0.917      1.354
# 6  {Class=2nd, Sex=Female}               => {Survived=Yes}  0.042   0.877      2.716
# 7  {Class=Crew, Sex=Female}              => {Survived=Yes}  0.009   0.870      2.692
# 8  {Class=Crew, Sex=Female,
# 9  {Class=2nd, Sex=Male}                 => {Survived=No}   0.070   0.860      1.271
# 10 {Class=2nd, Sex=Female,
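
The three quality measures in the listing can be checked by hand. The sketch below recomputes support, confidence and lift for rule 1, {Class=2nd, Age=Child} => {Survived=Yes}, from the Titanic table built into R; using that table in place of titanic.raw, and the variable names, are assumptions made for illustration.

```r
# Sketch: support, confidence and lift of one rule, computed from counts.
# Titanic is a 4-way table with dimensions Class x Sex x Age x Survived.
n        <- sum(Titanic)                          # all 2201 people
n.lhs    <- sum(Titanic["2nd", , "Child", ])      # 2nd-class children
n.both   <- sum(Titanic["2nd", , "Child", "Yes"]) # ... who survived
n.rhs    <- sum(Titanic[, , , "Yes"])             # all survivors

support    <- n.both / n                 # P(lhs and rhs)
confidence <- n.both / n.lhs             # P(rhs | lhs)
lift       <- confidence / (n.rhs / n)   # confidence relative to P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 3)
```

The rounded values agree with the first row of the listing: support 0.011, confidence 1.000 and lift 3.096.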


library(arulesViz)
plot(rules, method = "graph")

[Figure: "Graph for 12 rules". Each rule antecedent, from {Class=1st,Sex=Female} to {Class=Crew,Sex=Female,Age=Adult}, is linked to {Survived=No} or {Survived=Yes}. Width: support (0.006 to 0.192); colour: lift (1.222 to 3.096).]

Text Mining with R

- Text mining: tm
- Topic modelling: topicmodels, lda
- Word cloud: wordcloud
- Twitter data access: twitteR

Fetch Twitter Data

## retrieve tweets from the user timeline of @rdatamining
library(twitteR)
# tweets <- userTimeline('rdatamining')
load(file = "./data/rdmTweets.RData")
(nDocs <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 50)
## [1] "An R Reference Card for Data Mining is now"
## [2] "available on CRAN. It lists many useful R"
## [3] "functions and packages for data mining"
## [4] "applications."
# convert tweets to a data frame
df <- do.call("rbind", lapply(tweets, as.data.frame))

Text Cleaning

# note: with tm 0.6 or later, plain functions such as tolower and removeURL
# need to be wrapped, e.g. tm_map(myCorpus, content_transformer(tolower))
library(tm)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation & numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs (punctuation is already gone, so a URL is one alphanumeric run)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
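
To see what the cleaning steps above do to a single text, the sketch below applies base R equivalents to one made-up sentence, without the tm package. The sample sentence and the final whitespace tidy-up are assumptions added for illustration; the regular expressions only approximate tm's removePunctuation() and removeNumbers().

```r
# Sketch: the same cleaning sequence on one string, using base R only.
x <- "Check the R Reference Card at http://www.rdatamining.com, updated in 2013!"

x <- tolower(x)                          # convert to lower case
x <- gsub("[[:punct:]]", "", x)          # remove punctuation (URL becomes one token)
x <- gsub("[[:digit:]]", "", x)          # remove numbers
x <- gsub("http[[:alnum:]]*", "", x)     # remove the collapsed URL token
x <- gsub("\\s+", " ", trimws(x))        # tidy leftover whitespace (extra step)
x
```

The order matters: the URL pattern only matches the whole address because punctuation was stripped first, which turns the address into a single alphanumeric run starting with "http".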

Stemming

# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
                   dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners",
                   replacement="mining")
strwrap(myCorpus[320], width=50)
## [1] "r reference card data mining now available cran"
## [2] "list used r functions package data mining"
## [3] "applications"

Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,
                            control=list(wordLengths=c(1,Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))
## [1] "analysis" "big" "computing"
## [4] "data" "examples" "mining"
## [7] "network" "package" "position"
## [10] "postdoctoral" "r" "research"
## [13] "slides" "social" "tutorial"
## [16] "university" "used"
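
A term-document matrix simply counts how often each term occurs in each document, and the frequent terms are those whose row sums reach a threshold. The base R sketch below builds such a matrix for three made-up documents; the documents, the threshold of 2 and all object names are assumptions, and the code is a simplified stand-in for TermDocumentMatrix() and findFreqTerms(), not their implementation.

```r
# Sketch: a tiny term-document matrix and its frequent terms, in base R.
docs <- c("r data mining", "data mining with r", "big data")

tokens <- strsplit(docs, " ", fixed = TRUE)       # one character vector per document
terms  <- sort(unique(unlist(tokens)))            # vocabulary

# one column per document, one row per term, cells hold term counts
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = terms))),
              integer(length(terms)))
dimnames(tdm) <- list(Terms = terms, Docs = seq_along(docs))
tdm

# terms occurring at least twice over all documents
freq.terms <- names(which(rowSums(tdm) >= 2))
freq.terms
```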

Associations

# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
## examples     code  package
##     0.32     0.29     0.20
# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##     data    mahout recommendation     sets
##     0.47      0.30           0.30     0.30
## supports  frequent        itemset
##     0.30      0.26           0.26
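
The association scores above are correlations between the rows of the term-document matrix. The sketch below shows that idea on four made-up documents using base R's cor(); the documents, the 0.2 cut-off and the object names are assumptions, and the code illustrates the principle rather than the internals of findAssocs().

```r
# Sketch: term associations as correlations between term occurrence vectors.
docs <- c("r data mining", "r mining examples", "big data", "r examples")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# binary term-document matrix: 1 if the term occurs in the document
tdm <- vapply(tokens, function(tk) as.integer(terms %in% tk),
              integer(length(terms)))
rownames(tdm) <- terms

# Pearson correlation of "r" with every other term, across documents
others <- setdiff(terms, "r")
assoc  <- sapply(others, function(term) cor(tdm["r", ], tdm[term, ]))

# keep terms whose correlation with "r" is at least 0.2
round(assoc[assoc >= 0.2], 2)
```

Terms that tend to appear in the same documents as "r" get a positive score, while "big", which only occurs where "r" is absent, gets a correlation of -1.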

Network of Terms

library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=TRUE)

[Figure: network of the 17 frequent terms (analysis, big, computing, data, examples, mining, network, package, position, postdoctoral, r, research, slides, social, tutorial, university, used), with edges between correlated terms.]

Word Cloud

library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=TRUE)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=FALSE)

[Figure: word cloud of the terms in myTdm that occur at least four times.]

Topic Modelling

library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##      Topic 1       Topic 2    Topic 3  Topic 4
## [1,] "data"        "r"        "r"      "research"
## [2,] "mining"      "package"  "time"   "position"
## [3,] "big"         "examples" "series" "data"
## [4,] "association" "used"     "users"  "university"
## [5,] "rules"       "code"     "talk"   "postdoctoral"
##      Topic 5     Topic 6     Topic 7    Topic 8
## [1,] "mining"    "group"     "data"     "analysis"
## [2,] "data"      "data"      "r"        "network"
## [3,] "slides"    "used"      "mining"   "social"
## [4,] "modelling" "software"  "analysis" "text"
## [5,] "tools"     "kdnuggets" "book"     "slides"

Time Series Analysis with R

- Time series decomposition: decomp(), decompose(), arima(), stl()
- Time series forecasting: forecast
- Time series clustering: TSclust
- Dynamic Time Warping (DTW): dtw
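
Two of the decomposition functions listed above, decompose() and stl(), are part of base R. The sketch below applies both to the built-in AirPassengers series; the choice of series, the multiplicative model, the log transform and the object names are assumptions made for illustration.

```r
# Sketch: classical and loess-based decomposition of a monthly series (base R).
ap <- AirPassengers                      # monthly airline passenger totals, built into R

# classical decomposition; a multiplicative model suits a seasonal swing
# that grows with the level of the series
dec <- decompose(ap, type = "multiplicative")
round(dec$figure, 2)                     # the 12 estimated seasonal factors

# STL is additive, so the series is log-transformed first
fit <- stl(log(ap), s.window = "periodic")
head(fit$time.series)                    # columns: seasonal, trend, remainder
# plot(dec); plot(fit)                   # uncomment to view the components
```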

Social Network Analysis with R

- Packages: igraph, sna
- Centrality measures: degree(), betweenness(), closeness(), transitivity()
- Clusters: clusters(), no.clusters()
- Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
- Community detection: fastgreedy.community(), spinglass.community()
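
The simplest of the centrality measures above, degree, is just the number of edges attached to each node. The base R sketch below computes it from an adjacency matrix for a made-up undirected graph of four nodes; the edge list and object names are assumptions, and igraph's degree() is expected to report the same counts for the same graph.

```r
# Sketch: degree centrality from an adjacency matrix, without extra packages.
# Each row of "edges" is one undirected edge between two named nodes.
edges <- rbind(c("a", "b"),
               c("a", "c"),
               c("b", "c"),
               c("c", "d"))

nodes <- sort(unique(as.vector(edges)))
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))

# a two-column character matrix indexes cells by (row name, column name)
adj[edges] <- 1L                 # edge from first to second node
adj[edges[, 2:1]] <- 1L          # and back, since the graph is undirected

deg <- rowSums(adj)              # degree = number of neighbours
deg
```

Node "c" is connected to all three other nodes and therefore has the highest degree.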

R and Hadoop

- Packages: RHadoop, RHive
- RHadoop [10] is a collection of three R packages:
  - rmr2: perform data analysis with R via MapReduce on a Hadoop cluster
  - rhdfs: connect to the Hadoop Distributed File System (HDFS)
  - rhbase: connect to the NoSQL HBase database
- You can play with it on a single PC (in standalone or pseudo-distributed mode), and the code developed there will also work on a cluster of PCs (in fully distributed mode).
- Step by step to set up my first R Hadoop system: http://www.rdatamining.com/tutorials/rhadoop

[10] https://github.com/RevolutionAnalytics/RHadoop/wiki

An Example of MapReducing with R

library(rmr2)
map <- function(k, lines) {
  words.list <- strsplit(lines, "\\s")
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)

[11] Example from Jeffrey Breen's presentation on Using R with Hadoop: http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
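
The logic of the word count above can be tried without a Hadoop cluster. The sketch below mimics the map and reduce steps in plain R on three made-up lines; it does not use rmr2, and the input lines and object names are assumptions made for illustration.

```r
# Sketch: the word-count logic in plain R, with no Hadoop or rmr2 involved.
lines <- c("hello hadoop", "hello r", "r and hadoop")

# "map" step: split lines into words and pair each word with the value 1
words <- unlist(strsplit(lines, "\\s+"))
ones  <- rep(1, length(words))

# "reduce" step: sum the values of all pairs that share the same key (word)
counts <- tapply(ones, words, sum)
counts
```

On a cluster the pairs emitted by the map step are grouped by key across machines before the reduce function sums them; here tapply() does the grouping and summing in one call.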

Online Resources

- RDataMining website: http://www.rdatamining.com
  - R Reference Card for Data Mining
  - R and Data Mining: Examples and Case Studies
- RDataMining Group on LinkedIn (3100+ members): http://group.rdatamining.com
- RDataMining on Twitter (1200+ followers): http://twitter.com/rdatamining
- Free online courses: http://www.rdatamining.com/resources/courses
- Online documents: http://www.rdatamining.com/resources/onlinedocs

The End

Thanks!

Email: yanchang(at)rdatamining.com

http://www.rdatamining.com/books/rdm

http://www.rdatamining.com/books/dmar
