an introduction to data mining with r

45
An Introduction to Data Mining with R Yanchang Zhao http://www.RDataMining.com 6 September 2013 1 / 43

Upload: yanchang-zhao

Post on 27-Jan-2015

126 views

Category:

Technology


11 download

DESCRIPTION

Slides of a talk on Introduction to Data Mining with R at University of Canberra, Sept 2013

TRANSCRIPT

Page 1: An Introduction to Data Mining with R

An Introduction to Data Mining with R

Yanchang Zhao

http://www.RDataMining.com

6 September 2013

1 / 43

Page 2: An Introduction to Data Mining with R

Questions

I Do you know data mining and techniques for it?

I Have you used R before?

I Have you used R in your data mining research or projects?

2 / 43

Page 3: An Introduction to Data Mining with R

Questions

I Do you know data mining and techniques for it?

I Have you used R before?

I Have you used R in your data mining research or projects?

2 / 43

Page 4: An Introduction to Data Mining with R

Questions

I Do you know data mining and techniques for it?

I Have you used R before?

I Have you used R in your data mining research or projects?

2 / 43

Page 5: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

3 / 43

Page 6: An Introduction to Data Mining with R

What is R?

I R 1 is a free software environment for statistical computingand graphics.

I R can be easily extended with 4,728 packages available onCRAN2 (as of Sept 6, 2013).

I Many other packages provided on Bioconductor3, R-Forge4,GitHub5, etc.

I R manuals on CRAN6

I An Introduction to RI The R Language DefinitionI R Data Import/ExportI . . .

1http://www.r-project.org/2http://cran.r-project.org/3http://www.bioconductor.org/4http://r-forge.r-project.org/5https://github.com/6http://cran.r-project.org/manuals.html

4 / 43

Page 7: An Introduction to Data Mining with R

Why R?

I R is widely used in both academia and industry.

I R is ranked no. 1 again in the KDnuggets 2013 poll on TopLanguages for analytics, data mining, data science7.

I The CRAN Task Views 8 provide collections of packages fordifferent tasks.

I Machine learning & atatistical learningI Cluster analysis & finite mixture modelsI Time series analysisI Multivariate statisticsI Analysis of spatial dataI . . .

7http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html

8http://cran.r-project.org/web/views/

5 / 43

Page 8: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

6 / 43

Page 9: An Introduction to Data Mining with R

Classification with R

I Decision trees: rpart, party

I Random forest: randomForest, party

I SVM: e1071, kernlab

I Neural networks: nnet, neuralnet, RSNNS

I Performance evaluation: ROCR

7 / 43

Page 10: An Introduction to Data Mining with R

The Iris Dataset

# iris data

str(iris)

## 'data.frame': 150 obs. of 5 variables:

## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# split into training and test datasets

set.seed(1234)

ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))

iris.train <- iris[ind==1, ]

iris.test <- iris[ind==2, ]

8 / 43

Page 11: An Introduction to Data Mining with R

Build a Decision Tree

# build a decision tree

library(party)

iris.formula <- Species ~ Sepal.Length + Sepal.Width +

Petal.Length + Petal.Width

iris.ctree <- ctree(iris.formula, data=iris.train)

9 / 43

Page 12: An Introduction to Data Mining with R

plot(iris.ctree)

Petal.Lengthp < 0.001

1

≤ 1.9 > 1.9

Node 2 (n = 40)

setosa

0

0.2

0.4

0.6

0.8

1

Petal.Widthp < 0.001

3

≤ 1.7 > 1.7

Petal.Lengthp = 0.026

4

≤ 4.4 > 4.4

Node 5 (n = 21)

setosa

0

0.2

0.4

0.6

0.8

1

Node 6 (n = 19)

setosa

0

0.2

0.4

0.6

0.8

1

Node 7 (n = 32)

setosa

0

0.2

0.4

0.6

0.8

1

10 / 43

Page 13: An Introduction to Data Mining with R

Prediction

# predict on test data

pred <- predict(iris.ctree, newdata = iris.test)

# check prediction result

table(pred, iris.test$Species)

##

## pred setosa versicolor virginica

## setosa 10 0 0

## versicolor 0 12 2

## virginica 0 0 14

11 / 43

Page 14: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

12 / 43

Page 15: An Introduction to Data Mining with R

Clustering with R

I k-means: kmeans(), kmeansruns()9

I k-medoids: pam(), pamk()

I Hierarchical clustering: hclust(), agnes(), diana()

I DBSCAN: fpc

I BIRCH: birch

9Functions are followed with “()”, and others are packages.13 / 43

Page 16: An Introduction to Data Mining with R

k-means Clustering

set.seed(8953)

iris2 <- iris

# remove class IDs

iris2$Species <- NULL

# k-means clustering

iris.kmeans <- kmeans(iris2, 3)

# check result

table(iris$Species, iris.kmeans$cluster)

##

## 1 2 3

## setosa 0 50 0

## versicolor 2 0 48

## virginica 36 0 14

14 / 43

Page 17: An Introduction to Data Mining with R

# plot clusters and their centers

plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)

points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],

col=1:3, pch="*", cex=5)

4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

2.0

2.5

3.0

3.5

4.0

Sepal.Length

Sep

al.W

idth

**

*

15 / 43

Page 18: An Introduction to Data Mining with R

Density-based Clustering

library(fpc)

iris2 <- iris[-5] # remove class IDs

# DBSCAN clustering

ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

# compare clusters with original class IDs

table(ds$cluster, iris$Species)

##

## setosa versicolor virginica

## 0 2 10 17

## 1 48 0 0

## 2 0 37 0

## 3 0 3 33

16 / 43

Page 19: An Introduction to Data Mining with R

# 1-3: clusters; 0: outliers or noise

plotcluster(iris2, ds$cluster)

1

1

1

1

1

11

11

1

11

1

1

1

11

1

1

1

1

10

1

11

1

11

11

11

1

1

111 1

1

1

0

1

1

1

11

11

1

2

2

222

2

2

0

2

2

0

2

0

2

0

2

2

2

02

3

2

3

2

2

2

2

22

02

2

23

2

0

2

0

2

2

2

2

20

22

2

2

0

2

0

3

3

3

3

0

0

0

0

0

3

3

33

0

3

3

0

00

33

0

3

3

0

33

3

0

0

0

3

3

0

0

3

3

3 3

33

3

3

3

3

3

3

3

3

−8 −6 −4 −2 0 2

−2

−1

01

23

dc 1

dc 2

17 / 43

Page 20: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

18 / 43

Page 21: An Introduction to Data Mining with R

Association Rule Mining with R

I Association rules: apriori(), eclat() in package arules

I Sequential patterns: arulesSequence

I Visualisation of associations: arulesViz

19 / 43

Page 22: An Introduction to Data Mining with R

The Titanic Dataset

load("./data/titanic.raw.rdata")

dim(titanic.raw)

## [1] 2201 4

idx <- sample(1:nrow(titanic.raw), 8)

titanic.raw[idx, ]

## Class Sex Age Survived

## 501 3rd Male Adult No

## 477 3rd Male Adult No

## 674 3rd Male Adult No

## 766 Crew Male Adult No

## 1485 3rd Female Adult No

## 1388 2nd Female Adult No

## 448 3rd Male Adult No

## 590 3rd Male Adult No

20 / 43

Page 23: An Introduction to Data Mining with R

Association Rule Mining

# find association rules with the APRIORI algorithm

library(arules)

rules <- apriori(titanic.raw, control=list(verbose=F),

parameter=list(minlen=2, supp=0.005, conf=0.8),

appearance=list(rhs=c("Survived=No", "Survived=Yes"),

default="lhs"))

# sort rules

quality(rules) <- round(quality(rules), digits=3)

rules.sorted <- sort(rules, by="lift")

# have a look at rules

# inspect(rules.sorted)

21 / 43

Page 24: An Introduction to Data Mining with R

# lhs rhs support confidence lift

# 1 {Class=2nd,# Age=Child} => {Survived=Yes} 0.011 1.000 3.096

# 2 {Class=2nd,# Sex=Female,

# Age=Child} => {Survived=Yes} 0.006 1.000 3.096

# 3 {Class=1st,# Sex=Female} => {Survived=Yes} 0.064 0.972 3.010

# 4 {Class=1st,# Sex=Female,

# Age=Adult} => {Survived=Yes} 0.064 0.972 3.010

# 5 {Class=2nd,# Sex=Male,

# Age=Adult} => {Survived=No} 0.070 0.917 1.354

# 6 {Class=2nd,# Sex=Female} => {Survived=Yes} 0.042 0.877 2.716

# 7 {Class=Crew,# Sex=Female} => {Survived=Yes} 0.009 0.870 2.692

# 8 {Class=Crew,# Sex=Female,

# Age=Adult} => {Survived=Yes} 0.009 0.870 2.692

# 9 {Class=2nd,# Sex=Male} => {Survived=No} 0.070 0.860 1.271

# 10 {Class=2nd,# Sex=Female,

# Age=Adult} => {Survived=Yes} 0.036 0.860 2.663

22 / 43

Page 25: An Introduction to Data Mining with R

library(arulesViz)

plot(rules, method = "graph")

Graph for 12 rules

{Class=1st,Sex=Female,Age=Adult}

{Class=1st,Sex=Female}

{Class=2nd,Age=Child}

{Class=2nd,Sex=Female,Age=Adult}

{Class=2nd,Sex=Female,Age=Child}

{Class=2nd,Sex=Female}

{Class=2nd,Sex=Male,Age=Adult}

{Class=2nd,Sex=Male}

{Class=3rd,Sex=Male,Age=Adult}

{Class=3rd,Sex=Male}

{Class=Crew,Sex=Female,Age=Adult}

{Class=Crew,Sex=Female}

{Survived=No}

{Survived=Yes}

width: support (0.006 − 0.192)color: lift (1.222 − 3.096)

23 / 43

Page 26: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

24 / 43

Page 27: An Introduction to Data Mining with R

Text Mining with R

I Text mining: tm

I Topic modelling: topicmodels, lda

I Word cloud: wordcloud

I Twitter data access: twitteR

25 / 43

Page 28: An Introduction to Data Mining with R

Fetch Twitter Data

## retrieve tweets from the user timeline of @rdatammining

library(twitteR)

# tweets <- userTimeline('rdatamining')

load(file = "./data/rdmTweets.RData")

(nDocs <- length(tweets))

## [1] 320

strwrap(tweets[[320]]$text, width = 50)

## [1] "An R Reference Card for Data Mining is now"

## [2] "available on CRAN. It lists many useful R"

## [3] "functions and packages for data mining"

## [4] "applications."

# convert tweets to a data frame

df <- do.call("rbind", lapply(tweets, as.data.frame))

26 / 43

Page 29: An Introduction to Data Mining with R

Text Cleaning

library(tm)

# build a corpus

myCorpus <- Corpus(VectorSource(df$text))

# convert to lower case

myCorpus <- tm_map(myCorpus, tolower)

# remove punctuation & numbers

myCorpus <- tm_map(myCorpus, removePunctuation)

myCorpus <- tm_map(myCorpus, removeNumbers)

# remove URLs

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

myCorpus <- tm_map(myCorpus, removeURL)

# remove 'r' and 'big' from stopwords

myStopwords <- setdiff(stopwords("english"), c("r", "big"))

# remove stopwords

myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

27 / 43

Page 30: An Introduction to Data Mining with R

Stemming

# keep a copy of corpus

myCorpusCopy <- myCorpus

# stem words

myCorpus <- tm_map(myCorpus, stemDocument)

# stem completion

myCorpus <- tm_map(myCorpus, stemCompletion,

dictionary = myCorpusCopy)

# replace "miners" with "mining", because "mining" was

# first stemmed to "mine" and then completed to "miners"

myCorpus <- tm_map(myCorpus, gsub, pattern="miners",

replacement="mining")

strwrap(myCorpus[320], width=50)

## [1] "r reference card data mining now available cran"

## [2] "list used r functions package data mining"

## [3] "applications"

28 / 43

Page 31: An Introduction to Data Mining with R

Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,

control=list(wordLengths=c(1,Inf)))

# inspect frequent words

(freq.terms <- findFreqTerms(myTdm, lowfreq=20))

## [1] "analysis" "big" "computing"

## [4] "data" "examples" "mining"

## [7] "network" "package" "position"

## [10] "postdoctoral" "r" "research"

## [13] "slides" "social" "tutorial"

## [16] "university" "used"

29 / 43

Page 32: An Introduction to Data Mining with R

Associations

# which words are associated with 'r'?

findAssocs(myTdm, "r", 0.2)

## examples code package

## 0.32 0.29 0.20

# which words are associated with 'mining'?

findAssocs(myTdm, "mining", 0.25)

## data mahout recommendation sets

## 0.47 0.30 0.30 0.30

## supports frequent itemset

## 0.30 0.26 0.26

30 / 43

Page 33: An Introduction to Data Mining with R

Network of Terms

library(graph)

library(Rgraphviz)

plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)

analysis big computing

data

examples

mining

network

packageposition

postdoctoral r

research

slides

socialtutorial

university

used

31 / 43

Page 34: An Introduction to Data Mining with R

Word Cloud

library(wordcloud)

m <- as.matrix(myTdm)

freq <- sort(rowSums(m), decreasing=T)

wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)

rdata

mininganalysisexamplesresearch

package

position

slides usednetwork

university

postdoctoral

socialtutorial

big

computing

applications

book

codeintroduction

see group

analytics

australia

available

modellingscientist

time

association

free

job

lect

ure

online

parallel

text

clustering

coursele

arn

pdf

rules

series

talk

docu

men

t

now

programming

statistics

ausdmdetection

google

large

openoutlier

rdatamining

techniques

thanks

tools

users

amp

chapter

due

graphics

map

presentation science

software

starting

studies

vaca

ncy

via visualizing

business

call

cancase

center

cfp

data

base

deta

ils

followerfunctions

join

kdnuggets

page

poll

spatial

submission

technology

twitter

anal

yst

california

card

classification

fast

find

get

graph

handling

information

interactive

linkedin

list

machine

melbourne

notes

processing

provided

recent

reference

tried

using

videos

web

workshop

wwwrdataminingcom

access

added

canada

canberra

chin

a

clou

d

conference

datasets

distributed

dmapps

draft

events

experience

fellow

forecasting

frequent

high

ibm

industrial

knowledge

management

nd

performance

pred

ictiv

e

publ

ishe

d

search

sentiment

short

snowfall

southern

sydney

tenuretrack

top

topic

week

youtube

32 / 43

Page 35: An Introduction to Data Mining with R

Topic Modelling

library(topicmodels)

set.seed(123)

myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)

terms(myLda, 5)

## Topic 1 Topic 2 Topic 3 Topic 4

## [1,] "data" "r" "r" "research"

## [2,] "mining" "package" "time" "position"

## [3,] "big" "examples" "series" "data"

## [4,] "association" "used" "users" "university"

## [5,] "rules" "code" "talk" "postdoctoral"

## Topic 5 Topic 6 Topic 7 Topic 8

## [1,] "mining" "group" "data" "analysis"

## [2,] "data" "data" "r" "network"

## [3,] "slides" "used" "mining" "social"

## [4,] "modelling" "software" "analysis" "text"

## [5,] "tools" "kdnuggets" "book" "slides"

33 / 43

Page 36: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

34 / 43

Page 37: An Introduction to Data Mining with R

Time Series Analysis with R

I Time series decomposition: decomp(), decompose(), arima(),stl()

I Time series forecasting: forecast

I Time Series Clustering: TSclust

I Dynamic Time Warping (DTW): dtw

35 / 43

Page 38: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

36 / 43

Page 39: An Introduction to Data Mining with R

Social Network Analysis with R

I Packages: igraph, sna

I Centrality measures: degree(), betweenness(), closeness(),transitivity()

I Clusters: clusters(), no.clusters()

I Cliques: cliques(), largest.cliques(), maximal.cliques(),clique.number()

I Community detection: fastgreedy.community(),spinglass.community()

37 / 43

Page 40: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

38 / 43

Page 41: An Introduction to Data Mining with R

R and Hadoop

I Packages: RHadoop, RHive

I RHadoop10 is a collection of 3 R packages:

I rmr2 - perform data analysis with R via MapReduce on aHadoop cluster

I rhdfs - connect to Hadoop Distributed File System (HDFS)I rhbase - connect to the NoSQL HBase database

I You can play with it on a single PC (in standalone orpseudo-distributed mode), and your code developed on thatwill be able to work on a cluster of PCs (in full-distributedmode)!

I Step by step to set up my first R Hadoop systemhttp://www.rdatamining.com/tutorials/rhadoop

10https://github.com/RevolutionAnalytics/RHadoop/wiki39 / 43

Page 42: An Introduction to Data Mining with R

An Example of MapReducing with R

library(rmr2)

map <- function(k, lines) {words.list <- strsplit(lines, "\\s")words <- unlist(words.list)

return(keyval(words, 1))

}reduce <- function(word, counts) {

keyval(word, sum(counts))

}wordcount <- function(input, output = NULL) {

mapreduce(input = input, output = output, input.format = "text",

map = map, reduce = reduce)

}## Submit job

out <- wordcount(in.file.path, out.file.path)

11

11From Jeffrey Breen’s presentation on Using R with Hadoophttp://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/

40 / 43

Page 43: An Introduction to Data Mining with R

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

41 / 43

Page 44: An Introduction to Data Mining with R

Online Resources

I RDataMining websitehttp://www.rdatamining.com

I R Reference Card for Data MiningI R and Data Mining: Examples and Case Studies

I RDataMining Group on LinkedIn (3100+ members)http://group.rdatamining.com

I RDataMining on Twitter (1200+ followers)http://twitter.com/rdatamining

I Free online courseshttp://www.rdatamining.com/resources/courses

I Online documentshttp://www.rdatamining.com/resources/onlinedocs

42 / 43

Page 45: An Introduction to Data Mining with R

The End

Thanks!

Email: yanchang(at)rdatamining.com

43 / 43