R and Data Mining
Meiwei Shuqian (美味书签, AVOS China), Yang Chaozhong (杨朝中)
R and Data Mining
● Introduction to the R language
● Text mining framework in R
● High Performance Computing in R
● Network analysis in R
● Statistical graphics
Introduction to the R language
● Statistical computing: object types, statistical analysis models
● CRAN (Comprehensive R Archive Network)
Object types
● Vectors (vector)
● Factors (factor)
● Arrays and matrices (array and matrix)
● Data frames and lists (data.frame and list)
● Functions (function)
Vector (vector)
> test.vector = c(1:100)
> test.vector
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
 [23]  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44
 [45]  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66
 [67]  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88
 [89]  89  90  91  92  93  94  95  96  97  98  99 100
> test.vector[3]
[1] 3
> test.vector[1]
[1] 1
> sum(test.vector)
[1] 5050
> mean(test.vector)
[1] 50.5
> var(test.vector)
[1] 841.6667
> sd(test.vector)
[1] 29.01149
Factor (factor)
> test.factor = factor(c(1,1,2,2,2,3,3,3,4,4,1,1,4,4))
> test.factor
 [1] 1 1 2 2 2 3 3 3 4 4 1 1 4 4
Levels: 1 2 3 4
> levels(test.factor) = c("first","second","third","fourth")
> test.factor
 [1] first  first  second second second third  third  third  fourth fourth first  first
[13] fourth fourth
Levels: first second third fourth
> levels(test.factor) = c("a","b","c","d")
> test.factor
 [1] a a b b b c c c d d a a d d
Levels: a b c d
Array (array)
> test.array = array(rbinom(100,5,0.5),dim=c(4,5,5))
> test.array
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    2    3    1
[2,]    4    2    2    2    2
[3,]    2    1    3    3    5
[4,]    2    2    4    2    2
> test.array[,3,]
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    3    4    4    2
[2,]    2    2    2    1    1
[3,]    3    2    4    3    4
[4,]    4    3    3    1    2
> test.array[3,2,]
[1] 1 2 3 1 1
Matrix (matrix)
> test.matrix = matrix(rpois(50,5),nrow=5)
> test.matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    6    3   12    7    6    2    3    5    4     4
[2,]    2    5   11    3    1    4    7    2    5     5
[3,]    2    4    1    5    1    3    2    7    5     8
[4,]    4    7    5    8    4    5    3    2    6     2
[5,]    9   15    5    6    2    4    8    8    5     3
> t(test.matrix)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    6    2    2    4    9
 [2,]    3    5    4    7   15
 [3,]   12   11    1    5    5
 [4,]    7    3    5    8    6
 [5,]    6    1    1    4    2
 [6,]    2    4    3    5    4
 [7,]    3    7    2    3    8
 [8,]    5    2    7    2    8
 [9,]    4    5    5    6    5
[10,]    4    5    8    2    3
Matrix (matrix)
> test.matrix = matrix(runif(25,min=1,max=5),nrow=5)
> test.matrix
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
> qr(test.matrix)
$qr
           [,1]        [,2]       [,3]       [,4]        [,5]
[1,] -8.0591276 -6.30550129 -7.7768280 -9.2254948 -5.94547975
[2,]  0.2545051 -2.20153679 -2.8030382 -2.2409546 -0.64008014
[3,]  0.5651229 -0.83950762 -3.5747057 -2.2750825 -1.96267828
[4,]  0.5744234 -0.15061209 -0.6607485  0.7479590  0.01142934
[5,]  0.4832462 -0.07700937 -0.6148309  0.9179222  0.06790194

$rank
[1] 5

$qraux
[1] 1.22885416 1.51634534 1.43057441 1.39676050 0.06790194
Matrix (matrix)
> svd(test.matrix)
$d
[1] 17.66944239  3.22284465  1.78184517  0.61566884  0.05156261

$u
           [,1]        [,2]       [,3]       [,4]        [,5]
[1,] -0.4285623 -0.55858839  0.1433838  0.6112554  0.33184518
[2,] -0.4207851 -0.46523651  0.3361892 -0.6261498 -0.31844658
[3,] -0.5179119  0.03462469 -0.8461578 -0.1172279 -0.02903471
[4,] -0.4722861  0.50932622  0.2777685  0.3687009 -0.55175807
[5,] -0.3846913  0.45926238  0.2707020 -0.2908960  0.69511911

$v
           [,1]        [,2]        [,3]       [,4]        [,5]
[1,] -0.4356020  0.71976143 -0.31404796 -0.1898322 -0.39690304
[2,] -0.3666388  0.23238151  0.80369243 -0.2606880  0.31256209
[3,] -0.4958375 -0.64266729 -0.01537137 -0.4151453 -0.41053867
[4,] -0.5530530 -0.10129870  0.04863968  0.8254724 -0.01001832
[5,] -0.3522846 -0.06826158 -0.50284218 -0.2055605  0.75903264
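The QR and SVD outputs above are easier to trust once the factors are multiplied back together. The following is a minimal sketch, assuming a freshly generated random matrix `m` in place of the slide's `test.matrix` (the seed and the variable names are illustrative, not from the slides):

```r
# Assumption: a small random matrix stands in for test.matrix from the slides.
set.seed(42)
m <- matrix(runif(25, min = 1, max = 5), nrow = 5)

# Singular value decomposition: m = U %*% diag(d) %*% t(V).
s <- svd(m)
m_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
reconstruction_error <- max(abs(m - m_rebuilt))

# svd() returns the singular values in decreasing order.
singular_values_sorted <- !is.unsorted(rev(s$d))

# QR decomposition: extract Q and R explicitly and compare Q %*% R with m.
# qr() may pivot columns, so compare against m with its columns in pivot order.
decomp <- qr(m)
qr_error <- max(abs(m[, decomp$pivot] - qr.Q(decomp) %*% qr.R(decomp)))
```

Both error values should be at the level of floating-point rounding. The compact `$qr` and `$qraux` fields printed on the slide are an internal encoding, which is why `qr.Q()` and `qr.R()` are used to get the actual factors.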
Matrix (matrix)
> cbind(test.matrix,rep(1,times=5))
         [,1]     [,2]     [,3]     [,4]     [,5] [,6]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706    1
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159    1
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060    1
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643    1
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016    1
> rbind(test.matrix, seq(1,2,length.out=5))
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
[6,] 1.000000 1.250000 1.500000 1.750000 2.000000
Data frame (data.frame)
> test.data.frame = data.frame(id=1:10,name=letters[1:10],age=sample(c(25,23,24),size=10,replace=TRUE))
> test.data.frame
   id name age
1   1    a  25
2   2    b  23
3   3    c  23
4   4    d  23
5   5    e  24
6   6    f  24
7   7    g  24
8   8    h  25
9   9    i  25
10 10    j  25
> test.data.frame$id
 [1]  1  2  3  4  5  6  7  8  9 10
> test.data.frame$name
 [1] a b c d e f g h i j
Levels: a b c d e f g h i j
> test.data.frame$age
 [1] 25 23 23 23 24 24 24 25 25 25
List (list)
> test.list = list(test.vector,test.factor,test.array,test.matrix,test.data.frame)
> str(test.list)
List of 5
 $ : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ : Factor w/ 4 levels "a","b","c","d": 1 1 2 2 2 3 3 3 4 4 ...
 $ : num [1:4, 1:5, 1:5] 1 4 2 2 3 2 1 2 2 2 ...
 $ : num [1:5, 1:5] 1.84 2.05 4.55 4.63 3.89 ...
 $ :'data.frame': 10 obs. of 3 variables:
  ..$ id  : int [1:10] 1 2 3 4 5 6 7 8 9 10
  ..$ name: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
  ..$ age : num [1:10] 25 23 23 23 24 24 24 25 25 25
> test.list[4]
[[1]]
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 1.844365 2.470590 4.744482 4.693239 2.597706
[2,] 2.051089 2.954349 4.807748 3.974937 2.487159
[3,] 4.554397 2.187724 4.519553 4.916905 3.988060
[4,] 4.629351 3.770774 2.992690 4.660705 2.510643
[5,] 3.894542 3.281654 2.471337 3.484586 2.115016
Function (function)
> test.function = function(x) factorial(x)
> test.function(3)
[1] 6
> lapply(test.vector[31:35],test.function)
[[1]]
[1] 8.222839e+33

[[2]]
[1] 2.631308e+35

[[3]]
[1] 8.683318e+36

[[4]]
[1] 2.952328e+38

[[5]]
[1] 1.033315e+40
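Statistical analysis models
● Regression analysis
● Analysis of variance
● Discriminant analysis
● Cluster analysis
● Principal component analysis
● Factor analysis
● Continuous-system and discrete-system simulation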
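Several of the model families listed above are available in base R without any extra packages. A minimal sketch, assuming the built-in datasets `cars`, `PlantGrowth` and `USArrests` as stand-ins (the slides do not say which data the speaker used):

```r
# Regression analysis: stopping distance as a linear function of speed.
fit <- lm(dist ~ speed, data = cars)
slope <- coef(fit)[["speed"]]

# Analysis of variance: does plant weight differ between treatment groups?
anova_fit <- aov(weight ~ group, data = PlantGrowth)

# Principal component analysis on standardised variables.
pca <- prcomp(USArrests, scale. = TRUE)
variance_share <- pca$sdev^2 / sum(pca$sdev^2)

# Cluster analysis: hierarchical clustering, cut into three groups.
hc <- hclust(dist(scale(USArrests)))
groups <- cutree(hc, k = 3)
```

`summary(fit)`, `summary(anova_fit)` and `summary(pca)` print the usual result tables for each model.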
CRAN
● CRAN Task Views
● Natural Language Processing
● Machine Learning & Statistical Learning
● High-Performance and Parallel Computing with R
● gRaphical Models in R
● Graphic displays
Text mining framework in R
UML class diagram of the 'tm' package
Text Preprocessing in R
● Data import: Corpus, PlainTextDocument, tm_map
● Chinese word segmentation: rmmseg4j
● English stemming: Rstem, Snowball, RWeka
● English sentence detection: openNLP
● English synonyms: wordnet
● Building a tf-idf weighted document-term matrix: DocumentTermMatrix, weightTfIdf
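To make the last step concrete, a tf-idf weighted document-term matrix can be built by hand in base R. This is a minimal sketch on a hypothetical three-document corpus, using the common definition (length-normalised term frequency times log2 of inverse document frequency); `weightTfIdf` in tm may differ in normalisation details:

```r
# Hypothetical toy corpus (not from the slides).
docs <- c(d1 = "r is great for data mining",
          d2 = "text mining in r",
          d3 = "parallel computing in r")

# Tokenise on whitespace and collect the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix of raw counts: one row per document, one column per term.
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Term frequency, normalised by the length of each document.
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log2(N / number of documents containing the term).
idf <- log2(nrow(dtm) / colSums(dtm > 0))

# tf-idf: scale each term column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)
```

A term that occurs in every document (here `r`) gets an idf of zero and therefore carries no weight, which is the point of the weighting.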
Preprocessing
library(tm)
library(rmmseg4j)
library(openNLP)
library(Rstem)
library(Snowball)

# Read every file in the directory as a plain-text document
cor = Corpus(DirSource("~/work/text-mining/20news-bydate-test/1000/"),
             readerControl=list(reader=readPlain))

# Segment each document into words with mmseg4j
cwsed = tm_map(cor, function(x){
  PlainTextDocument(mmseg4j(as.character(x), method="maxword"), id=ID(x))
})

# Document-term matrix weighted by tf-idf, keeping terms of any length
dtm = DocumentTermMatrix(cwsed,
                         control=list(weighting = function(x){ weightTfIdf(x) },
                                      wordLengths=c(1,Inf)))
Dimensionality reduction for text clustering
> nTerms(dtm)
[1] 103757
> dtm2 = removeSparseTerms(dtm, 0.9)
> nTerms(dtm2)
[1] 709
Clustering
km = kmeans(as.matrix(dtm2), centers=5, iter.max=10)
dbscan?
spectral clustering?
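Since the `dtm2` matrix from the slides is not available here, the `kmeans` call can be tried on synthetic data. A minimal sketch, assuming two well-separated 2-D blobs as a hypothetical stand-in for document vectors:

```r
set.seed(1)

# Hypothetical data: two tight blobs around (0, 0) and (3, 3).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# nstart > 1 reruns k-means from several random starts and keeps the best fit,
# which makes the result less sensitive to an unlucky initialisation.
km <- kmeans(x, centers = 2, iter.max = 10, nstart = 5)

cluster_sizes <- table(km$cluster)   # number of points per cluster
within_ss     <- km$tot.withinss     # total within-cluster sum of squares
```

`km$cluster` holds the cluster label of each row and `km$centers` the fitted centers; the same fields apply to the `km` object built from `dtm2`.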
Cluster validation
● Internal measures
● Stability measures
● Biological
Internal measures
● Connectivity
● Silhouette Width
● Dunn Index
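Two of these internal measures are short enough to compute by hand, which shows what they actually reward. The following is a hand-rolled sketch on hypothetical synthetic data, not the clValid implementation; it assumes every cluster has at least two points:

```r
set.seed(2)

# Hypothetical data: two tight blobs, clustered with k-means.
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
labels <- kmeans(x, centers = 2, nstart = 5)$cluster

d <- as.matrix(dist(x))   # pairwise Euclidean distances

# Silhouette width of point i: (b - a) / max(a, b), where
#   a = mean distance to the other points of its own cluster
#   b = smallest mean distance to the points of any other cluster
silhouette_width <- function(i) {
  own <- labels == labels[i]
  own[i] <- FALSE
  a <- mean(d[i, own])
  others <- setdiff(unique(labels), labels[i])
  b <- min(sapply(others, function(k) mean(d[i, labels == k])))
  (b - a) / max(a, b)
}
avg_silhouette <- mean(sapply(seq_len(nrow(x)), silhouette_width))

# Dunn index: smallest between-cluster distance divided by the
# largest within-cluster distance (cluster diameter).
same <- outer(labels, labels, "==")
dunn <- min(d[!same]) / max(d[same])
```

Both measures are larger for compact, well-separated clusters; the average silhouette width lies between -1 and 1.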
Stability measures
● Average Proportion of Non-overlap (APN)
● Average Distance (AD)
● Average Distance between Means (ADM)
● Figure of Merit (FOM)
Biological
● Biological Homogeneity Index (BHI)
● Biological Stability Index (BSI)
Cluster validation
library(tm)
library(kernlab)
library(clValid)

intern = clValid(as.matrix(dtm2), 2:10,
                 clMethods=c("hierarchical","kmeans","pam"),
                 validation="internal", maxitems=3000)
summary(intern)

op <- par(no.readonly=TRUE)
par(mfrow=c(2,2), mar=c(4,4,3,1))
plot(intern, legend=FALSE)
legend("right", clusterMethods(intern), col=1:9, lty=1:9, pch=paste(1:9))
par(op)
Text classification
● Naive Bayes
● Support vector machine (SVM)
Chih-Jen Lin (林智仁), National Taiwan University: LIBSVM (e1071)
LIBLINEAR (LiblineaR)
Evaluation and Accuracy Improvement
● Cross validation
● Bootstrap
● Ensemble Method
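Cross validation and the bootstrap both come down to resampling row indices, which base R handles directly. A minimal sketch, assuming a simple linear model on the built-in `cars` data in place of the text classifiers discussed in the slides:

```r
set.seed(3)
k <- 5

# Cross validation: assign every row to one of k folds at random.
fold <- sample(rep(seq_len(k), length.out = nrow(cars)))

# Fit on k - 1 folds, measure squared error on the held-out fold.
fold_mse <- sapply(seq_len(k), function(i) {
  train <- cars[fold != i, ]
  test  <- cars[fold == i, ]
  fit   <- lm(dist ~ speed, data = train)
  mean((test$dist - predict(fit, newdata = test))^2)
})
cv_mse <- mean(fold_mse)

# Bootstrap: resample rows with replacement, refit, and look at how much
# the estimated slope varies across resamples.
boot_slopes <- replicate(200, {
  idx <- sample(nrow(cars), replace = TRUE)
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})
slope_se <- sd(boot_slopes)
```

For a classifier, the squared error would be replaced by a misclassification rate, but the fold bookkeeping stays the same.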
High Performance Computing in R
● Parallel computing
Rmpi, snowfall, snowFT, parallel (R >= 2.14), RHadoop
● Large memory and out-of-memory data
ff, HadoopStreaming
● Easier interfaces for compiled code
Rcpp, rJava, inline
● Profiling tools
profr, proftools
RHadoop
http://www.revolutionanalytics.com/
RHadoop
● rmr2
mapreduce, from.dfs, to.dfs, keyval
● rhdfs
hdfs.file, hdfs.close, hdfs.exists, hdfs.cp, hdfs.read
● rhbase
hb.new.table, hb.delete.table, hb.insert, hb.get
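# One k-means iteration as a MapReduce job (rmr2).
# The slide names this function "k-medios.iter", which is not a valid R
# identifier, so it is written as kmeans.iter here. iter.center() is not defined
# on the slide; it is assumed to compute a new center from the points in vv.
kmeans.iter = function(points, distfun, ncenters, centers = NULL) {
  from.dfs(
    mapreduce(
      input = points,
      map = if (is.null(centers)) {
        # First pass: assign every point to a random cluster
        function(k,v) keyval(sample(1:ncenters,1), v)
      } else {
        # Later passes: key every point by its nearest center
        function(k,v) {
          distances = apply(centers, 1, function(c) distfun(c,v))
          keyval(centers[which.min(distances),], v)
        }
      },
      reduce = function(k,vv) keyval(NULL, iter.center(vv)),
      structured = T))
}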
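The same map and reduce steps can be tried locally without a Hadoop cluster. The following base R sketch mimics one k-means iteration; the function name `kmeans_iter_local`, the synthetic data and the choice of seed centers are all assumptions for illustration, and none of it uses the rmr2 API:

```r
set.seed(4)

# Hypothetical data: two tight blobs of 20 points each.
points  <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
                 matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
centers <- points[c(1, 21), , drop = FALSE]   # one seed point from each blob

kmeans_iter_local <- function(points, centers) {
  # "Map": for every point, emit the index of the nearest center as its key.
  keys <- apply(points, 1, function(p) {
    which.min(colSums((t(centers) - p)^2))
  })
  # "Reduce": average all points that share a key to get the new center.
  # Note: a center that attracts no points would simply disappear here.
  new_centers <- t(sapply(split(as.data.frame(points), keys), colMeans))
  list(keys = keys, centers = new_centers)
}

# Repeat the iteration a few times, feeding the new centers back in.
for (i in 1:5) centers <- kmeans_iter_local(points, centers)$centers
```

In the MapReduce version the grouping by key is done by the framework between the map and reduce phases; here `split()` plays that role.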
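Parallel computing
library(snowfall)
library(tm)
library(kernlab)

svm_parallel = function(dtm){
  sfInit(parallel=TRUE, cpus=4, type="MPI")
  # as.matrix() returns the matrix itself; inspect() is meant for printing
  data = as.data.frame(as.matrix(dtm))
  data$type = factor(rep(1:5, times=c(500,500,500,500,564)))
  levels(data$type) = c('sports','tech','news','education','learning')
  sub = sample(c(0,1,2,3,4), size=2564, replace=T)
  wrapper = function(x){
    if(require(kernlab)){
      ksvm(type ~., data=x)
    }
  }
  # split() gives a list with one data frame per subset; c() on data frames
  # would flatten them into a list of columns instead.
  # sfClusterApplyLB() is snowfall's load-balanced apply (the slide calls
  # sfLapplyLB, which snowfall does not appear to export).
  ksvm.models = sfClusterApplyLB(split(data, sub), wrapper)
  sfStop()
  ksvm.models
}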
Parallel computing
> library(parallel)
> cl = makeCluster(detectCores(logical=FALSE))
> parLapplyLB(cl, 46:50, test.function)
[[1]]
[1] 5.502622e+57

[[2]]
[1] 2.586232e+59

[[3]]
[1] 1.241392e+61

[[4]]
[1] 6.082819e+62

[[5]]
[1] 3.041409e+64
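The session above leaves its worker processes running. A minimal sketch of the same computation with explicit cleanup and a check against the serial result, assuming two local workers are available (the worker count is an arbitrary choice for illustration):

```r
library(parallel)

# Assumption: two local worker processes are enough for illustration.
cl <- makeCluster(2)

# factorial() is part of base R, so the workers need no extra setup.
parallel_result <- parLapplyLB(cl, 46:50, factorial)

# Release the worker processes once the work is done.
stopCluster(cl)

# The serial equivalent, for comparison.
serial_result <- lapply(46:50, factorial)
```

For a task this small the cost of starting workers and shipping data outweighs any gain; the pattern pays off when each call does substantial work.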
library(igraph)
g <- graph.full(6,directed=FALSE)
plot(g)

library(igraph)
g <- graph.ring(10,directed=FALSE)
plot(g)

library(igraph)
g <- graph.star(16, mode = c("undirected"), center = 1)
plot(g)

library(igraph)
g <- graph(c(1,2,4,5,3,4,5,6),directed=FALSE)
plot(g)

library(igraph)
M <- matrix(runif(100),nrow=10)
g <- graph.adjacency(M>0.9)
plot(g)

> M[,1:5]
            [,1]      [,2]      [,3]      [,4]      [,5]
 [1,] 0.44746867 0.9753915 0.6890068 0.8500356 0.5812459
 [2,] 0.10004725 0.9870645 0.9322102 0.6834764 0.8518852
 [3,] 0.04882503 0.1599767 0.5268769 0.7756217 0.5713700
 [4,] 0.91988082 0.4018993 0.3562261 0.7624379 0.1849250
 [5,] 0.43281897 0.6032613 0.8240209 0.3340224 0.7189334
 [6,] 0.87971431 0.9331585 0.4483813 0.4743045 0.5121772
 [7,] 0.04519996 0.1875099 0.5615725 0.5913464 0.9487314
 [8,] 0.78936780 0.6904077 0.6834867 0.2760950 0.1559759
 [9,] 0.13621689 0.5607899 0.2745078 0.7246721 0.1932709
[10,] 0.54878255 0.4730136 0.7992216 0.4186087 0.2547914
> M[,1:5] > 0.9
       [,1]  [,2]  [,3]  [,4]  [,5]
 [1,] FALSE  TRUE FALSE FALSE FALSE
 [2,] FALSE  TRUE  TRUE FALSE FALSE
 [3,] FALSE FALSE FALSE FALSE FALSE
 [4,]  TRUE FALSE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE
 [6,] FALSE  TRUE FALSE FALSE FALSE
 [7,] FALSE FALSE FALSE FALSE  TRUE
 [8,] FALSE FALSE FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE
library(igraph)
g1 <- graph.full(6, directed=FALSE)
g2 <- graph(c(6,7,7,8,8,9,9,10,9,7,11,12,12,8),directed=FALSE)
g <- graph.union(g1, g2)
plot(g)

> V(g)
Vertex sequence:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> degree(g)
 [1] 5 5 5 5 5 6 3 3 3 1 1 2
> V(g)[degree(g)>1]
Vertex sequence:
 [1]  1  2  3  4  5  6  7  8  9 12
> graph.dfs(g, 9)$order
 [1]  9  7  6  1  2  3  4  5  8 12 11 10
> graph.bfs(g, 9)$order
 [1]  9  7  8 10  6 12  1  2  3  4  5 11
Network analysis
● igraph
● graph
● network
● sna
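The degree and breadth-first results shown for the 12-vertex example graph can be checked without any of these packages. The following base R sketch rebuilds the graph from the edge lists on the slides and walks it with a plain queue; the helper name `bfs_order` is hypothetical, and visiting neighbours in increasing vertex order is an assumption that igraph is not guaranteed to share in general:

```r
# Edge list rebuilt from the slides: a full graph on vertices 1..6
# plus the extra edges of g2.
full_edges  <- t(combn(6, 2))
extra_edges <- matrix(c(6,7, 7,8, 8,9, 9,10, 9,7, 11,12, 12,8),
                      ncol = 2, byrow = TRUE)
edges <- rbind(full_edges, extra_edges)
n <- max(edges)

# Adjacency list: the neighbours of each vertex, sorted by vertex id.
adj <- lapply(seq_len(n), function(v) {
  sort(c(edges[edges[, 1] == v, 2], edges[edges[, 2] == v, 1]))
})
vertex_degree <- lengths(adj)

# Breadth-first search with an explicit FIFO queue.
bfs_order <- function(adj, root) {
  visited <- rep(FALSE, length(adj))
  visited[root] <- TRUE
  queue <- root
  visit_order <- integer(0)
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    visit_order <- c(visit_order, v)
    for (w in adj[[v]]) {
      if (!visited[w]) {
        visited[w] <- TRUE
        queue <- c(queue, w)
      }
    }
  }
  visit_order
}
order_from_9 <- bfs_order(adj, 9)
```

For this graph, `vertex_degree` and `order_from_9` agree with the `degree(g)` and `graph.bfs(g, 9)$order` outputs on the slides.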
Statistical graphics
Statistical graphics is, or should be, a transdisciplinary field informed by scientific, statistical, computing, aesthetic, psychological and sociological considerations. [Leland Wilkinson, The Grammar of Graphics]
The Grammar of Graphics
In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).
Histogram (hist)
Bar chart (barplot)
Scatter plot (plot)
> x = seq(from=-pi,to=pi,length.out=100)
> y = sin(x)
> plot(x, y, col="blue")
Probability density curve
> x = seq(from=-pi,to=pi,length.out=100)
> y = dnorm(x)
> plot(x, y, col="blue")
Filled contour plot
Scatter plot matrix
Matrix plot (matplot)
matplot(test.matrix,type="b")
Advanced plotting packages
● lattice
● ggplot2
An implementation of the grammar of graphics in R
ggplot2
● Data and Mapping
● Geom (geometric object)
● Stat (statistical transformation)
● Scale
● Coord (coordinate system)
● Facet
● Layer
ggplot2
● Test data
> str(mpg)
'data.frame': 234 obs. of 11 variables:
 $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
 $ displ       : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
 $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
 $ cty         : int 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
ggplot2
> library(ggplot2)
> p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy))
> p + geom_point()
ggplot2
> p <- ggplot(data=mpg,mapping=aes(x=cty,y=hwy,colour=factor(year)))
> p + geom_point()
ggplot2
> p + geom_point() + stat_smooth()
ggplot2
> p + geom_point(mapping=aes(size=displ)) + stat_smooth()
ggplot2
> p + geom_point(mapping=aes(size=displ)) + stat_smooth() + coord_cartesian(xlim=c(20,30),ylim=c(0,40))
ggplot2
> p + geom_point(mapping=aes(size=displ)) + stat_smooth() + facet_wrap(~year,ncol=2)
ggplot2
qplot(x,y,colour=factor(y))
ggplot2
y = sin(x) + rnorm(100)
qplot(x,y,colour=factor(y))
ggplot2
plotmatrix(data,mapping=aes(),colour="blue")
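Chinese-language R blogs
● Xiao Kai (肖凯): http://xccds1977.blogspot.jp
● Liu Sizhe (刘思喆), moderator of the R language board at Capital of Statistics (统计之都): http://cos.name/cn/
● Xie Yihui (谢益辉): http://yihui.name/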
International websites
● Data scientists on Twitter
Big Data: Experts to Follow on Twitter
● Papers and books on R: Journal of Statistical Software
● R and Data Mining
http://www.rdatamining.com/
● R-project search
http://www.rseek.org/