The open-source statistical language R and...
-
Next Revolution Toward Open Platform
R
NexR Data Scientist Jeon Hee-Won
-
R R , R , R , R , R
The Marriage of Hadoop and R, NexR's Way for Big Data Analysis
Etc.: KRUG (Korean R User Group), Korea R CRAN Mirror Statistics
-
R
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
O/S and Analysis System
Bell Labs: UNIX; The S system
Commercial: BSD/System V (HP, IBM, SUN); S-PLUS
GNU/Open source: LINUX; R
Application: R Packages
-
R
S (John Chambers)
1976: Version 1, Fortran-based
1980: Version 2, UNIX
1988: Version 3, C-based, Class/Method
1998: Version 4, Java interface, Class/Method
S-PLUS
1988: StatSci
1993: with MathSoft, E-license
2001: Insightful
2005: V.7, Big data
2007: V.8, R package
2008: TIBCO
R
1993: Ross Ihaka, Robert Gentleman
1997.4.1: Mailing list
1997.4.23: CRAN
1997.12.5: GNU Project
2000.1: Version 1.0
-
R
Richard Stallman, GNU
GNU (GNU's Not Unix) Project
GPL (General Public License)
Free Software
Free Software Foundation
Organization: The R Foundation for Statistical Computing (R Development Core Team)
Distribution: The Comprehensive R Archive Network (CRAN); Windows, UNIX, OS X; 3,452 packages (2011/12/01)
Related Projects: Bioconductor (analysis of genomic data), more than 460 packages
-
R
Interpreter Language
> tot = 0
> for (i in 1:10) {
+   tot = tot + i
+ }
> print(tot)
[1] 55
> sum(1:10)
[1] 55
VS
SAS
PROC FREQ OPTIONS1;
  TABLES requests / OPTIONS2;
  WEIGHT variable;
  BY variables;
Procedure SPSS
-
R -cont
Connectivity
Language Interface: C, C++, FORTRAN, Java, Python, Tcl/Tk, VB, Perl, Ruby
Application Interface: Excel, Google Earth, ArcView, COM/DCOM, etc.
DB Interface: ODBC (Oracle, MySQL, MS-SQL, PostgreSQL, ...)
IDE: RStudio, Eclipse, Emacs, Bluefish, Crimson Editor, ConTEXT, Vim, jEdit, Kate, TextMate, gedit, SciTE, WinEdt
Application Platform R
Revolution Analytics: Revolution R
IBM: Netezza Appliance DB
EMC: Greenplum Appliance DB
-
R -cont
Data Objects: Vector, Factor, Ordered factor, Matrix, List (cf. C), Data Frame (cf. DBMS Table), Array, Time Series
Vectorize: Loop; apply, lapply, tapply, outer
Vectorize
> mat = matrix(1:12, ncol=4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> apply(mat, 2, sum)
[1]  6 15 24 33
> colMeans(mat)
[1]  2  5  8 11
matrix, vector
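The vectorized style named on this slide can be compared with an explicit loop in a minimal sketch. All variable names below are illustrative and do not come from the original deck; only base R functions are used.

```r
# Sum each column of a matrix three ways: explicit loop, apply(), colSums().
mat <- matrix(1:12, ncol = 4)

# 1. Explicit loop over the columns
loop_sums <- numeric(ncol(mat))
for (j in seq_len(ncol(mat))) {
  loop_sums[j] <- sum(mat[, j])
}

# 2. apply() over the second margin (columns)
apply_sums <- apply(mat, 2, sum)

# 3. Dedicated vectorized function
vec_sums <- colSums(mat)

# lapply() returns a list with one element per input element
squares <- lapply(1:3, function(i) i^2)

# tapply() applies a function within groups defined by a grouping vector
group_means <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)

# outer() builds a table from all pairs of two vectors (product by default)
times_table <- outer(1:3, 1:3)
```

All three column-sum variants give the same numbers; the vectorized forms are shorter and usually faster, because the loop runs in compiled code rather than in the interpreter.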
-
R -cont
-
> stack.loss[1:6]
[1] 42 37 37 28 18 18
> X <- cbind(1, stack.x)
> head(X)
       Air.Flow Water.Temp Acid.Conc.
[1,] 1       80         27         89
[2,] 1       80         27         88
[3,] 1       75         25         90
[4,] 1       62         24         87
[5,] 1       62         22         87
[6,] 1       62         23         87
> solve(t(X) %*% X) %*% t(X) %*% stack.loss
                  [,1]
           -39.9196744
Air.Flow     0.7156402
Water.Temp   1.2952861
Acid.Conc.  -0.1521225
> lm(stack.loss ~ stack.x)

Call:
lm(formula = stack.loss ~ stack.x)

Coefficients:
      (Intercept)    stack.xAir.Flow  stack.xWater.Temp  stack.xAcid.Conc.
         -39.9197             0.7156             1.2953            -0.1521
/
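The agreement between the matrix formula and lm() can be checked programmatically. The sketch below is one possible way to do it, not the presenter's code. It assumes the design matrix is built with cbind(1, stack.x), and the names beta_ne and fit are illustrative.

```r
# stack.x and stack.loss ship with base R (datasets package).
X <- cbind(1, stack.x)   # prepend a column of ones for the intercept

# Normal equations: solve (X'X) b = X'y without forming the inverse explicitly
beta_ne <- solve(crossprod(X), crossprod(X, stack.loss))

# Same model through lm(), which uses a QR decomposition internally
fit <- lm(stack.loss ~ stack.x)

# Compare the two coefficient vectors up to numerical tolerance
same <- all.equal(as.vector(beta_ne), unname(coef(fit)))
print(same)
```

Solving the linear system with solve(A, b) avoids computing the inverse of X'X explicitly, which is generally the more stable choice for ill-conditioned data.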
-
R -cont
Like UNIX Command (Bell Labs)
ls, rm, grep, apropos (cf. find), vi, emacs (text editor), cat, head, tail, diff, paste, split
Hidden Objects: .
> ls(pat="^p")
[1] "pattern.features"
> apropos("sum$")
[1] "contr.sum" "cumsum"    "rowsum"    "sum"
> head(iris[,1:2], n=3)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
Bell Labs: S Language, UNIX
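A few of the UNIX-like commands listed above can be tried in a small sketch. It works in a private environment (env, an illustrative name) so the user's workspace is untouched, and the object names are made up for the example.

```r
# Create a few objects in a private environment.
env <- new.env()
assign("price", 10, envir = env)
assign("pattern.features", 1:3, envir = env)
assign(".hidden", "x", envir = env)

ls(env)                      # like `ls`: dot-prefixed (hidden) objects are omitted
ls(env, all.names = TRUE)    # like `ls -a`: also lists .hidden
ls(env, pattern = "^p")      # only names starting with "p"

rm("price", envir = env)     # like `rm`

# like `grep`: keep the strings that end in "sum"
matches <- grep("sum$", c("cumsum", "summary", "rowsum"), value = TRUE)

first_rows <- head(iris[, 1:2], n = 3)   # like `head`
last_rows  <- tail(iris[, 1:2], n = 3)   # like `tail`
```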
-
R -cont
Graphics
Graphics Devices: bmp, jpeg, png, tiff, pdf, postscript, SVG (R 2.14)
Other Support: OpenGL, Spatial (ArcView, Google Maps)
Low level Plot: points, lines, box, rect, polygon, text, title, mtext, legend, axis, grid
High level Plot: plot, barplot, boxplot, pie, qqplot; trellis (lattice package), rgl, sna, wordcloud
x
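The split between graphics devices, high-level plots and low-level additions can be shown in a short sketch. The pdf device and the iris data are chosen here only for illustration, and the output path is a temporary file.

```r
# Open a graphics device that writes to a temporary PDF file.
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)

# High-level plot: sets up the axes and draws the points
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal.Length", ylab = "Sepal.Width")

# Low-level functions add to the existing plot
grid()
lines(lowess(iris$Sepal.Length, iris$Sepal.Width), col = "red")
title(main = "iris: sepal dimensions")
legend("topright", legend = "lowess", col = "red", lty = 1)

# Close the device so the file is actually written
invisible(dev.off())
```

Swapping pdf() for png() or another device function changes the output format without touching the plotting calls.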
-
Using R
http://www.kdnuggets.com/2011/08/poll-languages-for-data-mining-analytics.html
http://blog.revolutionanalytics.com/2011/11/r-still-the-preferred-tool-of-predictive-modelers-competing-at-kaggle.html
R / .
-
R Packaging System
R
R ftp://cran.r-project.org/incoming [email protected]
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
-
- The Data Flood
-
- -
"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time." (Wikipedia)
" " .
+
(Data Scientist) ... Hadoop R (?)
-
(Data Science)
+ + +
.
.
" ".
http://benfry.com/phd/dissertation-050312b-acrobat.pdf
-
Why R in Big Data analysis?
?
, ? R . (SAS, SPSS, numpy??)
Hadoop Paper(or book) + R Packages ->
R R core (, ) (Revolution Analytics)
-
R /
ff, bigmemory, RevoScaleR GB 10GB
gc(), rm() 32 , 2^31-1
R 2.15 2^51 No int64
int64 package from Google
64bit
Single Core CPU 1 . R 2.14 parallel
TB
TB
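Two of the limits on this slide, the 2^31-1 integer ceiling and single-core execution, can be observed in a short sketch. The variable names are illustrative. The core count of 2 is an assumption for the example, and it falls back to 1 on Windows, where forking is unavailable.

```r
# R's integer type is 32-bit: the largest value is 2^31 - 1.
int_max <- .Machine$integer.max
int_max == 2^31 - 1

# Going past the limit yields NA (with a warning), not a wrapped value.
overflowed <- suppressWarnings(int_max + 1L)
is.na(overflowed)

# Doubles hold whole numbers exactly well beyond that range (up to 2^53).
big <- 2^31            # stored as a double
big == 2147483648

# The parallel package (bundled since R 2.14) can spread work over cores.
library(parallel)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
squares <- unlist(res)
```

Packages such as ff and bigmemory, named above, address the separate problem of data that does not fit in RAM. This sketch only illustrates the integer range and the basic parallel call.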
-
Why Hadoop for Big Data Analysis?
"Hadoop has become the kernel of the distributed operating system for Big Data" (Doug Cutting, keynote at Hadoop World 2011)
-
The Marriage of Hadoop and R(1)
R , Hadoop .
.
R R shell . R map/reduce ?
R ? PMML?
-
The Marriage of Hadoop and R(2)
RHIPE (R and Hadoop Integrated Processing Environment): Purdue Univ., Saptarshi Guha; R, Hadoop MapReduce; Amazon EC2 (http://www.stat.purdue.edu/~sguha/rhipe/doc/html/ec2.html)
RHadoop: Revolution Analytics
Facebook: R + RHIPE; Guha's lecture: http://www.lecturemaker.com/2011/02/rhipe/
RHipe
-
!
Map/Reduce ?
Map/Reduce ?
Pig, Streaming, native map/reduce with Java, RHIPE, RHadoop, ....
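For readers new to the map/reduce model that these tools expose, the idea can be sketched in plain base R on an in-memory vector. This is only a conceptual illustration: it uses neither Hadoop nor any of the packages above, and the input text and all names are made up.

```r
# Word count as map -> shuffle -> reduce, entirely in memory.
docs <- c("hadoop and r", "r and big data", "big data")

# Map: emit one (word, 1) pair per word of each input line
mapped <- unlist(lapply(docs, function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  setNames(rep(1L, length(words)), words)
}))

# Shuffle: group the emitted values by key (the word)
grouped <- split(unname(mapped), names(mapped))

# Reduce: sum the values for each key
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data across the network, which is the part the frameworks listed on this slide take care of.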
-
NexR's Way for Big Data Analysis
Map/Reduce for data analysis? .
SQL for data analysis! . .
select * from foo;
-
RHive Sample Flight Delay Prediction
library(RHive)
rhive.connect("127.0.0.1")
# get a training data set from Hive
trainset
-
R
-
KRUG
KRUG (Korean R Users Group)
GNU , R 2007 1
http://www.r-project.kr/
http://www.openstatistics.net
R User Conference Online : , , Q&A
Offline : Meetup
: /White paper/Blog /
-
Korea R CRAN Mirror Statistics
6 24,000
3,452 R 3,392
-
Q & A
http://[email protected]