Open-Source Statistical Language R and Big Data Analysis


  • Next Revolution Toward Open Platform

    Open-Source Statistical Language R and Big Data Analysis

    NexR Data Scientist Jeon Hee-Won

  • Slide 2

    R

    The Marriage of Hadoop and R, NexR's Way for Big Data Analysis

    Etc.: KRUG (Korean R User Group), Korea R CRAN Mirror Statistics

  • Slide 3

    R

    R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

    O/S                            Analysis System
    UNIX                           The S system      Bell Labs
    BSD/System V (HP, IBM, SUN)    S-PLUS            Commercial
    LINUX Application              R Packages        GNU/Open source

  • Slide 4

    R

    S (John Chambers)
    1976         Version 1: Fortran-based
    1980         Version 2: UNIX
    1988         Version 3: C-based, Class/Method
    1998         Version 4: Java interface, Class/Method

    S-PLUS
    1988         StatSci
    1993         With MathSoft, E-license
    2001         Insightful
    05/V.7       Big data
    07/V.8       R package
    2008         TIBCO

    R
    1993         Ross Ihaka, Robert Gentleman
    1997.4.1     Mailing list
    1997.4.23    CRAN
    1997.12.5    GNU Project
    2000.1       Version 1.0

  • Slide 5

    R

    Richard Stallman, GNU

    GNU (GNU's Not Unix) Project

    GPL (General Public License)

    Free Software

    Free Software Foundation

    Organization: The R Foundation for Statistical Computing (R Development Core Team)

    Distribution: The Comprehensive R Archive Network (CRAN): Windows, UNIX, OS X; 3,452 packages (2011/12/01)

    Related projects: BioConductor: analysis of genomic data; more than 460 packages

  • Slide 6

    R

    Interpreter language

    > tot = 0
    > for (i in 1:10) {
    +   tot = tot + i
    + }
    > print(tot)
    [1] 55
    > sum(1:10)
    [1] 55

    VS

    Procedure-based: SAS, SPSS

    PROC FREQ OPTIONS1;
      TABLES requests / OPTIONS2;
      WEIGHT variable;
      BY variables;
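    For comparison with the PROC FREQ template, here is a rough base-R counterpart. It is only a sketch: the built-in `mtcars` data set and the choice of `wt` as a weight variable are assumptions made for illustration and do not come from the slides.

```r
# Unweighted one-way frequency table (roughly what a TABLES request produces).
counts <- table(mtcars$cyl)
print(counts)

# Weighted frequencies: sum a weight variable within each level (cf. WEIGHT).
# Using `wt` as the weight is purely illustrative.
weighted <- xtabs(wt ~ cyl, data = mtcars)
print(weighted)

# One table per group (cf. BY): a table of cyl for each value of gear.
by_gear <- lapply(split(mtcars$cyl, mtcars$gear), table)
print(by_gear)
```

    Each step is an ordinary function call whose result can be stored and reused, which is the point of the interpreter-versus-procedure contrast.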

  • Slide 7

    R (cont.)

    Connectivity

    Language interface: C, C++, FORTRAN, Java, Python, Tcl/Tk, VB, Perl, Ruby

    Application interface: Excel, Google Earth, ArcView, COM/DCOM, etc.

    DB interface: ODBC (Oracle, MySQL, MS-SQL, PostgreSQL, ...)

    IDE: RStudio, Eclipse, Emacs, Bluefish, Crimson Editor, ConTEXT, Vim, jEdit, Kate, TextMate, gedit, SciTE, WinEdt

    R as an application platform

    Revolution Analytics: Revolution R

    IBM: Netezza appliance DB

    EMC: Greenplum appliance DB

  • Slide 8

    R (cont.)

    Data objects
    Vector
    Factor
    Ordered factor
    Matrix
    List (cf. C)
    Data Frame (cf. DBMS table)
    Array
    Time Series
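    The data objects listed above can be created in base R as follows. This is a minimal sketch; all variable names and values are made up for illustration.

```r
v   <- c(1.5, 2.5, 4.0)                        # vector: elements of a single type
f   <- factor(c("b", "a", "b"))                # factor: categorical values
of  <- factor(c("low", "high", "mid"),
              levels = c("low", "mid", "high"),
              ordered = TRUE)                  # ordered factor: levels have an order
m   <- matrix(1:6, nrow = 2)                   # matrix: two dimensions, one type
a   <- array(1:24, dim = c(2, 3, 4))           # array: n-dimensional generalisation
l   <- list(id = 1L, name = "x", values = v)   # list: components of mixed types
df  <- data.frame(id = 1:3, grp = f, val = v)  # data frame: table-like, one column per variable
ts1 <- ts(c(10, 12, 9, 14),
          start = c(2011, 1), frequency = 4)   # time series: quarterly values

str(df)          # compact description of the data frame
of[1] < of[2]    # comparison uses the level order ("low" < "high")
```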

    Vectorize: replace loops with apply, lapply, tapply, outer, ...

    > mat = matrix(1:12, ncol=4)
    > mat
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12
    > apply(mat, 2, sum)
    [1]  6 15 24 33
    > colMeans(mat)
    [1]  2  5  8 11

    matrix, vector
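    The other functions named on this slide work the same way. The sketch below uses base R only, with small made-up inputs and the built-in `iris` data set.

```r
x <- 1:5

# Explicit loop: square each element one at a time.
sq_loop <- numeric(length(x))
for (i in seq_along(x)) sq_loop[i] <- x[i]^2

# Vectorised form: the arithmetic applies to the whole vector at once.
sq_vec <- x^2

# sapply (a simplifying wrapper around lapply): apply a function to each list element.
lens <- sapply(list(a = 1:3, b = 1:10), length)

# tapply: apply a function within groups (mean sepal length per species).
means <- tapply(iris$Sepal.Length, iris$Species, mean)

# outer: apply a function to every pair of elements (a multiplication table).
tab <- outer(1:3, 1:3)

print(means)
print(tab)
```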

  • Slide 9

    R (cont.)

    > stack.loss[1:6]
    [1] 42 37 37 28 18 18
    > X <- cbind(1, stack.x)
    > head(X)
           Air.Flow Water.Temp Acid.Conc.
    [1,] 1       80         27         89
    [2,] 1       80         27         88
    [3,] 1       75         25         90
    [4,] 1       62         24         87
    [5,] 1       62         22         87
    [6,] 1       62         23         87
    > solve(t(X) %*% X) %*% t(X) %*% stack.loss
                      [,1]
               -39.9196744
    Air.Flow     0.7156402
    Water.Temp   1.2952861
    Acid.Conc.  -0.1521225
    > lm(stack.loss ~ stack.x)

    Call:
    lm(formula = stack.loss ~ stack.x)

    Coefficients:
          (Intercept)    stack.xAir.Flow  stack.xWater.Temp  stack.xAcid.Conc.
             -39.9197             0.7156             1.2953            -0.1521
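    The matrix expression on this slide is the least-squares normal-equation solution, and it gives the same coefficients as `lm()`. The sketch below repeats that comparison with the built-in `stackloss` data frame. Using `crossprod()` and the two-argument form of `solve()` is a choice made here, not something the slides show; it avoids forming the explicit inverse.

```r
# Design matrix with an intercept column, and the response vector.
X <- cbind(1, as.matrix(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")]))
y <- stackloss$stack.loss

# Solve the normal equations (X'X) b = X'y without inverting X'X explicitly.
beta <- solve(crossprod(X), crossprod(X, y))

# The same fit through the modelling interface.
fit <- lm(stack.loss ~ ., data = stackloss)

print(cbind(normal_equations = as.vector(beta), lm = unname(coef(fit))))
```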

  • Slide 10

    R (cont.)

    UNIX-like commands (Bell Labs)

    ls, rm, grep, apropos, find, vi, emacs (text editor), cat, head, tail, diff, paste, split

    Hidden objects: names starting with .

    > ls(pat="^p")
    [1] "pattern.features"
    > apropos("sum$")
    [1] "contr.sum" "cumsum"    "rowsum"    "sum"
    > head(iris[,1:2], n=3)
      Sepal.Length Sepal.Width
    1          5.1         3.5
    2          4.9         3.0
    3          4.7         3.2

    Bell Labs: S language, UNIX
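    A few of these UNIX-like functions, including how hidden objects behave, can be tried in a scratch environment. This is a sketch with hypothetical object names; the environment `env` only keeps the example from touching the user's workspace.

```r
env <- new.env()
assign("pattern.features", 1, envir = env)
assign("plain", 2, envir = env)
assign(".hidden", 3, envir = env)       # the name starts with ".", so ls() hides it

print(ls(env))                          # visible objects only
print(ls(env, all.names = TRUE))        # includes .hidden
print(ls(env, pattern = "^p"))          # filter names by regular expression

rm("plain", envir = env)                # remove an object, like rm for files

# grep works on character vectors rather than files.
hits <- grep("sum$", c("cumsum", "summary", "rowsum"), value = TRUE)
print(hits)

print(tail(iris[, 1:2], n = 3))         # last rows, like tail for a file
```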

  • Slide 11

    R (cont.)

    Graphics

    Graphics devices: bmp, jpeg, png, tiff, pdf, postscript, SVG (R 2.14)

    Other support: OpenGL, Spatial (ArcView, Google Maps), ...

    Low-level plot: points, lines, box, rect, polygon, text, title, mtext, legend, axis, grid

    High-level plot: plot, barplot, boxplot, pie, qqplot, ..., trellis (lattice package), rgl, sna, wordcloud, ...
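    The split between devices, high-level plots and low-level plots can be seen in a short base-R sketch. The built-in `cars` data set and the temporary output file are assumptions made for illustration.

```r
out <- tempfile(fileext = ".pdf")

pdf(out)                                   # open a graphics device (a PDF file)
plot(cars$speed, cars$dist,                # high-level call: sets up axes and draws points
     xlab = "Speed", ylab = "Distance")
lines(lowess(cars$speed, cars$dist))       # low-level call: add a smoothed line
grid()                                     # low-level call: add grid lines
legend("topleft", legend = "lowess", lty = 1)
title(main = "cars")                       # low-level call: add a title
dev.off()                                  # close the device so the file is written

file.exists(out)
```

    Swapping `pdf()` for another device function such as `png()` changes the output format without touching the plotting calls.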

  • Slide 12

    Using R

    http://www.kdnuggets.com/2011/08/poll-languages-for-data-mining-analytics.html

    http://blog.revolutionanalytics.com/2011/11/r-still-the-preferred-tool-of-predictive-modelers-competing-at-kaggle.html

  • Slide 13

    R Packaging System

    R ftp://cran.r-project.org/incoming [email protected]

    http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf

  • Slide 14

    The Data Flood

  • Slide 15

    "Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time." (Wikipedia)

    (Data Scientist) ... Hadoop, R (?)

  • Slide 16

    Data Science

    http://benfry.com/phd/dissertation-050312b-acrobat.pdf

  • Slide 17

    Why R in Big Data analysis?

    R (SAS, SPSS, numpy??)

    Hadoop paper (or book) + R packages ->

    R core (Revolution Analytics)

  • Slide 18

    R

    ff, bigmemory, RevoScaleR: GB, 10GB

    gc(), rm(); 32-bit: 2^31-1

    R 2.15: 2^51; no int64

    int64 package from Google

    64bit

    Single core: 1 CPU. R 2.14: parallel

    TB
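    Two of the limits noted on this slide can be checked directly in base R: the 32-bit integer ceiling of 2^31-1 and single-core execution. The sketch below also shows `rm()` with `gc()` and the `parallel` package. The workload is a toy example, and using two workers on non-Windows systems is an assumption made here.

```r
# R integers are 32-bit: the largest value is 2^31 - 1.
print(.Machine$integer.max == 2^31 - 1)

# Going past the limit in integer arithmetic yields NA (with a warning).
overflow <- suppressWarnings(.Machine$integer.max + 1L)
print(is.na(overflow))

# Doubles represent larger whole numbers, at the cost of being floating point.
print(2^31)

# Free memory explicitly: drop the object, then trigger garbage collection.
big <- numeric(1e6)
print(object.size(big), units = "MB")
rm(big)
invisible(gc())

# The parallel package spreads independent tasks over several cores.
# mclapply forks worker processes; Windows only supports mc.cores = 1.
library(parallel)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:4, function(i) i^2, mc.cores = cores)
print(unlist(res))
```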

  • Slide 19

    Why Hadoop for Big Data Analysis?

    "Hadoop has become the kernel of the distributed operating system for Big Data"

    From Doug Cutting's keynote at Hadoop World 2011

  • Slide 20

    The Marriage of Hadoop and R (1)

    R, Hadoop

    R shell. R map/reduce?

    R? PMML?

  • Slide 21

    The Marriage of Hadoop and R (2)

    RHIPE (R and Hadoop Integrated Processing Environment): Purdue Univ., Saptarshi Guha

    R, Hadoop MapReduce; Amazon EC2 (http://www.stat.purdue.edu/~sguha/rhipe/doc/html/ec2.html)

    RHadoop: Revolution Analytics

    Facebook: R + RHIPE. Guha's lecture: http://www.lecturemaker.com/2011/02/rhipe/

  • Slide 22

    Map/Reduce?

    Pig, Streaming, native map/reduce with Java, RHipe, RHadoop, ...
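    All of these options implement the same map, shuffle and reduce pattern. The sketch below runs that pattern locally in base R as a word count. It does not use the RHIPE or RHadoop API or touch Hadoop at all; the input strings and the helper `map_fn` are hypothetical.

```r
docs <- c("hadoop and r", "r and hive", "hadoop")

# Map: turn each document into a list of (key, value) pairs, one per word.
map_fn <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1L))
}
pairs <- unlist(lapply(docs, map_fn), recursive = FALSE)

# Shuffle: group the emitted values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, integer(1))
grouped <- split(values, keys)

# Reduce: combine the values of each key, here by summing them.
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

    On a cluster the map and reduce steps run on different machines and the framework performs the shuffle; only the two user-supplied functions keep the same shape.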

  • Slide 23

    NexR's Way for Big Data Analysis

    Map/Reduce for data analysis?

    SQL for data analysis!

    select * from foo;

  • Slide 24

    RHive Sample: Flight Delay Prediction

    library(RHive)
    rhive.connect("127.0.0.1")

    # get a training data set from Hive
    trainset

  • Slide 25

    R

  • Slide 26

    KRUG (Korean R Users Group)

    GNU, R 2007 1

    http://www.r-project.kr/

    http://www.openstatistics.net

    R User Conference

    Online: Q&A

    Offline: Meetup

    White paper, Blog

  • Slide 27

    Korea R CRAN Mirror Statistics

    6 24,000

    3,452 R 3,392

  • Slide 28

    Q & A

    http://[email protected]://[email protected]