Part I: Introductory Materials: Introduction to R


Post on 14-Jan-2016









  • Part I: Introductory Materials: Introduction to R

    Dr. Nagiza F. Samatova
    Department of Computer Science, North Carolina State University
    and
    Computer Science and Mathematics Division, Oak Ridge National Laboratory

  • What is R and why do we use it?

    Open source, most widely used for statistical analysis and graphics
    Extensible via dynamically loadable add-on packages
    >1,800 packages on CRAN

      > dyn.load(
      > .C("foobar")
      > dyn.unload( )
      > v = rnorm(256)
      > A = matrix(v, 16, 16)
      > summary(A)
      > library(fields)
      > image.plot(A)

  • Why R?

  • The Programmer's Dilemma

    What programming language to use, and why?

  • Features of R

    R is an integrated suite of software for data manipulation, calculation, and graphical display

    Effective data handling
    Various operators for calculations on arrays/matrices
    Graphical facilities for data analysis
    Well-developed language including conditionals, loops, recursive functions and I/O capabilities

  • Basic usage: arithmetic in R

    You can use R as a calculator
    Typed expressions will be evaluated and printed out
    Main operations: +, -, *, /, ^
    Obeys order of operations
    Use parentheses to group expressions
    More complex operations appear as functions
      sqrt(2)
      sin(pi/4), cos(pi/4), tan(pi/4), asin(1), acos(1), atan(1)
      exp(1), log(2), log10(10)
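
The calculator behaviour described on this slide can be tried with a few lines like the following. This is a minimal sketch; the variable names (`a`, `b`, `root`, `s`, `e1`, `l10`) are illustrative and not taken from the slides.

```r
# Order of operations: ^ is applied before *, and * before +
a <- 2 + 3 * 4^2        # evaluated as 2 + (3 * 16)
b <- (2 + 3) * 4^2      # parentheses force the addition first

# More complex operations are ordinary function calls
root <- sqrt(2)
s    <- sin(pi / 4)
e1   <- exp(1)
l10  <- log10(10)

print(c(a, b))
print(c(root, s, e1, l10))
```

At the interactive prompt the assignments are optional: typing `2 + 3 * 4^2` alone evaluates the expression and prints the result.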

  • Getting help

    help(function_name)
      help(prcomp)
    ?function_name
    ??topic
    Search CRAN
      http://www.r-project.org
    From R GUI: Help, then Search help
    CRAN Task Views (packages grouped by topic)

  • Variables and assignment

    Use variables to store values
    Three ways to assign variables
      a = 6
      a <- 6
      6 -> a
    Update variables by using the current value in an assignment
      x = x + 1
    Naming rules
      Can include letters, numbers, ., and _
      Names are case sensitive
      Must start with . or a letter
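
A short sketch of the assignment and naming rules follows. It assumes the three assignment forms meant by the slide are `=`, `<-`, and `->`, which are the assignment operators R provides; all variable names are made up for illustration.

```r
a = 6          # assignment with =
b <- 7         # assignment with <-
8 -> d         # right-hand assignment with ->

x <- 1
x = x + 1      # update a variable using its current value

.my_value2 <- a + b + d   # names may contain letters, digits, . and _

Total <- 0     # names are case sensitive:
total <- 1     # Total and total are two different variables
```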

  • R Commands

    Commands can be expressions or assignments
    Separate by semicolon or new line
    Can split across multiple lines
      R will change prompt to + if command not finished
    Useful commands for variables
      ls(): List all stored variables
      rm(x): Delete one or more variables
      class(x): Describe what type of data a variable stores
      save(x, file="filename"): Store variable(s) to a binary file
      load("filename"): Load all variables from a binary file
      Save/load in current directory or My Documents by default
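
The variable-management commands above can be exercised as a round trip: save two variables, delete them, and load them back. This is only a sketch; the variable names and the temporary file path are hypothetical, and `tempfile()` is used so that nothing is written to the working directory.

```r
x <- c(1.5, 2.5)
label <- "demo"

class(x)        # type of data stored in x
class(label)    # type of data stored in label

path <- tempfile(fileext = ".RData")   # hypothetical location for the binary file
save(x, label, file = path)            # store both variables

rm(x, label)                           # delete them from the session
load(path)                             # restore them from the file

ls()                                   # x and label are listed again
```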

  • Vectors and vector operations

    To create a vector:
      # c() command to create vector x
      x=c(12,32,54,33,21,65)
      # c() to add elements to vector x
      x=c(x,55,32)
      # seq() command to create a sequence of numbers
      years=seq(1990,2003)
      # to count in steps of .5
      a=seq(3,5,.5)
      # can use : to step by 1
      years=1990:2003;
      # rep() command to create data that follow a regular pattern
      b=rep(1,5)
      c=rep(1:2,4)

  • Matrices & matrix operations

  • Useful functions for vectors and matrices

    Find # of elements or dimensions
      length(v), length(A), dim(A)
    Transpose
      t(v), t(A)
    Matrix inverse
      solve(A)
    Sort vector values
      sort(v)
    Statistics
      min(), max(), mean(), median(), sum(), sd(), quantile()
      These treat a matrix as a single vector (same with sort())
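
As an illustration of these functions, the sketch below builds a small 2 x 2 matrix from a vector and applies each one. The values are arbitrary, chosen only so that the matrix is invertible; `matrix()` fills its entries column by column.

```r
v <- c(3, 1, 2, 4)
A <- matrix(v, nrow = 2, ncol = 2)   # columns are (3, 1) and (2, 4)

length(v)        # number of elements in the vector
length(A)        # number of elements in the matrix
dim(A)           # rows and columns

At     <- t(A)       # transpose
Ainv   <- solve(A)   # matrix inverse
sorted <- sort(v)    # sorted copy of the vector
m      <- mean(A)    # the matrix is treated as one long vector

A %*% Ainv           # approximately the identity matrix
```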

  • Graphical display and plotting

    Most common plotting function is plot()
      plot(x,y) plots y vs x
      plot(x) plots x vs 1:length(x)
    plot() has many options for labels, colors, symbol, size, etc.
      Check help with ?plot
    Use points(), lines(), or text() to add to an existing plot
    Use x11() to start a new output window
    Save plots with png(), jpeg(), tiff(), or bmp()
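
The plotting workflow on this slide might look like the sketch below, which draws into a PNG file rather than a window. It assumes an R build with PNG support; the data, labels, and temporary output path are illustrative only.

```r
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

out_file <- tempfile(fileext = ".png")   # hypothetical output path
png(out_file, width = 600, height = 400) # open a PNG graphics device

plot(x, y, type = "l", xlab = "x", ylab = "sin(x)", main = "Sine curve")
idx <- seq(1, 50, by = 5)
points(x[idx], y[idx], col = "red")      # add points to the existing plot
lines(x, cos(x), lty = 2)                # add a second, dashed curve
text(pi, 0.5, "dashed: cos(x)")          # add a text annotation

dev.off()                                # close the device and write the file
```

Calling `plot()` without first opening a file device draws to the screen instead.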

  • R Packages

    R functions and datasets are organized into packages
      Packages base and stats include many of the built-in functions in R
      CRAN provides thousands of packages contributed by R users
    Package contents are only available when loaded
      Load a package with library(pkgname)
    Packages must be installed before they can be loaded
      Use library() to see installed packages
      Use install.packages("pkgname") to install a package and update.packages() to update installed packages
      Can also run R CMD INSTALL pkgname.tar.gz from command line if you have downloaded package source
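
A minimal sketch of checking for and loading packages is shown below. It uses `stats`, which ships with R, and tests for the `fields` package mentioned earlier in the slides without assuming it is installed. The installation call is left commented out because it needs network access.

```r
library(stats)                        # load a package that ships with R

# Is a package installed? requireNamespace() returns TRUE or FALSE
has_fields <- requireNamespace("fields", quietly = TRUE)

if (!has_fields) {
  message("fields is not installed; install.packages(\"fields\") would fetch it from CRAN")
  # install.packages("fields")       # uncomment to install (requires network access)
}

search()                              # packages currently attached to the session
```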

  • Exploring the iris data

    Load iris data into your R session:
      data (iris);
      help (data);
    Check that iris was indeed loaded:
      ls ();
    Check the class that the iris object belongs to:
      class (iris);
      Read Sections 3.4 and 6.3 in Introduction to R
    Print the content of iris data:
      iris;
    Check the dimensions of the iris data:
      dim (iris);
    Check the names of the columns:
      names (iris);

  • Exploring the iris data (cont.)

    Plot Petal.Length vs. Petal.Width:
      plot (iris[ , 3], iris[ , 4]);
      example(plot)
    Exercise: create a plot similar to this figure:

    Src: Figure is from Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

  • Large data sets are better loaded through the file input interface in R

    Reading a table of data can be done using the read.table() command:
      a
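
Since the slide's own example is cut off, here is a self-contained sketch of `read.table()`. The file contents, column names, and temporary path are invented for illustration; the file is written first so the example can run anywhere.

```r
# Write a small whitespace-separated table to a temporary file
data_path <- tempfile(fileext = ".txt")      # hypothetical input file
writeLines(c("name height weight",
             "ann 160 55",
             "bob 175 72",
             "cy 168 64"),
           data_path)

# header = TRUE uses the first line as column names
a <- read.table(data_path, header = TRUE, stringsAsFactors = FALSE)

dim(a)            # rows and columns of the resulting data frame
names(a)          # column names
mean(a$height)    # columns are accessed with $
```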
  • Programming in R

    The following slides assume a basic understanding of programming concepts

    For more information, please see chapters 9 and 10 of the R manual.

    Additional resources
      Beginning R: An Introduction to Statistical Programming by Larry Pace
      Introduction to R webpage on APSnet
      R Inferno

  • Conditional statements

    Perform different commands in different situations
      if (condition) command_if_true
      Can add else command_if_false to end
    Group multiple commands together with braces {}
      if (cond1) {cmd1; cmd2;} else if (cond2) {cmd3; cmd4;}
    Conditions use relational operators
      ==, !=, <, >, <=, >=
    Do not confuse = (assignment) with == (equality)
      = is a command, == is a question
    Combine conditions with and (&&) and or (||)
      Use & and | for vectors of length > 1 (element-wise)
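
The conditional forms above can be sketched as follows. The values and variable names are arbitrary; the last part contrasts `&&`, which combines single conditions, with `&`, which works element by element on vectors.

```r
x <- 7

# if / else with braces grouping the commands
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# if / else if / else chain
if (x < 0) {
  sign_label <- "negative"
} else if (x == 0) {
  sign_label <- "zero"
} else {
  sign_label <- "positive"
}

# && combines two single (length-one) conditions
if (x > 0 && x < 10) { in_range <- TRUE } else { in_range <- FALSE }

# & combines conditions element-wise on vectors
v <- c(1, 5, 12)
flags <- v > 2 & v < 10
```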

  • Loops

    Most common type of loop is the for loop
      for (x in v) { loop_commands; }
      v is a vector, commands repeat for each value in v
      Variable x becomes each value in v, in order
    Example: adding the numbers 1-10
      total = 0; for (x in 1:10) total = total + x;
    Other type of loop is the while loop
      while (condition) { loop_commands; }
      Condition is identical to if statement
      Commands are repeated until condition is false
      Might execute commands 0 times if already false
    while loops are useful when you don't know number of iterations
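
Both loop types can be seen side by side in this short sketch. The `for` loop repeats the slide's sum of 1 to 10; the `while` loop is an invented example of a case where the number of iterations is not known in advance (doubling a number until it reaches at least 100).

```r
# for loop: x takes each value of 1:10 in order
total <- 0
for (x in 1:10) {
  total <- total + x
}

# while loop: repeat until the condition becomes false
n <- 1
steps <- 0
while (n < 100) {
  n <- n * 2
  steps <- steps + 1
}

print(c(total, n, steps))
```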

  • Scripting in R

    A script is a sequence of R commands that perform some common task
      E.g., defining a specific function, performing some analysis routine, etc.
    Save R commands in a plain text file
      Usually have extension of .R
    Run scripts with source():
      source("filename.R")
    To save command output to a file, use sink():
      sink("output.Rout")
      sink() restores output to console
      Can be used with or outside of a script
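
The `source()` and `sink()` workflow can be sketched end to end as below. The script contents, the helper name `square`, and both temporary file paths are hypothetical; the script is written from within R only so the example is self-contained.

```r
# Create a tiny script file (normally written in a text editor)
script_path <- tempfile(fileext = ".R")       # hypothetical script file
writeLines(c("square <- function(x) x^2",
             "result <- square(4)"),
           script_path)

source(script_path)          # run the script; square and result now exist

out_path <- tempfile(fileext = ".Rout")       # hypothetical output file
sink(out_path)               # send printed output to the file
print(result)
sink()                       # restore output to the console

captured <- readLines(out_path)
print(captured)
```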

  • Objects containing an ordered collection of objects

    Components do not have to be of same type
    Use list() to create a list:
      a
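
Because the slide's list example is truncated, the following is a stand-in sketch rather than the original code. The component names and values are invented; it shows components of different types and the usual ways to access them.

```r
# Components can have different types: character, integer, numeric vector
a <- list(name = "iris", n = 150L, dims = c(150, 5))

a$name          # access a component by name
a[["n"]]        # same, using double brackets
a[[3]][2]       # second element of the third component
length(a)       # number of components

a$note <- "added later"   # add a new component
names(a)
```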
  • Writing functions in R

    A function is defined by an assignment like:
      a
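
The slide's function example is cut off, so the sketch below illustrates the general form `name <- function(arguments) { body }` with an invented function. `rescale`, its arguments, and its default values are hypothetical and are not taken from the slides.

```r
# Linearly rescale a numeric vector to the interval [lower, upper]
rescale <- function(x, lower = 0, upper = 1) {
  rng <- range(x)                         # minimum and maximum of x
  # the value of the last expression is returned
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

r1 <- rescale(c(2, 4, 6))                 # uses the default arguments
r2 <- rescale(c(2, 4, 6), lower = 10, upper = 20)
print(r1)
print(r2)
```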
  • How do I get started with R (Linux)?

    Step 1: Download R
      mkdir for RHOME; cd $RHOME
      wget
    Step 2: Install R
      tar zxvf R-2.9.1.tar.gz
      ./configure --prefix= --enable-R-shlib
      make
      make install
    Step 3: Run R
      Update env. variables in $HOME/.bash_profile:
      export PATH=/bin:$PATH
      export R_HOME=
      R

  • Useful R links

    R Home:
    R's CRAN package distribution:
    Introduction to R manual:
    Writing R extensions:
    Other R documentation:

    Go to the CRAN web site and show the libraries.

    Do a head count of who uses each tool routinely or preferentially.

    There are other commercial packages: S-PLUS, SAS

    Much more limited data mining tools: Weka, Dap, gretl

    Matlab for matrices
    IDL for visualization
    R for statistics; open, extensible, with a large community

    Other Open Source Projects: Dap, gretl, Weka

    Other Proprietary Projects: IDL, Matlab, Mathematica

    Dap: a small statistics and graphics package based on C. Anyone familiar with the basic syntax of C programs can learn to use the C-style features of Dap quickly and easily; advanced features of C are not necessary, although they are available.

    Gretl: GNU Regression, Econometrics and Time-series Library, for econometric analysis.

    Weka: a collection of machine learning algorithms for data mining tasks.

    Low-level languages can yield good performance, but productivity often suffers. In contrast, high-level languages can offer good productivity at the expense of performance.

    Since many scientists use scripting environments, such as R, MATLAB and IDL, pR strives to bring the performance of low-level languages into high-level languages, specifically R, by enabling third-party compiled routines to execute transparently within R.

    Examples of parallel third-party libraries: ScaLAPACK, PETSc


