Introduction to the R language
Ahmed Rebai
Bioinformatics and Comparative Genome Analysis
Websites
• R: www.r-project.org
– software;
– documentation;
– RNews.
• Bioconductor: www.bioconductor.org
– software, data, and documentation;
– training materials from short courses;
– www.bioconductor.org/workshops/UCSC03/ucsc03.html
– mailing list.
R language: Overview
• Open source and open development.
• Design and deployment of portable, extensible, and scalable software.
• Interoperability with other languages: C, XML.
• Variety of statistical and numerical methods.
• High quality visualization and graphics tools.
• Effective, extensible user interface.
• Innovative tools for producing documentation and training materials: vignettes.
• Supports the creation, testing, and distribution of software and data modules: packages.
R user interface
• Batch or command line processing
bash$ R (to start)
R> q() (to quit)
• Graphics windows
> X11()
> postscript()
> dev.off()
• File path is relative to working directory
> getwd()
> setwd()
• Load a package library with library()
• GUIs, tcltk
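The working-directory and graphics-device calls listed on this slide can be tried together. This is a minimal sketch, not part of the original slides: the temporary directory and the file name example.ps are assumptions chosen so that nothing is written into a real project folder.

```r
# Assumption: tempdir() stands in for a project directory.
old_wd <- getwd()            # remember where we started
setwd(tempdir())             # relative paths now resolve against this directory

postscript("example.ps")     # open a file-based graphics device
plot(1:10)                   # draw into the open device
invisible(dev.off())         # close the device so the file is completed

ps_exists <- file.exists("example.ps")   # relative to the working directory
setwd(old_wd)                # restore the original working directory
```

Restoring the working directory at the end keeps later relative paths from silently pointing at the temporary folder.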
Getting Help
o Details about a specific command whose name you know (input arguments, options, algorithm):
> ? t.test
> help(t.test)
o See an example of usage:
> demo(graphics)
> example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.1))
[1] 8.75 5.50
Getting Help
o HTML search engine lets you search for topics with regular expressions:
> help.search
o Find commands containing a regular expression or object name:
> apropos("var")
 [1] "var.na"                 ".__M__varLabels:Biobase"
 [3] "varLabels"              "var.test"
 [5] "varimax"                "all.vars"
 [7] "var"                    "variable.names"
 [9] "variable.names.default" "variable.names.lm"
Getting Help
o Vignettes contain text and executable code:
> library(tkWidgets)
> vExplorer()
> openVignette()
Created using the Sweave() function. .Rnw files produce a PDF file and a vignette.
o To see code for a function, type the name with no parentheses or arguments:
> plot
R as a Calculator
> log2(32)
[1] 5
> print(sqrt(2))
[1] 1.414214
> pi
[1] 3.141593
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> 1+1:10
[1] 2 3 4 5 6 7 8 9 10 11
R as a Graphics Tool
> plot(sin(seq(0, 2*pi, length=100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index; x-axis from 0 to 100, y-axis from -1.0 to 1.0]
Variables
numeric
> a <- 49
> sqrt(a)
[1] 7
character string
> b <- "The dog ate my homework"
> sub("dog","cat",b)
[1] "The cat ate my homework"
logical
> c <- (1+1==3)
> c
[1] FALSE
> as.character(c)
[1] "FALSE"
Missing Values
Variables of each data type (numeric, character, logical) can also take the value NA: not available.
o NA is not the same as 0
o NA is not the same as ""
o NA is not the same as FALSE
o NA is not the same as NULL
Operations that involve NA may or may not produce NA:
> NA==1
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
> max(c(NA, 4, 7), na.rm=T)
[1] 7
> NA | TRUE
[1] TRUE
> NA & TRUE
[1] NA
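Because a comparison with NA itself yields NA, missing values are usually detected with is.na() rather than with ==. A short sketch using the same vector as the slide; the variable names are illustrative only.

```r
x <- c(NA, 4, 7)

missing   <- is.na(x)              # TRUE FALSE FALSE
n_missing <- sum(missing)          # number of missing values: 1
m         <- mean(x, na.rm = TRUE) # mean of the observed values: 5.5

# Comparing with == never detects NA: every element of the result is NA.
cmp <- (x == NA)
```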
Vectors
vector: an ordered collection of data of the same type
> a <- c(1,2,3)
> a*2
[1] 2 4 6
Example: the mean spot intensities of all 15488 spots on a microarray form a numeric vector.
In R, a single number is the special case of a vector with 1 element.
Other vector types: character strings, logical
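A small sketch of the three vector types and of the "same type" rule. The intensity values are made up for illustration and do not come from the slides.

```r
# Hypothetical data, for illustration only.
intensities <- c(120.5, 98.2, 310.0)         # numeric vector
genes       <- c("CENP-F", "Ly-9", "MLN50")  # character vector
flags       <- intensities > 100             # logical vector, computed element by element

single_len <- length(5)        # a single number is a vector of length 1

# Mixing types forces everything to one common type (here: character).
mixed <- c(1, "a", TRUE)
```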
Matrices and Arrays
matrix: rectangular table of data of the same type
Example: the expression values for 10000 genes for 30 tissue biopsies form a numeric matrix with 10000 rows and 30 columns.
array: generalization of a matrix to 3, 4, ... dimensions
Example: the red and green foreground and background values for 20000 spots on 120 arrays form a 4 x 20000 x 120 (3D) array.
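The two examples above could be set up as follows. This is only a sketch: the objects are filled with zeros, the names are hypothetical, and the 3D array is scaled down (200 spots, 12 arrays) to keep memory use small.

```r
# Matrix with one row per gene and one column per biopsy (zeros as placeholder data).
expr <- matrix(0, nrow = 10000, ncol = 30)

# 3D array: 4 channels x spots x arrays (scaled down from 4 x 20000 x 120).
spots <- array(0, dim = c(4, 200, 12))

expr_dim  <- dim(expr)    # 10000 30
spots_dim <- dim(spots)   # 4 200 12

one_value <- spots[1, 5, 2]   # channel 1, spot 5, array 2
```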
Lists
list: ordered collection of data of arbitrary types.
Example:
> doe <- list(name="john",age=28,married=F)
> doe$name
[1] "john"
> doe$age
[1] 28
> doe[[3]]
[1] FALSE
Typically, vector elements are accessed by their index (an integer) and list elements by $name (a character string). Both types can also be indexed by position or by name inside brackets, but $ works only on lists, not on atomic vectors. Slots are accessed by @name.
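The access methods can be compared side by side. A brief sketch; the named vector v is an assumption added for illustration.

```r
doe <- list(name = "john", age = 28, married = FALSE)

by_dollar   <- doe$age        # list element by $name
by_name     <- doe[["age"]]   # list element by name in double brackets
by_position <- doe[[2]]       # list element by position

v <- c(first = 10, second = 20)   # hypothetical named numeric vector
v_name  <- v[["second"]]          # vector element by name
v_index <- v[[2]]                 # vector element by position

# $ is not defined for atomic vectors; the error is caught here.
dollar_on_vector <- tryCatch(v$second, error = function(e) "error")
```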
Data Frames
data frame: rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Represents the typical data table that researchers come up with, like a spreadsheet.
Example:
> a <- data.frame(localization,tumorsize,progress,row.names=patients)
> a
      localization tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
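The example assumes that the vectors localization, tumorsize, progress and patients already exist. A sketch that defines them from the values shown in the printed table, so the call can be run as is:

```r
# Input vectors reconstructed from the table shown on the slide.
patients     <- c("XX348", "XX234", "XX987")
localization <- c("proximal", "distal", "proximal")
tumorsize    <- c(6.3, 8.0, 10.0)
progress     <- c(FALSE, TRUE, FALSE)

a <- data.frame(localization, tumorsize, progress, row.names = patients)

# One class per column. In R versions before 4.0 the text column
# is converted to a factor by default instead of staying character.
col_classes <- sapply(a, class)
```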
What type is my data?
class                  Class from which object inherits (vector, matrix, function, logical, list, …)
mode                   Numeric, character, logical, …
storage.mode, typeof   Mode used by R to store object (double, integer, character, logical, …)
is.function            Logical (TRUE if function)
is.na                  Logical (TRUE if missing)
names                  Names associated with object
dimnames               Names for each dim of array
slotNames              Names of slots of BioC objects
attributes             Names, class, etc.
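A quick sketch of how these inspection functions differ on the same objects; the objects themselves are arbitrary examples.

```r
x <- 1:3
m <- matrix(c(1.5, 2, 3, 4), nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

x_type  <- typeof(x)        # "integer"
m_mode  <- mode(m)          # "numeric"
m_type  <- typeof(m)        # "double"
m_store <- storage.mode(m)  # "double"
m_class <- class(m)         # includes "matrix" (plus "array" in R >= 4.0)

m_dimnames <- dimnames(m)          # row and column names
m_attrs    <- names(attributes(m)) # "dim" and "dimnames"
mean_is_fn <- is.function(mean)    # TRUE
```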
Subsetting
Individual elements of a vector, matrix, array or data frame are accessed with "[ ]" by specifying their index, or their name
> a
      localization tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> a[3, 2]
[1] 10
> a["XX987", "tumorsize"]
[1] 10
> a["XX987",]
      localization tumorsize progress
XX987     proximal        10        0
Example:
> a
      localization tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
subset rows by a vector of indices
> a[c(1,3),]
      localization tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> a[-c(1,2),]
      localization tumorsize progress
XX987     proximal      10.0        0
subset rows by a logical vector
> a[c(T,F,T),]
      localization tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
subset columns
> a$localization
[1] "proximal" "distal" "proximal"
comparison resulting in logical vector
> a$localization=="proximal"
[1] TRUE FALSE TRUE
subset the selected rows
> a[ a$localization=="proximal", ]
      localization tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
Functions and Operators
Functions do things with data
"Input": function arguments (0,1,2,…)
"Output": function result (exactly one)
Example:
add <- function(a,b) {
    result <- a+b
    return(result)
}
Operators: Short-cut writing for frequently used functions of one or two arguments.
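That operators are shorthand for functions can be checked directly. In this sketch the backtick call form is standard R; the %+% operator is a hypothetical user-defined example, not something from the slides.

```r
add <- function(a, b) {
  result <- a + b
  return(result)
}

via_function <- add(2, 3)   # 5
via_operator <- 2 + 3       # 5
via_backtick <- `+`(2, 3)   # the operator called like an ordinary function: 5

# Hypothetical user-defined binary operator: any name of the form %name% works.
`%+%` <- function(x, y) paste(x, y)
joined <- "gene" %+% "list"   # "gene list"
```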
Frequently used operators
<- Assign
+ Sum
- Difference
* Multiplication
/ Division
^ Exponent
%% Mod
%*% Matrix product (dot product for two vectors)
%/% Integer division
%in% Membership (is an element of)
| Or
& And
< Less
> Greater
<= Less or =
>= Greater or =
! Not
!= Not equal
== Is equal
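The less familiar %-operators in the table behave as in this short sketch; the numbers are arbitrary.

```r
remainder <- 7 %% 3    # modulo: 1
quotient  <- 7 %/% 3   # integer division: 2

m  <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
mm <- m %*% m                 # matrix product: rows (7, 15) and (10, 22)

dot <- c(1, 2, 3) %*% c(4, 5, 6)   # two vectors give a 1 x 1 matrix holding 32

is_member <- c("b", "z") %in% c("a", "b", "c")   # TRUE FALSE

# & and | work element by element on logical vectors.
both   <- c(TRUE, FALSE) & c(TRUE, TRUE)   # TRUE FALSE
either <- c(TRUE, FALSE) | c(FALSE, FALSE) # TRUE FALSE
```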
Frequently used functions
c                   Concatenate
cbind, rbind        Concatenate vectors
min                 Minimum
max                 Maximum
length              # values
dim                 # rows, cols
floor               Largest integer not greater than the value
which               TRUE indices
table               Counts
summary             Generic stats
sort, order, rank   Sort, order, rank a vector
print               Show value
cat                 Print as char
paste               c() as char
round               Round
apply               Repeat over rows, cols
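Several of these functions are easiest to tell apart on a tiny input. A sketch with arbitrary values:

```r
x <- c(3, 1, 2, 3)

idx    <- which(x == 3)       # positions where the condition is TRUE: 1 4
counts <- table(x)            # how often each value occurs
sorted <- sort(x)             # 1 2 3 3
ord    <- order(x)            # permutation that sorts x: 2 3 1 4
rnk    <- rank(x)             # ranks, ties averaged: 3.5 1.0 2.0 3.5

m <- rbind(c(1, 2, 3), c(4, 5, 6))   # 2 x 3 matrix built row by row
row_sums <- apply(m, 1, sum)         # over rows: 6 15
col_max  <- apply(m, 2, max)         # over columns: 4 5 6

labels <- paste("gene", 1:2)   # "gene 1" "gene 2"
fl     <- floor(2.7)           # 2
rd     <- round(2.567, 1)      # 2.6
```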
Statistical functions
rnorm, dnorm, pnorm, qnorm   Normal distribution random sample, density, cdf and quantiles
lm, glm, anova               Model fitting
loess, lowess                Smooth curve fitting
sample                       Resampling (bootstrap, permutation)
.Random.seed                 Random number generation
mean, median                 Location statistics
var, cor, cov, mad, range    Scale statistics
svd, qr, chol, eigen         Linear algebra
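A brief sketch that touches several rows of this table. The simulated data, the seed value and the true slope of 2 are assumptions made for the example; set.seed() is used here as the usual way to initialise .Random.seed reproducibly.

```r
set.seed(1)                 # fixes .Random.seed so the run is reproducible

x <- rnorm(100)             # random sample from the standard normal
p <- pnorm(0)               # cdf at 0: 0.5
q <- qnorm(0.975)           # 97.5% quantile, about 1.96
d <- dnorm(0)               # density at 0, about 0.399

# Simulated response with a true slope of 2 and a little noise.
y   <- 2 * x + rnorm(100, sd = 0.1)
fit <- lm(y ~ x)            # linear model fit
slope <- coef(fit)[["x"]]   # estimated slope, close to 2

boot <- sample(x, replace = TRUE)   # one bootstrap resample of x
perm <- sample(x)                   # one random permutation of x
```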
Graphical functions
plot Generic plot, e.g. scatter
points Add points
lines, abline Add lines
text, mtext Add text
legend Add a legend
axis Add axes
box Add box around all axes
par Plotting parameters (lots!)
colors, palette Use colors
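Most of these functions add to a plot that already exists, so they are typically called after plot(). A sketch that writes to a PDF file in the temporary directory; the file name and the choice of curves are assumptions for illustration.

```r
out <- file.path(tempdir(), "sine.pdf")   # hypothetical output file
pdf(out)                                  # file-based device, works without a display

x <- seq(0, 2 * pi, length.out = 100)
plot(x, sin(x), type = "n", axes = FALSE, # empty plot: sets up the coordinate system only
     xlab = "x", ylab = "value")
lines(x, sin(x), col = "blue")            # add a line
every10 <- seq(1, 100, by = 10)
points(x[every10], cos(x[every10]), col = "red")  # add points
abline(h = 0, lty = 2)                    # add a horizontal reference line
axis(1)                                   # add the x axis
axis(2)                                   # add the y axis
box()                                     # add the surrounding box
mtext("sin and cos", side = 3)            # add text in the top margin
legend("topright", legend = c("sin", "cos"),
       col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 1))

invisible(dev.off())                      # close the device to finish the file
```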
Branching
if (logical expression) {
    statements
} else {
    alternative statements
}
else branch is optional
{ } are optional with one statement
ifelse (logical expression, yes statement, no statement)
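The two forms differ in what they test: if() looks at a single logical value, while ifelse() works through a logical vector element by element and returns a vector. A short sketch with made-up sizes and a made-up threshold:

```r
size <- 8.0
if (size > 7) {
  label <- "large"
} else {
  label <- "small"
}

# ifelse() applies the same choice to every element of a vector.
sizes  <- c(6.3, 8.0, 10.0)
labels <- ifelse(sizes > 7, "large", "small")   # "small" "large" "large"
```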
Loops
When the same or similar tasks need to be performed multiple times; for all elements of a list; for all columns of an array; etc.
for(i in 1:10) {
    print(i*i)
}
i<-1
while(i<=10) {
    print(i*i)
    i<-i+sqrt(i)
}
Also: repeat, break, next
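The three remaining keywords can be seen together in one loop. This sketch collects the squares of the odd numbers below 10; the task itself is only an illustration.

```r
squares <- c()
i <- 0
repeat {                      # loops until break is reached
  i <- i + 1
  if (i %% 2 == 0) next       # skip even numbers, go to the next iteration
  if (i > 9) break            # leave the loop
  squares <- c(squares, i * i)
}
# squares now holds 1 9 25 49 81

# The same result without an explicit loop, using vectorized arithmetic.
squares_vec <- seq(1, 9, by = 2)^2
```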
Regular Expressions
Tools for text matching and replacement which are available in similar forms in many programming languages (Perl, Unix shells, Java)
> a <- c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")
> grep("L", a)
[1] 2 3 5
> grep("L", a, value=T)
[1] "Ly-9" "MLN50" "CLH-17"
> grep("^L", a, value=T)
[1] "Ly-9"
> grep("[0-9]", a, value=T)
[1] "Ly-9" "MLN50" "ZNF191" "CLH-17"
> gsub("[0-9]", "X", a)
[1] "CENP-F" "Ly-X" "MLNXX" "ZNFXXX" "CLH-XX"
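Two related functions round out the picture: grepl() returns a logical vector instead of indices, and sub() replaces only the first match in each string where gsub() replaces all of them. A sketch on the same gene names:

```r
a <- c("CENP-F", "Ly-9", "MLN50", "ZNF191", "CLH-17")

starts_with_c <- grepl("^C", a)        # TRUE FALSE FALSE FALSE TRUE

first_digit <- sub("[0-9]", "X", a)    # only the first digit is replaced
all_digits  <- gsub("[0-9]", "X", a)   # every digit is replaced

# Remove everything from the first hyphen to the end of the string.
prefix <- sub("-.*$", "", a)
```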
Storing Data
Every R object can be stored into and restored from a file with the commands "save" and "load".
This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MS-Windows, Unix, Mac.
> save(x, file="x.Rdata")
> load("x.Rdata")
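A round trip shows that load() restores the object under its original name. In this sketch the object and the temporary file location are assumptions for illustration.

```r
x <- c(a = 1, b = 2)                         # any R object
path <- file.path(tempdir(), "x.Rdata")      # hypothetical file location

save(x, file = path)   # write the object, together with its name, to the file
rm(x)                  # remove it from the workspace

restored <- load(path) # recreates x; the return value lists the restored names
```

Because load() recreates objects under their saved names, it can silently overwrite an existing object of the same name.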
Importing and Exporting Data
There are many ways to get data in and out.
Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab-delimited text files.
> x <- read.delim("filename.txt")
Also: read.table, read.csv, scan
> write.table(x, file="x.txt", sep="\t")
Also: write.matrix, write
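Writing a table and reading it back is a simple way to check that both sides agree on the format. A sketch with a small made-up data frame; the quote and row.names settings are choices made for this example, not defaults.

```r
# Hypothetical table, for illustration only.
x <- data.frame(gene = c("CENP-F", "Ly-9"), value = c(1.5, 2.25))
path <- file.path(tempdir(), "x.txt")

# Tab-delimited, no quotes around text, no extra column of row names.
write.table(x, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() expects a header line and tab-separated fields.
y <- read.delim(path, stringsAsFactors = FALSE)
```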
Importing Data: caveats
Type conversions: by default, the read functions try to guess and auto convert the data types of the different columns (e.g. number, factor, character). There are options as.is and colClasses to control this.
Special characters: the delimiter character (space, comma, tab) and the end-of-line character cannot be part of a data field. To circumvent this, text may be "quoted". However, if this option is used (the default), then the quote characters themselves cannot be part of a data field. Except if they themselves are within quotes… Understand the conventions your input files use and set the quote options accordingly.
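Both caveats show up in a small file. In this sketch the file contents are invented: the id column has leading zeros that type guessing would drop, and one name contains an apostrophe, which functions that treat ' as a quote character (such as read.table with its default settings) would misread.

```r
path <- file.path(tempdir(), "ids.txt")   # hypothetical input file
writeLines(c("id\tname\tscore",
             "007\tO'Brien\t1.5",
             "042\tSmith\t2"), path)

# Default type guessing: the id column becomes integer and loses its zeros.
guessed <- read.delim(path)

# Explicit column types and no quote characters keep the fields as written.
kept <- read.delim(path, quote = "",
                   colClasses = c("character", "character", "numeric"))
```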
Bioconductor
• R software project for the analysis and comprehension of biomedical and genomic data.
– Gene expression arrays (cDNA, Affymetrix)
– Pathway graphs
– Genome sequence data
• Started in 2001 by Robert Gentleman, Dana-Farber Cancer Institute.
• About 25 core developers, at various institutions in the US and Europe.
• Tools for integrating biological metadata from the web (annotation, literature) in the analysis of experimental metadata.
• End-user and developer packages.
Example R/BioC Packages
methods                          Class/method tools
tools, tkWidgets                 Sweave(), gui tools
marrayTools, marrayPlots         Spotted cDNA array analysis
affy                             Affymetrix array analysis
annotate                         Link microarray data to metadata on the web
mva, cluster, clust, class       Clustering and classification
t.test, prop.test, wilcox.test   Statistical tests