Post on 03-Jan-2016
To err is human – to R is divine
R from step 1 for the experimental biologist with an
eye on the tomoRRow!
Schraga Schwartz, Bioinformatic Workshop, February 2010
Outline
• Why R
• How R
• iRis
• Down syndRome
• WheRe R
Why R?
The R programming language is a lot like magic... except that instead of spells you have functions.
SPSS / Excel user = muggle
SPSS and Excel users are like muggles. They are limited in their ability to change their environment. The way they approach a problem is constrained by how programmers employed by SPSS or Microsoft thought it should be approached. And they have to pay money to use this constraining software.
R user = wizard
R users are like wizards. They can rely on functions (spells) that have been developed for them by statistical researchers, but they can also create their own. They don’t have to pay to use them, and once experienced enough (like Dumbledore), they are almost unlimited in their ability to change their environment.
R’s strengths
• Data management & manipulation
• Statistics
• Graphics
• Programming language
• Active user community
• Free!
R’s weaknesses
• Not user friendly at start.
• Minimal GUI.
• No commercial support.
• Substantially slower than other programming languages (e.g. Perl, Java, C++).
R graphics: the sky's the limit!
http://addictedtor.free.fr/graphiques/
How R?
R as a calculator
• Calculator: +, -, /, *, ^, log(), exp(), sqrt(), …
(17*0.35)^(1/3)
log(10)
exp(1)
3^-1
Variables in R
• Variables are assigned using either "=" or "<-"
x=12.6
x
[1] 12.6
Numeric vectors
A vector composed of numbers. Such a vector may be created:
1. Using the c() (short for concatenate) function:
y=c(3,7,9,11)
> y
[1] 3 7 9 11
2. Using the rep(what,how_many_times) function:
y=rep(3,30)
3. Using the ":" operator, signifying "a series of integers between":
y=1:30
Boolean vectors
A boolean variable can be either TRUE or FALSE.
b=c(TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)
sum(b) #number of "TRUE" elements
Vector manipulation
n=c(1,4,5,6,7,2,3,4,5,6) #creates a vector with the numbers in the brackets, stores it in n
length(n) #number of elements
n[3] #extract 3rd element in n
n[-2] #extract all of n but 2nd element
n[1:3] #extract first three elements of n
n[c(1,3,4)] #extract first, third, and fourth elements of n
Vector manipulation…
n+1 #add 1 to all elements in n
n*2 #multiply all elements in n by two
sum(n)
mean(n)
median(n)
var(n)
min(n)
max(n)
log(n) #natural log of every element in n
More advanced manipulation
n<4 #returns boolean vector of same length as n, with "TRUE" for each value smaller than 4 and FALSE for all other values.
n[n<4] #extract all elements in n smaller than 4
n[n<4 & n!=1] #extract elements smaller than 4 AND different from 1
n[n<4 | n!=1] #extract elements smaller than 4 OR different from 1
sum(n[n<4]) #sum of elements in n with values smaller than 4
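The logical indexing above can be checked step by step. This is a small sketch that reuses the vector n from the slides; which() is not mentioned in the slides and is added here only as one convenient way to get positions rather than values.

```r
# Same vector as in the slides
n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)

small <- n < 4            # logical vector, one TRUE/FALSE per element
print(small)

print(n[small])           # 1 2 3
print(which(small))       # positions of the TRUE values: 1 6 7

# AND keeps elements passing both tests; OR keeps elements passing at least one
both   <- n[n < 4 & n != 1]   # 2 3
either <- n[n < 4 | n != 1]   # all ten elements, since 1 itself is smaller than 4

print(both)
print(either)
print(sum(small))         # counts the TRUE values: 3
```

Note that the OR version returns the whole vector here: every element is either smaller than 4 or different from 1.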
Functions (spells…) in R
- Functions are bits of code which receive something as input (termed: arguments), and produce something as output (termed: return value).
- A function can be recognized by the round brackets "()" following the function name.
- For example, the argument of the "mean" function is a vector of numbers; the return value is their average.
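Writing your own spell looks much the same as using an existing one. The sketch below defines sem(), a hypothetical helper name chosen for this example (it is not part of base R), which returns the standard error of the mean.

```r
# sem() is a hypothetical helper, not part of base R:
# standard error of the mean = standard deviation / sqrt(sample size)
sem <- function(x) {
  sd(x) / sqrt(length(x))
}

n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)
print(mean(n))   # built-in function: one argument in, one value out
print(sem(n))    # our own function is called in exactly the same way
```

The value of the last expression inside the curly brackets is the return value of the function.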
Basic visualization of numbers
• barplot(n)
• plot(n)
• hist(n)
• boxplot(n)
• pie(n)
barplot(n,col="red")
[Figure: red bar plot of the values in n]
plot(n,col="red")
[Figure: values of n plotted against their index]
hist(n,col="red")
[Figure: histogram of n; y axis Frequency]
boxplot(n,col="red")
[Figure: box plot of n]
pie(n[1:3])
[Figure: pie chart of the first three elements of n, slices labeled 1, 2 and 3]
Help in R
Type ? + function_name, e.g. ?barplot
Help pages contain the following components:
- function_name(package): if the package is not installed, this is the time to install it and load it (using "library")
- Description: brief overview
- Usage
- Arguments: description of the arguments (input)
- Details: more information
- Value: value returned by the function (output)
- See also: great way to learn new stuff you didn't even know you wanted to do!
- Examples: can be copy-pasted as is! Highly informative!
Other vectors
Character vectors:
nms=c("miriam","schragi","chaim","jochanan","ephraim","avraham","yemima","shakked","ayala","adi")
names(n)=nms #giving a name to each value in numeric vector n
n["shakked"]
Class Exercise: Redraw some of the previous plots with modified n!
The paste() function
Concatenates several character strings into a single string, separated by the value of the sep argument (default: sep=" ")
paste("To","err","is human.","To R is","divine!",sep="_")
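A few more paste() calls may make the sep argument clearer. This sketch also uses the collapse argument, which the slides do not mention; it joins the elements of one vector into a single string.

```r
# sep is placed between the arguments of a single call
print(paste("To", "R", "is", "divine!", sep = "_"))    # "To_R_is_divine!"

# paste() is vectorized: a vector argument yields a vector of results
chrs <- paste("chr", 1:3, sep = "")
print(chrs)                                            # "chr1" "chr2" "chr3"

# collapse joins the elements of a vector into one string
print(paste(chrs, collapse = ", "))                    # "chr1, chr2, chr3"
```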
Factor vectors (We love factors!)
f=as.factor(c("stupid","stupid","smart","stupid","imbecile","smart","smart","imbecile"))
levels(f) #possible values an element of f can have
summary(f) #provides the number of times each level occurs
Class Exercise: Compare summary(n), summary(b), and summary(f) – note difference in output!
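To see what makes a factor different from a character vector, it can help to look behind the labels. The following sketch uses the same f as above; table() and as.integer() are not in the slides and are shown here as assumed, commonly used companions.

```r
f <- as.factor(c("stupid", "stupid", "smart", "stupid",
                 "imbecile", "smart", "smart", "imbecile"))

print(levels(f))       # levels are sorted alphabetically: "imbecile" "smart" "stupid"
print(table(f))        # counts per level: imbecile 2, smart 3, stupid 3
print(as.integer(f))   # the integer codes stored behind the labels
```

Each element is stored as the position of its label in levels(f), which is why functions such as summary() can count the groups directly.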
The data.frame Class (We also love data.frames!)
• A data.frame is simply a table
• Each column may be of a different class (i.e. one column may be numeric, another may be a character, a third may be boolean and a fourth may be a factor)
• All values in a given column must be of the same class
• The number of rows in each column must be identical.
age  gender  disease
50   M       TRUE
43   M       FALSE
25   F       TRUE
18   M       TRUE
72   F       FALSE
65   M       FALSE
45   F       TRUE
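The table above can be typed directly into R. This is a minimal sketch; the variable name patients is an assumption made for this example only.

```r
# Column-by-column construction; every column has its own class
patients <- data.frame(
  age     = c(50, 43, 25, 18, 72, 65, 45),
  gender  = factor(c("M", "M", "F", "M", "F", "M", "F")),
  disease = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

print(dim(patients))            # 7 rows, 3 columns
print(sapply(patients, class))  # "numeric" "factor" "logical"
print(summary(patients))        # each column is summarized according to its class
```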
Iris database
[Figure: iris flower with the petal and the sepal labeled]
The iris dataset
The fascinating questions
• What are typical lengths and widths of sepals and petals?
• Do these change from one species of iris to another?
• Do longer petals tend to be wider?
• Do longer petals tend to correlate with longer (or wider) sepals?
• Do such correlations change from one species of iris to another?
Playing with data frames - I
1. Set the work directory to the directory you're working in:
setwd("F:/presentations/R presentation")
(Note: getwd() tells you which directory you're in)
2. Load the table you want to work with (make sure you saved it as a tab-delimited file!):
ir=read.table(file="iris_dataset.txt",sep="\t",header=T) #loads iris_dataset.txt into variable "ir". Assumes that the file is tab delimited, and that the first line is a header.
Playing with data frames II
class(ir) #shows the class of ir
dim(ir) #returns the number of rows and columns in ir
ir[1,2] #first line, second column in ir
ir[1,] #all columns in first line in ir
ir[,1] #all rows in first column of ir
ir$seplen #same as above
ir[,"seplen"] #same as above
ir[,c("seplen","sepwid")] #first two columns of ir
ir[,1:2] #same as above
summary(ir) #each of the columns is summarized according to its class
Playing with data frames - III
ir$seplen>6 #returns a boolean vector with TRUE and FALSE values depending on whether seplen is greater than 6
ir[ir$seplen>6,] #returns a subset of ir containing all columns of all rows in which seplen is greater than 6
ir[ir$seplen>6,c("seplen","sepwid")] #returns same rows as above, but only "seplen" and "sepwid" columns
ir[ir$seplen>6 & ir$sepwid >3,c("seplen","sepwid")] #returns same columns as above, but only rows in which seplen is greater than 6 and sepwid is greater than 3
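If iris_dataset.txt is not at hand, the same subsetting can be tried on the iris data that ships with R. The sketch below assumes the file has the same five columns in the same order, and renames the built-in columns to the names used in the slides; subset() is not in the slides and is shown as an equivalent spelling.

```r
# Stand-in for the file used in the slides: R's built-in iris data,
# with columns renamed to the names the slides use (an assumption)
ir <- iris
names(ir) <- c("seplen", "sepwid", "petlen", "petwid", "species")

long <- ir[ir$seplen > 6, ]                              # filter rows only
long_wide <- ir[ir$seplen > 6 & ir$sepwid > 3,
                c("seplen", "sepwid")]                   # filter rows and pick columns

print(nrow(long))
print(head(long_wide))

# subset() is an equivalent base R spelling of the same filter
same <- subset(ir, seplen > 6 & sepwid > 3, select = c(seplen, sepwid))
print(all.equal(long_wide, same))
```

The comma inside the square brackets matters: what comes before it selects rows, what comes after it selects columns.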
Visualization
hist(ir$seplen) #histogram of seplen
[Figure: histogram titled "Histogram of ir$seplen"; x axis ir$seplen, y axis Frequency]
Visualization - II
hist(ir$seplen,30) #histogram of seplen, asking for about 30 bins
[Figure: histogram titled "Histogram of ir$seplen" with narrower bins; x axis ir$seplen, y axis Frequency]
Visualization - III
mean_seplen=mean(ir$seplen)
hist(ir$seplen,20,col="light blue",main="Distribution of sepal lengths",xlab="Sepal length (cm)",sub=paste("Mean sepal length is",mean_seplen))
[Figure: light blue histogram titled "Distribution of sepal lengths"; x axis "Sepal length (cm)", y axis Frequency, subtitle "Mean sepal length is 5.84333333333333"]
The tapply() function
Suppose you want to obtain the average age of patients (a numeric variable) as a function of their gender (a factor variable). And suppose the data is stored in the data frame data. The magic spell is:
tapply(data$age,data$gender,mean)
The tapply function receives three parameters:
- A numeric vector
- A factor variable, dividing the numeric vector into groups
- A function (mean,min,max,sd,sum)
age  gender  disease
50   M       TRUE
43   M       FALSE
25   F       TRUE
18   M       TRUE
72   F       FALSE
65   M       FALSE
45   F       TRUE
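Applied to the small patient table, the spell can be checked by hand. A minimal sketch, with the table typed in as the data frame data:

```r
data <- data.frame(
  age    = c(50, 43, 25, 18, 72, 65, 45),
  gender = factor(c("M", "M", "F", "M", "F", "M", "F"))
)

mean_age <- tapply(data$age, data$gender, mean)
print(mean_age)   # F: 47.33 (mean of 25, 72, 45), M: 44 (mean of 50, 43, 18, 65)

# Any function that turns a vector into one number can be used
print(tapply(data$age, data$gender, max))   # F: 72, M: 65
```

The result is a named vector with one entry per level of the factor, which is why it can be handed straight to barplot().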
Visualization - IV
mean_per_species=tapply(ir$seplen,ir$species,mean) #calculates the mean value of ir$seplen after dividing it into three groups based on ir$species
barplot(mean_per_species,col="red")
[Figure: red bar plot of mean sepal length for setosa, versicolor and virginica]
Adding packages
Select mirror
Select library
Class exercise
Install the following three libraries:
gplots, lattice,car
These libraries will be used in subsequent examples.
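The same installation can also be done from the R prompt instead of the menus. This is a sketch under the assumption that a CRAN mirror is reachable; the installation itself is only attempted in an interactive session.

```r
needed <- c("gplots", "lattice", "car")

# Which of the three are not yet installed on this machine?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
print(missing_pkgs)

# Installation needs an internet connection, so it is only attempted
# in an interactive session
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Installing is done once per machine; library(package_name) is then needed once per R session.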
Visualization - V
sd_per_species=tapply(ir$seplen,ir$species,sd) #calculate standard deviation
library(gplots) #loads all functions in gplots into workspace (including the barplot2 function)
barplot2(mean_per_species, plot.ci = T, ci.l = mean_per_species-sd_per_species, ci.u = mean_per_species+sd_per_species,col="red",ylab="Mean sepal lengths")
[Figure: red bar plot of mean sepal length for setosa, versicolor and virginica, with error bars of one standard deviation; y axis "Mean sepal lengths"]
Visualization - VI
library(gplots)
plotmeans(ir$seplen~ir$species,xlab="species",ylab="Sepal length")
[Figure: mean sepal length with error bars for setosa, versicolor and virginica (n=50 each); x axis species, y axis Sepal length]
Looking at correlations
plot(ir$petlen,ir$petwid) #plotting one set of numbers as a function of another
[Figure: scatter plot of ir$petwid against ir$petlen]
Arguments of the plot function
Some parameters of plot() function (get more by typing "? plot.default"):
x – x values (defaults to 1:number of points)
y – y values
type – plot type: can be either "l" (line), "p" (points) or more
pch – type of bullets (e.g. values from 19-25)
col – color (either numbers or names of colors); can receive multiple colors
lwd – line width
lty – line type
xlab,ylab – X and Y labels
main, sub – main title (top of chart) and subtitle (beneath the X label)
More sophisticated plotting
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width") #assumes ir$species is a factor
[Figure: scatter plot of petal width against petal length, points colored by species]
And more sophisticated plot, with legend and P values
stat=cor.test(ir$petlen,ir$petwid)
rval=stat$estimate
pval=stat$p.value
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width",main=paste("R=",rval," ; P=",pval,sep=""))
legend(x="topleft",legend=levels(ir$species),col=1:3,lty=1,lwd=2) #adding a legend
[Figure: the same scatter plot with the title "R=0.962865431402796 ; P=0" and a legend listing setosa, versicolor and virginica]
Plotting correlations as a function of a third factor variable
library("lattice")
xyplot(ir$seplen ~ ir$sepwid | ir$species)
[Figure: three panels (setosa, versicolor, virginica), each plotting ir$seplen against ir$sepwid]
Looking at everything as a function of everything else
pairs(ir[,1:4])
pairs(ir[,1:4],col=ir$species,upper.panel=NULL)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, lower panels only, colored by species]
Even more sophisticated…
library(car)
scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=NULL, smooth=F) #recent versions of car call this function scatterplotMatrix()
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width with a 95% ellipse per species and a legend listing setosa, versicolor and virginica]
And more (for the highly motivated or extremely bored…)
upperpanel.cor <- function(x, y,method="pearson",digits=2,...) {
  points(x,y,type="n") #set up the panel without drawing the points
  usr <- par("usr"); on.exit(par(usr=usr)) #restore the original coordinates on exit
  par(usr = c(0, 1, 0, 1))
  correl <- cor.test(x, y,method=method)
  r=correl$estimate
  pval=correl$p.value
  color="black"
  if (pval<0.05) color="blue" #significant correlations are printed in blue
  txt <- format(r,digits=digits)
  pval <- format(pval,digits=digits)
  txt <- paste("r=", txt, "\npval=",pval,sep="")
  text(0.5, 0.5, txt,col=color)
}
scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=upperpanel.cor,cex=0.3,smooth=F,main="This is cool!!!")
Final output
[Figure: scatterplot matrix titled "This is cool!!!" with a legend listing setosa, versicolor and virginica; the upper panels print the following correlations]
Sepal.Length vs Sepal.Width: r=-0.12, pval=0.15
Sepal.Length vs Petal.Length: r=0.87, pval=0
Sepal.Length vs Petal.Width: r=0.82, pval=0
Sepal.Width vs Petal.Length: r=-0.43, pval=4.5e-08
Sepal.Width vs Petal.Width: r=-0.37, pval=4.1e-06
Petal.Length vs Petal.Width: r=0.96, pval=0
Saving Graphics to Files
• Before running the visualizing function, redirect all plots to a file of a certain type. Possibilities:
– jpeg(filename)
– png(filename)
– pdf(filename)
– postscript(filename)
• After running the visualization function, close the graphics device using dev.off()
Saving graphics
Example:
pdf("F:/test.pdf")
barplot(1:10,col="red")
dev.off()
Note: Different graphic functions can also receive arguments regarding width and height of canvas. Use "?" + function name (e.g. ?jpeg) to obtain arguments.
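The width and height arguments mentioned in the note can be tried as follows. A minimal sketch: the output path inside tempdir() is an assumption made so that the example runs on any machine.

```r
out <- file.path(tempdir(), "test.pdf")   # hypothetical output location

pdf(out, width = 6, height = 4)   # for pdf(), width and height are given in inches
barplot(1:10, col = "red")
dev.off()                         # the file is only complete once the device is closed

print(file.exists(out))
```

For jpeg() and png() the default unit of width and height is pixels rather than inches, so it is worth checking the help page of the device you use.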
Statistics
• t.test #Student t test
• wilcox.test #Mann-Whitney test
• kruskal.test #Kruskal-Wallis rank sum test
• chisq.test #chi squared test
• cor.test #pearson/spearman correlations
• lm(),glm() #linear and generalized linear models
• p.adjust #adjustment of P values for multiple testing using FDR, bonferroni, or whatnot…
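A few of these functions in action, as a small sketch on the iris data that ships with R (with its built-in column names). The three P values passed to p.adjust() are invented for illustration.

```r
# Two groups taken from R's built-in iris data
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

tt <- t.test(setosa, versicolor)   # Welch two-sample t test by default
print(tt$p.value)

ct <- cor.test(iris$Petal.Length, iris$Petal.Width)   # Pearson by default
print(ct$estimate)

# Correcting three hypothetical P values for multiple testing
p <- c(0.01, 0.02, 0.03)
print(p.adjust(p, method = "bonferroni"))   # 0.03 0.06 0.09
print(p.adjust(p, method = "fdr"))          # 0.03 0.03 0.03
```

Every test returns a list-like object; the pieces of interest (p.value, estimate, statistic) are pulled out with the $ sign.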
Down Syndrome
The fascinating research question
Do genes from any particular chromosome alter their expression levels in Down syndrome?
GEO database: A paradise of numbers
Getting the data!
Loading the data (look at it first!)
setwd("F:/Presentations/R presentation/") #sets the work directory
a=read.table(file="GSE5390_series_matrix.txt",sep="\t",header=T,comment.char="!") #loads the gene expression values and stores them in a
names(a)=c("id","down1","down2","down3","down4","down5","down6","down7","healthy1","healthy2","healthy3","healthy4","healthy5","healthy6","healthy7","healthy8") #give informative names to columns in a
The merge() function
a=
name       age  gender  disease
inky       50   M       TRUE
dinky      43   M       FALSE
Jose       25   F       TRUE
Oleg       18   M       TRUE
Jesus      72   F       FALSE
Cleopatra  65   M       FALSE
Stanislav  45   F       TRUE
b=
name       IQ
dinky      43
Cleopatra  750
Stanislav  565
merge(a,b,by="name") OR merge(a,b,by.x="name",by.y="name")
name       age  gender  disease  IQ
dinky      43   M       FALSE    43
Cleopatra  65   M       FALSE    750
Stanislav  45   F       TRUE     565
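The two small tables can be typed in and merged as a check. This sketch also shows the all.x argument, which is not in the slides: it keeps the rows of a that have no partner in b.

```r
a <- data.frame(
  name    = c("inky", "dinky", "Jose", "Oleg", "Jesus", "Cleopatra", "Stanislav"),
  age     = c(50, 43, 25, 18, 72, 65, 45),
  gender  = c("M", "M", "F", "M", "F", "M", "F"),
  disease = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)
b <- data.frame(
  name = c("dinky", "Cleopatra", "Stanislav"),
  IQ   = c(43, 750, 565)
)

inner <- merge(a, b, by = "name")               # only names found in both tables
print(inner)

outer <- merge(a, b, by = "name", all.x = TRUE) # keep every row of a; missing IQ becomes NA
print(outer)
```

By default merge() sorts the result by the key column, so the row order may differ from that of the input tables.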
Merging data
convert=read.table(file="convert_affyprobes_2_chromosome_location_from_UCSC.txt",sep="\t",header=T)
b=merge(a,convert,by="id") #merges a and convert by the columns indicated by the by arguments. In other words, the column "id" in "a" is compared to the column "id" in "convert". Only lines in which the two values are identical are retained, yielding a new data frame with shared values & shared information.
Assign informative names
downcols=2:8
healthycols=9:16
allarraycols=c(downcols,healthycols)
Calculate Fold Change between disease and healthy
Step 1: calculate mean expression values for all patients with Down syndrome
b$meandown=apply(b[,downcols],1,mean)
Step 2: calculate mean expression values for all healthy subjects
b$meanhealthy=apply(b[,healthycols],1,mean)
Step 3: Calculate difference between the two (since data is log transformed)
b$dif=b$meandown-b$meanhealthy
Step 4: anti-log the fold change
b$foldchange=2^b$dif
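The four steps can be followed on a toy table small enough to verify by hand. The numbers below are invented, and the sketch assumes (as the slides do) that the expression values are log2 transformed.

```r
# Toy log2 expression values for two genes (invented numbers)
toy <- data.frame(
  id    = c("geneA", "geneB"),
  down1 = c(9, 5), down2 = c(10, 5), down3 = c(8, 5),
  h1    = c(8, 6), h2    = c(7, 6),  h3    = c(9, 6)
)
downcols    <- 2:4
healthycols <- 5:7

toy$meandown    <- apply(toy[, downcols], 1, mean)
toy$meanhealthy <- apply(toy[, healthycols], 1, mean)
toy$dif         <- toy$meandown - toy$meanhealthy   # difference of logs = log of the ratio
toy$foldchange  <- 2^toy$dif

print(toy[, c("id", "dif", "foldchange")])
# geneA: dif = 1, fold change = 2; geneB: dif = -1, fold change = 0.5
```

A fold change above 1 means higher expression in the first group, and a fold change below 1 means lower expression.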
Calculate P values
Step 1: Create a function which receives a line as input, and knows how to break it up into disease and control groups and yield a p value
GetPval=function(line) {
  ttest=t.test(line[downcols-1],line[healthycols-1]) #"-1" because the line is passed without the id column
  ttest$p.value
}
Step 2: Apply this function to all rows of the data frame
b$pval=apply(b[,allarraycols],1,GetPval)
Step 3: Adjust P values for multiple testing
b$adjustedPval=p.adjust(b$pval,method="fdr")
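To see why the column positions are shifted by one inside GetPval, the three steps can be run on simulated data. This is only a sketch: the table toy, its 20 genes and the random values are invented, and only the column layout (one id column followed by 7 + 8 arrays) follows the slides.

```r
set.seed(1)   # invented data, fixed seed for reproducibility

downcols     <- 2:8
healthycols  <- 9:16
allarraycols <- c(downcols, healthycols)

# 20 hypothetical genes: an id column followed by 15 array columns
toy <- data.frame(id = paste("gene", 1:20, sep = ""),
                  matrix(rnorm(20 * 15, mean = 8), nrow = 20))
toy[1, downcols] <- toy[1, downcols] + 3   # gene1 is shifted upwards in the first group

# apply() hands each row over without the id column, so every
# column position moves down by one: hence the "-1"
GetPval <- function(line) {
  ttest <- t.test(line[downcols - 1], line[healthycols - 1])
  ttest$p.value
}

toy$pval         <- apply(toy[, allarraycols], 1, GetPval)
toy$adjustedPval <- p.adjust(toy$pval, method = "fdr")

print(head(toy[order(toy$pval), c("id", "pval", "adjustedPval")]))
```

The adjusted P values are never smaller than the raw ones, which is the price paid for testing many genes at once.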
Saving data frames to a file
write.table(b,file="DownWithPvals.txt",sep="\t",row.names=F,col.names=T) #generates a tab-delimited file with column names, without row names containing the data in the data frame b
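A quick way to check what was written is to read the file straight back. A minimal sketch with a two-row stand-in for b and a path inside tempdir(), both assumptions made for this example.

```r
b <- data.frame(id = c("p1", "p2"), foldchange = c(2, 0.5))   # toy stand-in for b

out <- file.path(tempdir(), "DownWithPvals.txt")
write.table(b, file = out, sep = "\t", row.names = FALSE, col.names = TRUE)

back <- read.table(file = out, sep = "\t", header = TRUE)
print(back)
```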
Finding significant events
sigs=b[b$foldchange>1.75 & b$adjustedPval<0.01,] #finding events with significant fold change and significant P values
sigs=sigs[order(sigs$adjustedPval,decreasing=T),] #sorting table based on P values
Finding and plotting % significantly over/under expressed genes per chromosome
percentages=summary(sigs$chr)*100/summary(b$chr) #divides the number of times each chromosome appears in "sigs" by the number of times it appears in the original data
barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="To R is divine!") #barplot depicting the percentage of genes from each chromosome within sigs
[Figure: light blue bar plot of % significant genes per chromosome; the x axis includes the assembled chromosomes as well as _random and haplotype sequences such as chr1_random and chr6_cox_hap1]
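The percentage calculation relies on the chr column being a factor, so that summary() counts how often each chromosome occurs. Since R 4.0.0, read.table() no longer turns text columns into factors by default, so the following sketch converts explicitly and counts with table(). The toy vectors chr and sig are invented for illustration.

```r
# Toy stand-in: chromosome of every gene, and whether the gene is significant
chr <- c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2", "chr21", "chr21")
sig <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

chr_f <- factor(chr)               # explicit conversion to a factor
all_counts <- table(chr_f)         # genes per chromosome
sig_counts <- table(chr_f[sig])    # subsetting a factor keeps all levels, so zeros are counted too

percentages <- sig_counts * 100 / all_counts
print(percentages)   # chr1: 25, chr2: 0, chr21: 100
```

Because both counts share the same levels in the same order, the division lines up chromosome by chromosome.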
Even better plot…
validchrs=c(paste("chr",1:22,sep=""),"chrX","chrY")
percentages=percentages[validchrs]
barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="R - for a better tomoRRow!")
Results…
[Figure: light blue bar plot of % significant genes for chr1 to chr22, chrX and chrY]
Volcano plots: P values as a function of fold change
plot(log2(b$foldchange),-log2(b$pval),col=(b$chr=="chr21")+1,pch=19,xlab="log fold-change",ylab="-log P value")
legend(x="topleft",legend=c("non chr 21","chr 21"),lty=1,col=1:2,lwd=3)
abline(h=-log2(0.001),col="blue",lty=3)
abline(v=c(log2(1.75),-log2(1.75)),col="blue",lty=3)
text(2,17,"Significantly\nOver-represented",col="blue")
text(-1.4,17,"Significantly\nUnder-represented",col="blue")
abline() function: adds horizontal or vertical line/s (and more sophisticated lines as well), depending on whether the "h" or "v" arguments are populated
text() function: receives x,y coordinates on the plot, as well as the text to plot
Volcano plot
A particular R strength: genetics
• Bioconductor is a suite of additional functions and some 200 packages dedicated to analysis, visualization, and management of genetic data
• Much more functionality than software released by Affy or Illumina
Where R?
Choose server…
Click on “Windows”
Click “base”
Click on “Download” link and follow installation guidelines…
There you R!
Installing Tinn-R
• Go to: http://www.sciviews.org/Tinn-R/
• Scroll to bottom of page
Loading R from within Tinn-R
Final Tips
• Use http://www.rseek.org/ & google for finding help on what you want
• Know your objects’ classes: class(x)
• Know your functions’ arguments. Use "?function_name" to learn what arguments a function receives & what its return values are.
• Each help file provides examples, which can be copy-pasted into R as is. Extremely useful!
• MOST IMPORTANT - the more time you spend using R, the more comfortable you become with it. DESPAIR NOT – and you will never look back!
Final Words of Warning
• “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.” --Francois Pinard
R
Thank you!
May the R be with you!
Quick hands-on
• Generate a numeric vector called a containing the numbers 1,3,4,5,9.
• Calculate the square root (sqrt) of the values in a.
• Create a barplot displaying a
• Show a as a regular plot, showing the values in red.
• Label the x-axis of the plot "R is gReat", and the y-axis "I love R".
Hands-On - II
Based on the down syndrome microarrays:
- Find the 10 genes showing the highest differences between healthy and sick.
- Create bar plots showing the average values in sick, and in healthy for those ten genes.
- For true geeks: Add error bars to the graph.
Todo
• multiple panels
• lists, loops, lapply, sapply
• regular expressions