STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology
Lab 4: R and Bioconductor II
Feb 15, 2012
Alejandro Quiroz and Daniel Fernandez
• www.bioconductor.org
– Provides tools for the analysis of high-throughput genomic data
  • Software, data, documentation
  • Training materials
  • Mailing list
– Based on R
  • Open, so we can conduct our own analysis
What can Bioconductor do?
Outline
• Installation
• Packages
• Microarray data analysis
– Affymetrix files
• Low level analysis
• High level analysis
Installation
• There are two types of installation
– Core packages
> source("http://bioconductor.org/biocLite.R")
> biocLite()
– Other packages
> source("http://bioconductor.org/biocLite.R")
> biocLite(c("pkg1", "pkg2", …, "pkgN"))
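On current R releases the biocLite.R script has been retired, and Bioconductor packages are installed through the BiocManager package from CRAN. A minimal sketch of the equivalent step follows; the helper name install_bioc is hypothetical and only wraps the two calls.

```r
# Hypothetical helper: install Bioconductor packages through BiocManager.
install_bioc <- function(pkgs) {
  # BiocManager itself comes from CRAN, so install it first if it is missing.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # pkgs is a character vector of package names, as with biocLite().
  BiocManager::install(pkgs)
}

# Usage (needs network access), e.g. for the packages used in this lab:
# install_bioc(c("affy", "affyPLM"))
```

The call is left commented out because it downloads packages; run it once per R installation.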
Bioconductor Packages
• View the installed packages:
– rownames(installed.packages())
• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings, multtest
• Annotation: annotate, AnnBuilder, data packages.
• Graphics: geneplotter, hexbin.
• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
• Differential gene expression: edd, genefilter, limma, ROC, siggenes, EBarrays, factDesign
• Graphs and networks: graph, RBGL, Rgraphviz.
• Other data: SAGElyzer, DNAcopy, PROcess, aCGH
Microarray data analysis
Affymetrix data
• Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.
• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.
Affymetrix data
• Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
• Mismatch (MM): same as PM but with a single homomeric base change at the middle (13th) base (transversion purine <-> pyrimidine: G <-> C, A <-> T).
– The purpose of the MM probe design is to measure non-specific binding and background noise.
• Probe-pair: a (PM, MM) pair.
• Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.
• Affy ID: an identifier for a probe-pair set.
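The PM/MM relationship can be sketched in base R: the MM probe is the PM 25-mer with its 13th base swapped for its complement. The function name make_mm and the example sequence are hypothetical, chosen only for illustration.

```r
# Build the mismatch (MM) probe for a perfect-match (PM) 25-mer by replacing
# the middle (13th) base with its complement (G <-> C, A <-> T).
make_mm <- function(pm) {
  stopifnot(is.character(pm), length(pm) == 1, nchar(pm) == 25)
  complement <- c(A = "T", T = "A", G = "C", C = "G")
  bases <- strsplit(toupper(pm), "")[[1]]
  stopifnot(all(bases %in% names(complement)))
  bases[13] <- complement[[bases[13]]]
  paste(bases, collapse = "")
}

pm <- "ACGTACGTACGTACGTACGTACGTA"  # hypothetical PM sequence, not a real probe
mm <- make_mm(pm)

# The two probes differ at exactly one position, the 13th base.
which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
```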
Affy Microarray data
• DAT file
– Raw (TIFF) optical image of the hybridized chip
• CEL file
– Cell intensity file; stores the results of the intensity calculations on the pixel values of the DAT file
• CDF (Chip Description File)
– Provided by Affy; describes the probe array design, characteristics, probe utilization and content, and scanning and analysis parameters. These files are unique for each probe array type.
Affymetrix Data Flow
[Flow diagram: hybridized GeneChip → scan chip → DAT file → process image → CEL file; CEL file + CDF file → MAS4 / MAS5 / RMA / quantile → high-level analysis]
Microarray analysis
• Download the data set from GEO:
– GSE10940
• The R script has to be in the same folder as the .CEL files
• The data set contains 12 .CEL files
– library(affy)
– data.affy=ReadAffy()
• What is the name of the CDF file?
• How many genes are considered on the arrays?
• What is the annotation version?
The data set
• Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.
• Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.
• Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.
• Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.
Looking at RAW data
Low-level analysis
• MA plot
– MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
• Image of an array
– image(data.affy)
• Density of the log intensities of the arrays
– hist(data.affy)
• Boxplot of the data
– boxplot(data.affy, col=seq(2,7,by=1))
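To make the MA plot concrete: for a pair of arrays, M is the log2 ratio of the two intensities and A is their average log2 intensity. The sketch below computes both on simulated intensities (the data are invented for illustration and are not from GSE10940) and is not the code inside MAplot().

```r
set.seed(1)
# Hypothetical raw intensities for 1000 probes on two arrays.
x1 <- rlnorm(1000, meanlog = 7, sdlog = 1)
x2 <- x1 * rlnorm(1000, meanlog = 0.1, sdlog = 0.2)

M <- log2(x1) - log2(x2)        # log2 ratio between the two arrays
A <- (log2(x1) + log2(x2)) / 2  # average log2 intensity

# Draw the plot only in an interactive session; smoothScatter() is in the
# graphics package and copes well with dense point clouds.
if (interactive()) {
  smoothScatter(A, M, xlab = "A", ylab = "M")
  abline(h = 0, col = "red")
}
```

Points scattered around M = 0 across the whole range of A suggest the two arrays are comparable; a trend in M against A is what normalization is meant to remove.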
Normalization
– data.rma=rma(data.affy)
• Install and load the package affyPLM (along with its dependencies) to view the MA plot after normalization
– library(affyPLM)
– MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
– expr.rma=exprs(data.rma) # Puts data in a table
– boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
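rma() combines background correction, quantile normalization, and median-polish summarization. The quantile normalization step can be sketched in base R as below; this is an illustrative version on simulated data, not the implementation rma() uses, and the function name quantile_normalize is hypothetical.

```r
# Quantile normalization: give every column (array) the same empirical
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)  # reference distribution shared by all arrays
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(1)
# Hypothetical log2 intensities: 200 probes x 4 arrays, each array shifted.
raw <- matrix(rnorm(800, mean = 8), ncol = 4) +
  rep(c(0, 0.5, 1, 1.5), each = 200)
norm <- quantile_normalize(raw)

# After normalization the per-array medians coincide.
apply(raw, 2, median)
apply(norm, 2, median)
```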
Before moving forward… affy probeset names
• rownames(expr.rma)[1:100]
• Suffixes are meaningful, for example:
– _at: hybridizes to unique antisense transcript for this chip
– _s_at: all probes cross hybridize to a specified set of sequences
– _a_at: all probes cross hybridize to a specified gene family
– _x_at: at least some probes cross hybridize with other target sequences for this chip
– _r_at: rules dropped
– and many more…
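One way to see which suffixes occur on a chip is to extract them from the probe set names with a regular expression. The sketch below is a simplified illustration: the function name probeset_suffix and the example ids are hypothetical, and it only recognizes a single-letter tag before the final _at.

```r
# Extract the suffix (_at, _s_at, _a_at, _x_at, ...) from probe set ids.
# Ids that do not end in "_at" are returned as NA.
probeset_suffix <- function(ids) {
  pattern <- "^.*?((_[a-z])?_at)$"
  ifelse(grepl(pattern, ids, perl = TRUE),
         sub(pattern, "\\1", ids, perl = TRUE),
         NA_character_)
}

# Illustrative ids; in the lab, use rownames(expr.rma) instead.
ids <- c("1007_s_at", "1053_at", "117_x_at", "1234_a_at", "5678_at")
suffix <- probeset_suffix(ids)

# Count how many probe sets carry each suffix.
table(suffix)
```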
Custom CDF files
• The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip.
• However, its selection of probes relied on earlier genome and transcriptome annotation which is significantly different from current knowledge.
• The resultant informatics problems have a profound impact on the analysis and interpretation of the data.
Custom CDF files
• One solution: Dai, M. et al. (2005)
• They reorganized probes on more than a dozen popular GeneChips
• Comparing analysis results between the original and the redefined probe sets
– Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.
Custom CDF files
• Go to:
– http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
– Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
– Install and load it in R
  • R CMD INSTALL…
– data.affy@cdfName="drosophila2dmrefseqcdf"
– data.rma.refseq=rma(data.affy)
– expr.rma.refseq=exprs(data.rma.refseq)
High-level analysis
• Perform a comparison between the control group and the experimental group
– Objective: Obtain the most significant genes with an FDR of 5% and with a fold change of 1
– Information provided in "SamplePhenotype.csv" to obtain the control and mutant ids
    sample.ids=read.csv("SamplePhenotype.csv",header=F)
    control=grep("Control",sample.ids[,2])
    mutants=grep("Logjam",sample.ids[,2])
– Obtain just the RefSeq ids (sub() is vectorized, so no apply() is needed; "$" anchors the match to the end of the name)
    genes.refseq=sub("_at$","",rownames(expr.rma.refseq))
• Calculating the fold change for every gene (RMA values are on the log2 scale, so this is a log2 fold change; the RefSeq matrix is used so the rows match genes.refseq)
– foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )
• Perform a t-test and obtain the p-values
– T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=T )$p.value )
• Calculating the FDR
– fdr=p.adjust(T.p.value, method="fdr")
• THE GENES
– genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
– genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]
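The same pipeline can be tried end to end on simulated data before running it on the real arrays. The sketch below assumes a made-up log2 expression matrix with 6 controls and 6 mutants and 25 spiked-in up-regulated genes; none of these numbers come from GSE10940.

```r
set.seed(42)
n.genes <- 500
# Hypothetical log2 expression matrix: columns 1-6 controls, 7-12 mutants.
expr <- matrix(rnorm(n.genes * 12, mean = 8, sd = 0.3), nrow = n.genes,
               dimnames = list(paste0("gene", seq_len(n.genes)), NULL))
control <- 1:6
mutants <- 7:12
# Spike in 25 up-regulated genes (log2 fold change of 2).
expr[1:25, mutants] <- expr[1:25, mutants] + 2

# rowMeans() gives the same log2 fold change as the per-row apply().
foldchange <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])
p.value <- apply(expr, 1, function(x)
  t.test(x[mutants], x[control], var.equal = TRUE)$p.value)
fdr <- p.adjust(p.value, method = "BH")  # "fdr" is an alias for "BH"

genes.up   <- rownames(expr)[fdr < 0.05 & foldchange > 0]
genes.down <- rownames(expr)[fdr < 0.05 & foldchange < 0]
length(genes.up)
length(genes.down)
```

Because the spiked-in genes are known, the simulated run shows whether the FDR cutoff recovers them and how many null genes slip through.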
Results
• Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1
• Provide a heatmap with the significant genes
– genes.ids=c(which( fdr < 0.05 & foldchange > 0 ), which( fdr < 0.05 & foldchange < 0 ))
– colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
– heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
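For the .csv deliverable, one possible layout is a table with the gene id, the log2 fold change, and the FDR, written with write.csv(). The sketch below uses an invented three-gene table; the gene ids, column names, and file name are placeholders, not results from the lab data.

```r
# Hypothetical results table; in the lab these columns would come from
# genes.refseq, foldchange, and fdr.
results <- data.frame(
  gene = c("NM_000001", "NM_000002", "NM_000003"),  # placeholder ids
  log2.foldchange = c(1.8, -0.9, 0.1),
  fdr = c(0.001, 0.03, 0.60),
  stringsAsFactors = FALSE
)

# Keep the genes passing the 5% FDR cutoff, most significant first.
significant <- results[results$fdr < 0.05, ]
significant <- significant[order(significant$fdr), ]

# Written to a temporary folder here; use your own path for the lab.
out.file <- file.path(tempdir(), "significant_genes.csv")
write.csv(significant, out.file, row.names = FALSE)
```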
Beyond the gene list paradigm
• http://david.abcc.ncifcrf.gov