
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Lab 4: R and Bioconductor II

Feb 15, 2012

Alejandro Quiroz and Daniel Fernandez

dfernan@gmail.com


• www.bioconductor.org
  – Provides tools for the analysis of high-throughput genomic data
    • Software, data, documentation
    • Training materials
    • Mailing list
  – Based on R
    • Open to conduct our own analysis


What can bioconductor do?


Outline

• Installation

• Packages

• Microarray data analysis
  – Affymetrix files

• Low level analysis

• High level analysis


Installation

• There exist two types of installation
  – Core packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite()
  – Other packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite(c("pkg1", "pkg2",…,"pkgN"))
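The biocLite.R script was the installer at the time of this lab; Bioconductor has since replaced it with the BiocManager package from CRAN. A minimal sketch of the equivalent calls, assuming a recent R version and network access (the helper name install_bioc is hypothetical and only used for illustration):

```r
# Hypothetical helper (not part of the original lab): installs Bioconductor
# packages through BiocManager, which superseded biocLite().
install_bioc <- function(pkgs = character()) {
  # Fetch BiocManager from CRAN only if it is not already available.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # With package names, installs those packages; with an empty vector,
  # BiocManager works on the base installation (checks for outdated packages).
  BiocManager::install(pkgs)
}

# Not run here because it needs network access:
# install_bioc()                      # base installation / updates
# install_bioc(c("affy", "affyPLM"))  # other packages
```

Wrapping the calls in a function keeps the sketch safe to source offline; nothing is downloaded until the helper is called.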


BioConductor Packages

• View the installed packages:
  – rownames(installed.packages())

• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings, multtest

• Annotation: annotate, AnnBuilder data packages.

• Graphics: geneplotter, hexbin.

• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma

• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality

• Differential gene expression: edd, genefilter, limma, ROC, siggenes, EBarrays, factDesign

• Graphs and networks: graph, RBGL, Rgraphviz.

• Other data: SAGElyzer, DNAcopy, PROcess, aCGH


Microarray data analysis


Affymetrix data

• Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.

• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.


Affymetrix data

• Perfect match (PM): A 25-mer complementary to a reference sequence of interest (e.g., part of a gene).

• Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
  – The purpose of the MM probe design is to measure non-specific binding and background noise.

• Probe-pair: a (PM,MM) pair.

• Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.

• Affy ID: an identifier for a probe-pair set.
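The PM/MM relationship can be illustrated with a short base-R sketch: the MM probe is the PM sequence with the 13th base swapped for its complement (G <-> C, A <-> T). The function name make_mm and the example sequence are made up for illustration; real probe sequences come from the chip annotation.

```r
# Hypothetical helper: derive the mismatch (MM) probe from a perfect-match
# (PM) 25-mer by complementing the middle (13th) base.
make_mm <- function(pm) {
  stopifnot(all(nchar(pm) == 25))
  mid <- substr(pm, 13, 13)
  # chartr maps A->T, C->G, G->C, T->A (upper-case sequences assumed).
  substr(pm, 13, 13) <- chartr("ACGT", "TGCA", mid)
  pm
}

pm <- "ACGTACGTACGTGACGTACGTACGT"  # made-up 25-mer, 13th base is G
mm <- make_mm(pm)
print(rbind(PM = pm, MM = mm))
```

Only position 13 differs between the two strings, which is why MM intensities are used as an estimate of non-specific binding for the matching PM probe.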


Affy Microarray data

• DAT file
  – Raw (TIFF) optical image of the hybridized chip

• CEL file
  – Cell intensity file; stores the results of the intensity calculations on the pixel values of the DAT file

• CDF (Chip Description File)
  – Provided by Affy; describes the probe array design, characteristics, probe utilization and content, and scanning and analysis parameters. These files are unique for each probe array type.


Affymetrix Data Flow

[Flow diagram: Hybridized GeneChip -> Scan Chip -> DAT file -> Process Image -> CEL file; CEL file + CDF file -> MAS4 / MAS5 / RMA / Quantile -> High-Level Analysis]


Microarray analysis

• Go to and download the data set:
  – GSE10940

• The R script has to be in the same directory as the .CEL files (ReadAffy() reads the CEL files in the working directory)

• The data set contains 12 .CEL files
  – library(affy)
  – data.affy=ReadAffy()

• What is the name of the CDF file?

• How many genes are considered on the arrays?

• What is the annotation version?


The data set

• Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.

• Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.

• Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.

• Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.


Looking at RAW data

Low-level analysis

• MA plot
  MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")

• Image of an array
  image(data.affy)

• Density of the log intensities of the arrays
  hist(data.affy)

• Boxplot of the data
  boxplot(data.affy, col=seq(2,7,by=1))
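What an MA plot displays can be sketched in base R: for two arrays, M is the log2 ratio of intensities and A is the average log2 intensity. The data below are simulated (not GSE10940) and the helper name ma_values is hypothetical; this is not the affy implementation of MAplot.

```r
# M (log ratio) and A (mean log intensity) for two arrays.
ma_values <- function(x, y) {
  lx <- log2(x)
  ly <- log2(y)
  list(M = lx - ly, A = (lx + ly) / 2)
}

# Simulated raw intensities for two arrays (made-up data).
set.seed(1)
array1 <- 2^rnorm(1000, mean = 8, sd = 2)
array2 <- array1 * 2^rnorm(1000, mean = 0.3, sd = 0.4)  # array2 brighter on average

ma <- ma_values(array1, array2)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(ma$A, ma$M, pch = ".", xlab = "A", ylab = "M")
  abline(h = 0, col = "red")
}
```

If the two arrays were comparable, the cloud of points would be centred on M = 0 across the range of A; a systematic offset like the one simulated here is what normalization is meant to remove.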


Normalization

data.rma=rma(data.affy)

• Install the package affyPLM (along with dependencies) to view the MA plot after normalization
  library(affyPLM)
  MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")

expr.rma=exprs(data.rma) # Puts data in a table
boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
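RMA combines background correction, quantile normalization and probe-set summarization. The quantile normalization step alone can be sketched in base R as below; this is an illustration on simulated values, not the code used inside rma(), and the function name quantile_normalize is hypothetical.

```r
# Quantile normalization: force every column (array) to share the same
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                          # each column sorted ascending
  ref    <- rowMeans(sorted)                           # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])        # map ranks to reference values
  dimnames(out) <- dimnames(x)
  out
}

# Three simulated arrays with different location and spread (made-up data).
set.seed(2)
raw <- cbind(a = rnorm(50, 7, 1), b = rnorm(50, 8, 1.5), c = rnorm(50, 9, 0.5))
norm <- quantile_normalize(raw)

# After normalization the per-array summaries coincide.
print(apply(norm, 2, summary))
```

This is why the boxplots of the arrays line up after normalization: every array ends up with the same set of values, only assigned to different probes.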


Before moving forward… affy probeset names

• rownames(expr.rma)[1:100]

• Suffixes are meaningful, for example:
  • _at: hybridizes to unique antisense transcript for this chip
  • _s_at: all probes cross hybridize to a specified set of sequences
  • _a_at: all probes cross hybridize to a specified gene family
  • _x_at: at least some probes cross hybridize with other target sequences for this chip
  • _r_at: rules dropped
  • and many more…
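To see which suffixes occur on a chip, the probeset names can be tabulated by suffix. A minimal base-R sketch follows, using made-up example IDs in place of rownames(expr.rma); the helper name probeset_suffix is hypothetical and it only recognizes suffixes of the form _at or _<letter>_at.

```r
# Extract the trailing "_at" or "_<letter>_at" suffix from probeset names;
# names that do not end in "_at" yield NA.
probeset_suffix <- function(ids) {
  pattern <- "^.*?((_[a-z])?_at)$"
  ifelse(grepl(pattern, ids, perl = TRUE),
         sub(pattern, "\\1", ids, perl = TRUE),
         NA_character_)
}

# Example IDs for illustration only.
ids <- c("1007_s_at", "1053_at", "1552_x_at", "not_a_probeset")
suffixes <- probeset_suffix(ids)
print(table(suffixes, useNA = "ifany"))
```

On real data, replacing ids with rownames(expr.rma) gives a count of probesets per suffix class.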


Custom CDF files

• The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip.

• However, its selection of probes relied on earlier genome and transcriptome annotation which is significantly different from current knowledge.

• The resultant informatics problems have a profound impact on analysis and interpretation of the data.


Custom CDF files

• One solution: Dai, M. et al. (2005)

• They reorganized probes on more than a dozen popular 30 GeneChips

• Comparing analysis results between the original and the redefined probe sets
  – Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.


Custom CDF files

• Go to:
  – http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
  – Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
  – Install and load it in R
    • R CMD INSTALL…
    data.affy@cdfName="drosophila2dmrefseqcdf"
    data.rma.refseq=rma(data.affy)
    expr.rma.refseq=exprs(data.rma.refseq)


High-level analysis

• Perform a comparison between the control group and the experimental group
  – Objective: Obtain the most significant genes with an FDR of 5% and with a fold change of 1
  – Information provided in "SamplePhenotype.csv" to obtain controls and mutant ids
    sample.ids=read.csv("SamplePhenotype.csv",header=F)
    control=grep("Control",sample.ids[,2])
    mutants=grep("Logjam",sample.ids[,2])
  – Obtain just the RefSeq ids
    genes_t=matrix(rownames(expr.rma.refseq))
    genes.refseq=apply(genes_t,1,function(x) sub("_at","",x))


• Calculating the fold change for every gene (difference of group means on the log scale; rows must match genes.refseq, so use expr.rma.refseq)
  – foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )

• Perform a t-test and obtain the p-values
  – T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=T )$p.value )

• Calculating the FDR
  – fdr=p.adjust(T.p.value, method="fdr")

• THE GENES
  – genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
  – genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]
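The same steps can be tried end to end on simulated data before running them on the real arrays. The sketch below assumes a log2-scale matrix with 6 control and 6 mutant columns and spikes in 10 up-regulated genes; all values and gene names are made up and only mimic the shape of the lab data.

```r
# Simulated log2 expression matrix: 200 genes x 12 samples (made-up data).
set.seed(115)
n.genes <- 200
expr <- matrix(rnorm(n.genes * 12, mean = 8, sd = 0.3), nrow = n.genes,
               dimnames = list(paste0("gene", 1:n.genes), NULL))
control <- 1:6
mutants <- 7:12
expr[1:10, mutants] <- expr[1:10, mutants] + 3  # spike in 10 up-regulated genes

# Log fold change: difference of group means on the log2 scale.
foldchange <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])

# Two-sample t-test per gene (equal variances, as in the lab code).
p.value <- apply(expr, 1, function(x)
  t.test(x[mutants], x[control], var.equal = TRUE)$p.value)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust).
fdr <- p.adjust(p.value, method = "fdr")

genes.up   <- rownames(expr)[fdr < 0.05 & foldchange > 0]
genes.down <- rownames(expr)[fdr < 0.05 & foldchange < 0]
print(genes.up)
```

Because the spiked genes are known, the sketch gives a quick check that the indexing of control and mutant columns and the direction of the fold change are what was intended.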


Results

• Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1

• Provide a heatmap with the significant genes
  – genes.ids=c(which( fdr < 0.05 & foldchange > 0 ),which( fdr < 0.05 & foldchange <0 ))
  – colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
  – heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
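One possible way to produce the requested .csv file is sketched below. The gene labels, fold changes and FDR values are toy numbers, and the column names and output location are assumptions for illustration; in the lab they would come from genes.refseq, foldchange and fdr.

```r
# Toy results table (made-up values standing in for the real analysis).
results <- data.frame(
  gene       = c("geneA", "geneB", "geneC", "geneD"),
  foldchange = c(1.8, -2.1, 0.2, -0.1),
  fdr        = c(0.01, 0.03, 0.60, 0.20),
  stringsAsFactors = FALSE
)

# Keep genes passing the 5% FDR cut-off and label their direction.
significant <- results[results$fdr < 0.05, ]
significant$direction <- ifelse(significant$foldchange > 0, "up", "down")

# Write to a temporary file here; use a real path for the lab hand-in.
out.file <- tempfile(fileext = ".csv")
write.csv(significant, out.file, row.names = FALSE)

# Read the file back to confirm what was written.
check <- read.csv(out.file, stringsAsFactors = FALSE)
print(check)
```

Setting row.names = FALSE avoids an extra unnamed index column in the file.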


Beyond the gene list paradigm

• http://david.abcc.ncifcrf.gov