
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Lab 4: R and Bioconductor II

Feb 15, 2012

Alejandro Quiroz and Daniel Fernandez



• www.bioconductor.org
  – Provides tools for the analysis of high-throughput genomic data
    • Software, data, documentation
    • Training materials
    • Mailing list
  – Based on R
    • Open to conduct our own analysis


What can Bioconductor do?


Outline

• Installation

• Packages

• Microarray data analysis
  – Affymetrix files

• Low level analysis

• High level analysis


Installation

• There are two types of installation
  – Core packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite()
  – Other packages
    > source("http://bioconductor.org/biocLite.R")
    > biocLite(c("pkg1", "pkg2", …, "pkgN"))
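The biocLite() script reflects the Bioconductor release available when this lab was given. Current Bioconductor releases install packages through the BiocManager package on CRAN instead. The sketch below wraps that route in a small helper; install_bioc is a hypothetical name introduced here for illustration, and the call is left commented out because it needs network access.

```r
# Hypothetical helper (not part of the lab): install Bioconductor packages
# through BiocManager, installing BiocManager itself from CRAN if needed.
install_bioc <- function(pkgs = character()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Example call (requires network access), using packages named in this lab:
# install_bioc(c("affy", "affyPLM"))
```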


BioConductor Packages

• View the installed packages:
  – rownames(installed.packages())

• General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings, multtest

• Annotation: annotate, AnnBuilder data packages.

• Graphics: geneplotter, hexbin.

• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma

• Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality

• Differential gene expression: edd, genefilter, limma, ROC, siggenes, EBArrays, factDesign

• Graphs and networks: graph, RBGL, Rgraphviz.

• Other data: SAGElyzer, DNAcopy, PROcess, aCGH


Microarray data analysis


Affymetrix data

• Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.

• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.


Affymetrix data

• Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).

• Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
  – The purpose of the MM probe design is to measure non-specific binding and background noise.

• Probe-pair: a (PM, MM) pair.

• Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.

• Affy ID: an identifier for a probe-pair set.
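The PM/MM relationship above can be sketched in base R: the mismatch probe is the perfect-match 25-mer with its 13th base swapped (G <-> C, A <-> T). The function name mismatch_probe and the example sequence are hypothetical and are not taken from any real chip.

```r
# Build the MM probe for a given PM 25-mer by swapping the middle (13th) base.
mismatch_probe <- function(pm) {
  bases <- strsplit(pm, "")[[1]]
  stopifnot(length(bases) == 25)
  # chartr maps A->T, C->G, G->C, T->A for the single middle base
  bases[13] <- chartr("ACGT", "TGCA", bases[13])
  paste(bases, collapse = "")
}

pm <- "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer for illustration
mm <- mismatch_probe(pm)

# PM and MM differ only at position 13
which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
```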


Affy Microarray data

• DAT file
  – Raw (TIFF) optical image of the hybridized chip

• CEL file
  – Cell intensity file; stores the results of the intensity calculations on the pixel values of the DAT file

• CDF (Chip Description File)
  – Provided by Affy; describes the probe array design, characteristics, probe utilization and content, and scanning and analysis parameters. These files are unique for each probe array type.


Affymetrix Data Flow

[Figure: flow diagram. Hybridized GeneChip → Scan Chip → DAT file → Process Image → CEL file; CEL file + CDF file → MAS4 / MAS5 / RMA / Quantile → High-Level Analysis]


Microarray analysis

• Go to GEO and download the data set:
  – GSE10940

• The R script has to be in the same folder as the .CEL files

• The data set contains 12 .CEL files
  – library(affy)
  – data.affy=ReadAffy()

• What is the name of the CDF file?

• How many genes are considered on the arrays?

• What is the annotation version?


The data set

• Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.

• Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.

• Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.

• Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.


Looking at RAW data: Low-level analysis

• MA plot
  MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")

• Image of an array
  image(data.affy)

• Density of the log intensities of the arrays
  hist(data.affy)

• Boxplot of the data
  boxplot(data.affy, col=seq(2,7,by=1))
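To make clear what an MA plot shows, here is a minimal base-R sketch that computes the two axes for a pair of arrays: M is the log2 ratio of the two intensities and A is their average log2 intensity. The intensities are simulated, not taken from GSE10940, and ma_values is a hypothetical helper name.

```r
# M = log2 ratio, A = average log2 intensity, for two arrays x and y
ma_values <- function(x, y) {
  data.frame(M = log2(x) - log2(y),
             A = (log2(x) + log2(y)) / 2)
}

# Simulated raw intensities for two arrays (illustrative only)
set.seed(1)
array1 <- 2^rnorm(1000, mean = 8, sd = 2)
array2 <- array1 * 2^rnorm(1000, mean = 0.3, sd = 0.5)

ma <- ma_values(array1, array2)
plot(ma$A, ma$M, pch = ".", xlab = "A", ylab = "M")
abline(h = 0, col = "red")  # well-normalized arrays scatter around M = 0
```

A cloud that sits away from the M = 0 line, or bends with A, suggests the arrays need normalization.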


Normalization

data.rma=rma(data.affy)

• Install the package affyPLM (along with its dependencies) to view the MA plot after normalization

library(affyPLM)
MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")

expr.rma=exprs(data.rma) # Puts data in a table
boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
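rma() combines background correction, quantile normalization and probe-set summarization. The sketch below illustrates only the quantile normalization step in base R on a small made-up matrix; it is not the affy implementation, and quantile_normalize is a hypothetical name.

```r
# Quantile normalization: force every column (array) to share the same
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  target <- rowMeans(sorted)                          # reference distribution
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy matrix: 4 probes (rows) on 3 arrays (columns), values made up
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8), nrow = 4)
qn <- quantile_normalize(toy)

# After normalization the boxplots of all arrays coincide
boxplot(data.frame(qn))
```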


Before moving forward…affy probeset names

• rownames(expr.rma)[1:100]

• Suffixes are meaningful, for example:
  • _at: hybridizes to unique antisense transcript for this chip
  • _s_at: all probes cross hybridize to a specified set of sequences
  • _a_at: all probes cross hybridize to a specified gene family
  • _x_at: at least some probes cross hybridize with other target sequences for this chip
  • _r_at: rules dropped
  • and many more…
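One way to see which suffixes occur on a chip is to strip everything before the suffix and tabulate the result. This is a sketch using a regular expression that covers only the single-letter suffixes listed above; the probe-set IDs are illustrative examples rather than rows from this data set, and probeset_suffix is a hypothetical helper.

```r
# Extract the trailing "_at" or "_<letter>_at" suffix from probe-set IDs.
# IDs that do not end in such a suffix are returned as NA.
probeset_suffix <- function(ids) {
  pattern <- "^.*?((_[a-z])?_at)$"
  ifelse(grepl(pattern, ids, perl = TRUE),
         sub(pattern, "\\1", ids, perl = TRUE),
         NA_character_)
}

example_ids <- c("1007_s_at", "1053_at", "117_at",
                 "1552256_a_at", "1553185_x_at")  # illustrative IDs
table(probeset_suffix(example_ids))

# With the lab data this could be applied as:
# table(probeset_suffix(rownames(expr.rma)))
```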


Custom CDF files

• The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip.

• However, its selection of probes relied on earlier genome and transcriptome annotation which is significantly different from current knowledge.

• The resulting informatics problems have a profound impact on the analysis and interpretation of the data.


Custom CDF files

• One solution: Dai, M. et al. (2005)

• They reorganized probes on more than a dozen popular GeneChips

• Comparing analysis results between the original and the redefined probe sets
  – Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.


Custom CDF files

• Go to:
  – http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
  – Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
  – Install and load it in R
    • R CMD INSTALL…
    data.affy@cdfName="drosophila2dmrefseqcdf"
    data.rma.refseq=rma(data.affy)
    expr.rma.refseq=exprs(data.rma.refseq)


High-level analysis

• Perform a comparison between the control group and the experimental group
  – Objective: Obtain the most significant genes with an FDR of 5% and with a fold change of 1
  – Information provided in "SamplePhenotype.csv" to obtain controls and mutant ids
    sample.ids=read.csv("SamplePhenotype.csv",header=FALSE)
    control=grep("Control",sample.ids[,2])
    mutants=grep("Logjam",sample.ids[,2])
  – Obtain just the RefSeq ids
    genes_t=matrix(rownames(expr.rma.refseq))
    genes.refseq=apply(genes_t,1,function(x) sub("_at","",x))


• Calculating the fold change for every gene (RMA values are on the log2 scale, so this is a difference of means)
  – foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )

• Perform a t-test and obtain the p-values
  – T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=TRUE )$p.value )

• Calculating the FDR
  – fdr=p.adjust(T.p.value, method="fdr")

• THE GENES
  – genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
  – genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]
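The steps above can be tried end to end on simulated data before running them on the real arrays. The sketch below is an illustration under stated assumptions: the expression matrix is random (6 control and 6 mutant columns, with 20 genes shifted on purpose), and min_lfc is a hypothetical parameter that applies the "fold change of 1" from the objective as a log2 cutoff, whereas the code above only splits on the sign of the fold change.

```r
set.seed(42)
n_genes <- 200
control <- 1:6   # assumed column layout for this simulation
mutants <- 7:12

# Simulated log2 expression matrix (not real data)
expr <- matrix(rnorm(n_genes * 12, mean = 8, sd = 0.3), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
expr[1:10,  mutants] <- expr[1:10,  mutants] + 2  # spiked-in up-regulated genes
expr[11:20, mutants] <- expr[11:20, mutants] - 2  # spiked-in down-regulated genes

# Log2 fold change: difference of group means on the log2 scale
log2fc <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])

# Equal-variance two-sample t-test per gene, then Benjamini-Hochberg FDR
p.value <- apply(expr, 1, function(x)
  t.test(x[mutants], x[control], var.equal = TRUE)$p.value)
fdr <- p.adjust(p.value, method = "fdr")

min_lfc <- 1  # hypothetical cutoff on the absolute log2 fold change
genes.up   <- rownames(expr)[fdr < 0.05 & log2fc >  min_lfc]
genes.down <- rownames(expr)[fdr < 0.05 & log2fc < -min_lfc]
```

Using rowMeans() for the fold change gives the same numbers as the apply() call in the lab and is faster on large matrices.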


Results

• Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1

• Provide a heatmap with the significant genes
  – genes.ids=c(which( fdr < 0.05 & foldchange > 0 ), which( fdr < 0.05 & foldchange < 0 ))
  – colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
  – heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
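For the .csv deliverable, one possible approach is to collect the per-gene statistics in a data frame, filter it, and write it with write.csv(). The gene names and numbers below are placeholders, and the output path uses a temporary directory; with the lab data the columns would come from genes.refseq, foldchange and fdr.

```r
# Placeholder results table (values are made up for illustration)
results <- data.frame(gene   = c("geneA", "geneB", "geneC"),
                      log2fc = c(1.8, -1.3, 0.2),
                      fdr    = c(0.001, 0.02, 0.4))

# Keep genes passing both cutoffs: FDR of 5% and absolute log2 fold change of 1
significant <- results[results$fdr < 0.05 & abs(results$log2fc) >= 1, ]

# Write the list without row names so the file has one clean header row
out_file <- file.path(tempdir(), "significant_genes.csv")
write.csv(significant, out_file, row.names = FALSE)
```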


Beyond the gene list paradigm

• http://david.abcc.ncifcrf.gov
