
Gene Expression Data Analysis (I)

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]

Bioinformatics tasks

[Figure: workflow of a microarray study. Labels: biological question, experiment design, microarray experiment, image analysis, pre-processing, data mining (differential expression, clustering, classification), biological interpretation, hypothesis, experimental verification, alongside data storage, data integration, and data visualization.]

BMIF 310, Fall 2010

Experiment design: well begun is half done

  A clearly defined biological question

  Good control of potential sources of variation (biological and technical)

  Statistically sound microarray experimental arrangement (replicates)

  Compliance with the standard of microarray information collection (MIAME)

  http://www.mged.org/Workgroups/MIAME/miame.html


Image analysis

  Analysis of the image of the scanned array in order to extract an intensity for each spot or feature on the array.

  Gridding: align a grid to the spots

  Segmentation: identify the shape of each spot

  Intensity extraction: extract intensity for each spot and potentially for each surrounding background

[Figure: segmentation methods: fixed circle, adaptive circle, seeded region growing]

Data preprocessing

  Background correction: subtract background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot

  Local subtraction: MM probe or local background

  Model-based correction: signal component + noise component

  Usually leads to an increased level of noise for low-expressing probe sets

  Normalization: remove systematic variation in a microarray experiment which affects the measured gene expression levels

  Experimenter bias

  Variability in experimental conditions

  Sample collection and preparation

  Machine parameters

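  Summarization: combine the multiple probe intensities for each probe set to produce an expression value estimate (for Affymetrix arrays)

  RMA model: $y_{kij} = \beta_{kj} + \alpha_{ki} + \varepsilon_{kij}$

  k, j, i refer to probe set, array and probe, respectively; $y_{kij}$ is the background-corrected, normalized, log2-transformed probe intensity, $\beta_{kj}$ the expression level of probe set k on array j, $\alpha_{ki}$ the probe affinity effect, and $\varepsilon_{kij}$ the error term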

Normalization (within array, two-channel arrays)

  Remove systematic differences due to intensity and location dependent dye biases

  MA-plot

  M=log2(Cy5/Cy3)

  A=(log2(Cy5)+log2(Cy3))/2

  Normalization

  Global lowess (locally weighted scatterplot smoothing) normalization: a nonlinear regression of the log2(ratio)s against the average log2(intensity). It computes local linear regressions that are joined together to form a smooth curve. This normalization accounts for intensity-dependent artifacts.

  Print-tip lowess normalization: the local linear regressions are computed separately within each print-tip group
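The M and A definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the lowess procedure itself: the function names `ma_values` and `median_center` are hypothetical, the intensities are made up, and median centering is used only as the simplest stand-in for subtracting a fitted lowess curve.

```python
import math
from statistics import median


def ma_values(cy5, cy3):
    """Return (M, A) lists for paired Cy5 and Cy3 spot intensities."""
    m_values, a_values = [], []
    for red, green in zip(cy5, cy3):
        log_red, log_green = math.log2(red), math.log2(green)
        m_values.append(log_red - log_green)        # M = log2(Cy5/Cy3)
        a_values.append((log_red + log_green) / 2)  # A = average log2 intensity
    return m_values, a_values


def median_center(m_values):
    """Shift M so that its median is zero (a stand-in, not lowess)."""
    centre = median(m_values)
    return [m - centre for m in m_values]


# Made-up intensities for three spots.
m, a = ma_values([1024, 512, 2048], [512, 512, 512])
print(m, a, median_center(m))
```

A lowess fit would replace the single median with a smooth curve of M against A, so that the correction can vary with intensity.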

[Figure: MA-plots before normalization and after global lowess normalization; MA-plots before normalization and after print-tip lowess normalization]

Normalization (between arrays)

  Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples

  Adjust using spike-in controls

  Multiply each array by a constant to make the mean (median) intensity the same for each individual array (Global normalization)

  Match the percentiles of each array (Quantile normalization)
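The last two options can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: each array is a list of intensities in the same gene order, the function names are hypothetical, the common target for global normalization is taken to be the mean of the array medians, and ties in quantile normalization are not averaged.

```python
from statistics import mean, median


def global_normalize(arrays):
    """Multiply each array by a constant so that all medians agree."""
    medians = [median(a) for a in arrays]
    target = mean(medians)  # assumed common target
    return [[x * target / med for x in a] for a, med in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give every array the same distribution of values."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [mean(column) for column in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices, low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace the value by the reference at its rank
        normalized.append(out)
    return normalized


# Two made-up arrays measuring the same three genes.
arrays = [[5, 2, 3], [4, 1, 6]]
print(global_normalize(arrays))
print(quantile_normalize(arrays))
```

Global normalization only rescales each array, whereas quantile normalization forces the whole intensity distribution to be identical across arrays while keeping each gene's rank within its array.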

[Figure: intensity distributions per array with no normalization, global normalization, and quantile normalization]

Get to know your data matrix

Rows are genes; columns are samples.

Gene      Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6
TNNC1        14.82     14.46     14.76     11.22     11.55     11.18
DKK4         10.71     10.37     11.23     19.74     19.73     18.78
ZNF185       15.20     14.96     15.07     12.57     12.37     12.10
CHST3        13.40     13.18     13.15     11.18     10.99     11.03
FABP3        15.87     15.80     15.85     13.16     12.99     13.05
MGST1        12.76     12.80     12.67     14.92     15.02     15.32
DEFA5        10.63     10.47     10.54     15.52     15.52     14.37
VIL1         11.47     11.69     11.87     13.94     14.01     13.72
AKAP12       18.26     18.10     18.50     15.60     15.69     15.62
HS3ST1       10.61     10.67     10.50     12.44     12.23     12.61
……           ……        ……        ……        ……        ……        ……






[Figure: data matrix with genes as rows and samples as columns; the samples are divided into case and control groups]

Fold change

  n-fold change

  Arbitrarily selected fold change cut-offs

  Usually ≥ 2 fold

  Pros

  Intuitive and easily visualised

  Simple and rapid

  Cons

  Statistically inefficient

  Magnitude does not necessarily indicate importance

  Often too restrictive

MA-plot

M: log ratio, log2(A/B)

A: average log intensity, log2(A*B)/2, where A and B inside the formulas are the intensities in the two conditions
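A fold change cut-off can be applied to rows of the data matrix along these lines. This is a sketch under assumptions that the slides do not state: the matrix values are treated as log2 intensities, Samples 1 to 3 and Samples 4 to 6 are treated as the two groups, and the function names are hypothetical. On the log2 scale the log ratio M is a difference of group means.

```python
import math
from statistics import mean

# Rows copied from the data matrix; assumed to be log2 intensities.
expression = {
    "TNNC1": [14.82, 14.46, 14.76, 11.22, 11.55, 11.18],
    "DKK4": [10.71, 10.37, 11.23, 19.74, 19.73, 18.78],
    "VIL1": [11.47, 11.69, 11.87, 13.94, 14.01, 13.72],
}
GROUP_1 = [0, 1, 2]  # assumed grouping: Samples 1-3
GROUP_2 = [3, 4, 5]  # assumed grouping: Samples 4-6


def log2_fold_change(values):
    """M: mean log2 intensity of group 2 minus that of group 1."""
    return mean(values[i] for i in GROUP_2) - mean(values[i] for i in GROUP_1)


def average_log_intensity(values):
    """A: average of the two group means on the log2 scale."""
    return (mean(values[i] for i in GROUP_1) + mean(values[i] for i in GROUP_2)) / 2


def passes_cutoff(log2_fc, fold=2.0):
    """True when the change is at least `fold` in either direction."""
    return abs(log2_fc) >= math.log2(fold)


for gene, values in expression.items():
    m = log2_fold_change(values)
    print(gene, round(m, 2), round(average_log_intensity(values), 2), passes_cutoff(m))
```

The cut-off looks only at the size of M, which is why it says nothing about how variable the replicates are.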


Statistical analysis


Null hypothesis: $H_0: \mu_1 = \mu_2$

Alternative hypothesis: $H_1: \mu_1 \neq \mu_2$

Planning experiments for case-control studies

  Simulation of the dependency of fold change detection on the sample size. Experimental error is assumed to be 20%, i.e., CV of replicated control and treatment series equals 0.2. Samples are drawn from Gaussian distributions with mean equal to 1 for the control series and mean equal to 1.5 (black), 2 (red), 2.5 (green), 3 (blue), 5 (yellow), and 10 (magenta) for the treatment samples, respectively, in order to simulate the fold changes. Sampling is repeated 1000 times and the proportion of true positive test results (P < 0.05) is plotted (Y-axis) over the sample size (X-axis).
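A simulation of this kind can be sketched as below. The caption does not say which test produced the P values, so Welch's t-test is an assumption here, as are the function names and the random seed. The t distribution p value is obtained by numerical integration to keep the sketch free of third-party libraries.

```python
import math
import random
from statistics import mean, variance


def t_two_sided_p(t, df, steps=200):
    """Two-sided Student's t p value by Simpson's rule.

    Substituting x = sqrt(df) * tan(theta) turns the density integral
    into an integral of cos(theta) ** (df - 1) over a bounded interval.
    """
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    upper = math.atan(abs(t) / math.sqrt(df))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)  # the two end points
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    return max(0.0, 1.0 - 2.0 * const * total * h / 3)


def welch_t_test(x, y):
    """Two-sided Welch's t-test (assumed test); returns the p value."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t_two_sided_p(t, df)


def simulated_power(fold_change, n, cv=0.2, repeats=1000, alpha=0.05, seed=0):
    """Proportion of simulated experiments with P < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        control = [rng.gauss(1.0, cv) for _ in range(n)]
        treated = [rng.gauss(fold_change, cv * fold_change) for _ in range(n)]
        if welch_t_test(control, treated) < alpha:
            hits += 1
    return hits / repeats


for n in (2, 3, 5, 10):
    print(n, [simulated_power(fc, n) for fc in (1.5, 2, 2.5, 3, 5, 10)])
```

Each row of the printout corresponds to one sample size, and each column to one of the fold changes listed in the caption.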


Differential Gene Expression: DNA arrays (continuous data)

  Statistical tests

  Student’s t-test: a two-sample location test of the null hypothesis that the means of two normally distributed populations with equal variance are equal.

  Welch’s t-test: unequal variance

  Mann–Whitney U test (also called Wilcoxon rank-sum test): nonparametric

  t-test vs U-test

  Robustness: U-test is more robust to outliers

  Efficiency: When normality holds, the efficiency of the U-test is about 0.95 when compared to the t-test. For distributions sufficiently far from normal and for sufficiently large sample sizes, the U-test can be considerably more efficient than the t-test.

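Example: one gene measured in three samples per group

Gene   Group 1              Group 2              t-test   U test
GeneX  9.61  11.03  10.50   11.44  12.23  13.61  p=0.06   p=0.1
GeneX  9.61  11.03  10.50   11.44  12.23  25.61  p=0.32   p=0.1

Replacing 13.61 with the outlier 25.61 inflates the variance and weakens the t-test, while the rank-based U test is unchanged.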
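The GeneX example can be recomputed with a short script. This is a sketch under assumptions: the six values are split three against three, the t-test is taken to be Welch's variant (the slide does not say which variant was used), and the function names are hypothetical. The U test is computed exactly by enumerating every way of assigning the six values to two groups of three.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_two_sided_p(t, df, steps=200):
    """Two-sided Student's t p value by Simpson's rule (x = sqrt(df) * tan(theta))."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    upper = math.atan(abs(t) / math.sqrt(df))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    return max(0.0, 1.0 - 2.0 * const * total * h / 3)


def welch_t_test(x, y):
    """Two-sided Welch's t-test (assumed variant); returns the p value."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t_two_sided_p(t, df)


def exact_u_test(x, y):
    """Exact two-sided Mann-Whitney U test; assumes no tied values."""
    pooled = list(x) + list(y)
    n_x = len(x)

    def u_stat(indices):
        chosen = set(indices)
        # Count pairs in which a "first group" value exceeds a "second group" value.
        return sum(1 for i in chosen for j in range(len(pooled))
                   if j not in chosen and pooled[i] > pooled[j])

    centre = n_x * len(y) / 2
    observed = abs(u_stat(range(n_x)) - centre)
    labelings = list(combinations(range(len(pooled)), n_x))
    extreme = sum(1 for idx in labelings if abs(u_stat(idx) - centre) >= observed)
    return extreme / len(labelings)


group_1 = [9.61, 11.03, 10.50]
for group_2 in ([11.44, 12.23, 13.61], [11.44, 12.23, 25.61]):
    print(round(welch_t_test(group_1, group_2), 2), exact_u_test(group_1, group_2))
```

With three samples per group and complete separation, only 2 of the 20 possible labelings are as extreme as the observed one, so the exact U test cannot go below 0.1 however large the outlier becomes.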

Differential Gene Expression: sequencing-based technologies (count data)

  2 x 2 contingency table

  Statistical tests

  Chi-square test

  Fisher’s exact test

  Poisson regression

                             Counts in case  Counts in control  Total
Counts for gene X                  a                 b           a+b
Counts for all other genes         c                 d           c+d
Total                             a+c               b+d          a+b+c+d
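Fisher's exact test on this table can be sketched with the standard library alone. The counts below are hypothetical and the function name is made up for illustration. The two-sided p value is taken here as the total probability of all tables, with the same margins, that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability that x of the gene X counts fall in the case column.
        return comb(row1, x) * comb(total - row1, col1 - x) / denom

    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more likely than the observed one.
    tail = sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts: gene X has 30 reads in case and 10 in control,
# all other genes have 9,970 and 9,990 reads.
print(fisher_exact_two_sided(30, 10, 9970, 9990))
```

Exact integer arithmetic keeps the binomial coefficients accurate even for library sizes in the tens of thousands, at the cost of speed for very deep libraries.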

Correction for multiple testing

  Why?

  In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be expected to be called significant even if none is differentially expressed

  The probability of drawing the wrong conclusion in at least one of n independent tests is

  $P(\text{wrong}) = 1 - (1 - \alpha_s)^n = \alpha_g$

  where $\alpha_s$ is the significance level at the single-gene level and $\alpha_g$ is the global significance level.
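The arithmetic on this slide can be checked numerically. The sketch below uses hypothetical function names; the third function simply rearranges the same formula to give the single-gene level needed for a chosen global level, and it relies on the same independence assumption.

```python
def expected_false_positives(alpha_single, n_tests):
    """Expected number of genes called significant when none is changed."""
    return alpha_single * n_tests


def global_error(alpha_single, n_tests):
    """Probability of at least one wrong conclusion in n independent tests."""
    return 1 - (1 - alpha_single) ** n_tests


def single_level_for(alpha_global, n_tests):
    """Single-gene level that gives the requested global level (same formula, rearranged)."""
    return 1 - (1 - alpha_global) ** (1 / n_tests)


print(expected_false_positives(0.05, 10_000))
for n in (1, 10, 100, 10_000):
    print(n, global_error(0.05, n))
print(single_level_for(0.05, 10_000))
```

The global error climbs towards 1 very quickly as the number of tests grows, which is the motivation for the corrections that follow.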

Correction for multiple testing

  Methods

  Control the family-wise error rate (FWER), the probability of at least one type I error in the entire set (family) of hypotheses tested, e.g. the standard Bonferroni correction.

  Corrected p value = uncorrected p value x no. of genes tested

  Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, e.g. the Benjamini and Hochberg correction.

  Rank all genes according to their p value, smallest first

  Pick a desired FDR level, q (e.g. 5%)

  Find the largest rank i with $p_{(i)} \le \frac{i}{m} q$ and accept the genes ranked 1 to i, where $p_{(i)}$ is the i-th smallest p value and m is the total number of genes tested.
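Both corrections fit in a few lines of Python. This is a minimal sketch with hypothetical function names and made-up p values. The Benjamini and Hochberg part is written as the usual step-up rule: find the largest rank whose p value is at or below (rank / m) x q, then accept every gene up to that rank.

```python
def bonferroni(p_values):
    """Corrected p value = uncorrected p value x number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per gene: True if accepted at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # gene indices by p value
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_rank = rank  # keep the largest rank that meets the threshold
    accepted = [False] * m
    for idx in order[:largest_rank]:
        accepted[idx] = True
    return accepted


# Made-up p values for four genes.
p_values = [0.001, 0.03, 0.035, 0.5]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values, q=0.05))
```

In this made-up example the second gene misses its own threshold but is still accepted because the third gene meets its threshold, which is the difference between the step-up rule and stopping at the first failure.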

Gene list interpretation


92546_r_at

92545_f_at

96055_at

102105_f_at

102700_at

161361_s_at

92202_g_at

103548_at

100947_at

101869_s_at

102727_at

160708_at

…...

[Figure: microarray data, after normalization and clustering, give lists of genes with potential biological interest]





HSPA1A HSPA1B HSPA1L HSPA8 HSPB1 HSPB2 HSPB8 HSPC138 HSPD1 HSPE1 HSPH1 HYPB HYPK IBRDC2 ID4 IGFBP5 IL1F5 IL6ST ……

PNRC1 GADD45B RRAGC DDIT3 ASNS FOSB UBE2H EPC1 HDAC9 JMJD1C RRAGC RIT1 PURA …...

Over-representation analysis

[Figure: an input gene list (152 genes) is compared with a predefined functional category (339 genes). Observed overlap: 16 genes; expected: 2.6. Enrichment ratio: 6.08; p value: 9.34E-9]

Over-representation analysis

                        Significant genes  Non-significant genes  Total
Genes in the category          k                   j-k              j
Other genes                   n-k                m-n-j+k           m-j
Total                          n                   m-n              m

Hypergeometric distribution: given a total of m genes where j genes are in the functional category, if we pick n genes randomly, what is the probability of having k or more genes from the category?

Zhang et al. Nucleic Acids Res. 33:W741, 2005

$p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}$
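The hypergeometric sum translates directly into code. In the sketch below the function names are hypothetical, and the total number of genes m = 20,000 is an assumed value: the slides give n = 152, j = 339 and k = 16 but not m, so the printed numbers are not expected to match the figures quoted earlier exactly.

```python
from math import comb


def enrichment_p_value(m, j, n, k):
    """P(k or more category genes) when n genes are picked at random from m,
    of which j belong to the category."""
    denom = comb(m, n)
    numer = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, min(n, j) + 1))
    return numer / denom


def enrichment_ratio(m, j, n, k):
    """Observed count divided by the count expected by chance, n * j / m."""
    return k / (n * j / m)


M_TOTAL = 20_000  # assumed; not given in the slides
print(enrichment_ratio(M_TOTAL, 339, 152, 16))
print(enrichment_p_value(M_TOTAL, 339, 152, 16))
```

Summing exact integers before dividing avoids the rounding problems that appear when very small probabilities are added one by one.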

Commonly used functional categories

  Gene Ontology (http://www.geneontology.org)

  Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products

  Three organizing principles: molecular function, biological process, and cellular component

  Pathways

  KEGG (http://www.genome.jp/kegg/pathway.html)

  Pathway Commons (http://www.pathwaycommons.org/pc/)

  WikiPathways (http://www.wikipathways.org)

  Common targets of transcription factors

  TRANSFAC (http://www.gene-regulation.com)

  Cytogenetic bands

WebGestalt: Web-based Gene Set Analysis Toolkit

  8 organisms

  132 ID types

  73,986 functional categories

http://bioinfo.vanderbilt.edu/webgestalt

Zhang et al. Nucleic Acids Res. 33:W741, 2005

Duncan et al. BMC Bioinformatics. 11:P10, 2010

WebGestalt: over-represented GO biological processes

WebGestalt: over-represented pathway

Limitations of the over-representation analysis

  Does not account for the order of genes in the significant gene list

  Arbitrary thresholding leads to the loss of information

  Assumes genes are independent

Gene Set Enrichment Analysis (GSEA)

  Test whether the members of a predefined functional category are randomly distributed throughout the ranked gene list

  Calculation of an Enrichment Score (ES): modified Kolmogorov-Smirnov test

  Estimation of the significance level of the ES: permutation test

  Adjustment for multiple hypothesis testing: control of the false discovery rate

  Leading edge subset: the genes that contribute to the significance

Subramanian et al. PNAS 102:15545, 2005

http://www.broad.mit.edu/gsea/
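The enrichment score can be illustrated with a weighted running sum over the ranked list. This is a simplified sketch, not the GSEA software: the function name, gene names and scores are made up, the set is assumed to contain at least one gene in the list and to leave at least one gene outside it, and the permutation test and FDR steps are not shown.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked_genes: gene names, ordered from most positive to most negative score.
    scores: the ranking metric of each gene, in the same order.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit_weight = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    running, best = 0.0, 0.0
    for w, is_hit in zip(hit_weights, hits):
        # Step up at a set member (in proportion to its score), step down otherwise.
        running += w / total_hit_weight if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_score(genes, scores, {"g1", "g2"}))  # set at the top of the list
print(enrichment_score(genes, scores, {"g5", "g6"}))  # set at the bottom of the list
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one; with `weight=0` every set member counts equally, which corresponds to the unmodified Kolmogorov-Smirnov statistic.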

Summary



[Figure: workflow diagram repeated from the opening slide: biological question, experiment design, image analysis, pre-processing, data mining (clustering, classification), data visualization]
