Gene Expression Data Analysis (I)
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
Bioinformatics tasks
Microarray experiment
Data storage
Data integration
Data visualization
Biological question
Experiment design
Image analysis
Pre-processing
Data Mining
Differential expression
Clustering
Classification
Hypothesis
Experimental verification
Biological interpretation
BMIF 310, Fall 2010
Experiment design: well begun is half done
A clearly defined biological question
Good control of potential sources of variation (biological and technical)
Statistically sound microarray experimental arrangement (replicates)
Compliance with the standard of microarray information collection (MIAME)
http://www.mged.org/Workgroups/MIAME/miame.html
Image analysis
Analysis of the image of the scanned array in order to extract an intensity for each spot or feature on the array.
Gridding: align a grid to the spots
Segmentation: identify the shape of each spot
Intensity extraction: extract intensity for each spot and potentially for each surrounding background
Segmentation methods: fixed circle, adaptive circle, seeded region growing
Data preprocessing
Background correction: subtract background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot
Local subtraction: MM probe or local background
Model-based correction: signal component + noise component
Usually leads to an increased level of noise for low-expressing probe sets
Normalization: remove systematic variation in a microarray experiment which affects the measured gene expression levels
Experimenter bias
Variability in experimental conditions
Sample collection and preparation
Machine parameters
Summarization: combine the multiple probe intensities for each probe set to produce an expression value estimate (for Affymetrix arrays)
RMA model: $y_{kij} = \beta_{kj} + \alpha_{ki} + \varepsilon_{kij}$
k, j, i refer to probe set, array and probe, respectively
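The additive RMA model can be illustrated with a small median polish, the robust fitting procedure RMA is generally described as using. This is a minimal sketch for a single probe set, assuming the input values are already background-corrected, normalized and log2-transformed, and reading $\beta$ as the array-level expression value and $\alpha$ as the probe effect (the slide does not spell this out). The function name `median_polish` and the toy numbers are hypothetical.

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Fit y[i][j] = overall + probe_eff[i] + array_eff[j] + residual.

    y[i][j] is the log2 intensity of probe i on array j for ONE probe set.
    Returns (expression estimate per array, probe effects, residuals).
    """
    resid = [row[:] for row in y]
    n_probes, n_arrays = len(y), len(y[0])
    probe_eff = [0.0] * n_probes
    array_eff = [0.0] * n_arrays
    overall = 0.0
    for _ in range(n_iter):
        # Sweep out the median of each probe (row).
        for i in range(n_probes):
            med = median(resid[i])
            probe_eff[i] += med
            resid[i] = [v - med for v in resid[i]]
        med = median(array_eff)
        overall += med
        array_eff = [v - med for v in array_eff]
        # Sweep out the median of each array (column).
        for j in range(n_arrays):
            med = median(resid[i][j] for i in range(n_probes))
            array_eff[j] += med
            for i in range(n_probes):
                resid[i][j] -= med
        med = median(probe_eff)
        overall += med
        probe_eff = [v - med for v in probe_eff]
    expression = [overall + a for a in array_eff]
    return expression, probe_eff, resid

# Toy probe set: 3 probes (rows) measured on 3 arrays (columns).
toy = [[4.0, 6.0, 5.0],
       [5.0, 7.0, 6.0],
       [6.0, 8.0, 7.0]]
expression, probe_eff, resid = median_polish(toy)
print(expression, probe_eff)
```

Medians are used instead of means so that a single aberrant probe has little influence on the summarized expression value.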
Normalization (within array, two-channel arrays)
Remove systematic differences due to intensity and location dependent dye biases
MA-plot
M = log2(Cy5/Cy3)
A = (log2(Cy5) + log2(Cy3))/2
Normalization
Global lowess (locally weighted scatterplot smoothing) normalization: a non-linear regression of the log2(ratio) values against the average log2(intensity). It computes local linear regressions that are joined together to form a smooth curve. This normalization accounts for intensity-dependent artifacts.
Print-tip lowess normalization: the local linear regression computations are limited to a single print-tip group
[Figure: before normalization vs. global lowess normalization; before normalization vs. print-tip lowess normalization]
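The within-array procedure can be sketched in a few lines: compute M and A for every spot, fit a smooth curve of M against A, and subtract the fitted value from M. The code below is a simplified illustration, not a production lowess: it uses one pass of tricube-weighted local linear regression without the robustness iterations of full lowess, and the names `ma_values`, `lowess_fit`, `lowess_normalize` and the `frac` parameter are hypothetical. For print-tip normalization the same function would be applied separately to the spots of each print-tip group.

```python
import math

def ma_values(cy5, cy3):
    """Return (M, A) lists for paired Cy5/Cy3 spot intensities."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(cy5, cy3)]
    return m, a

def lowess_fit(x, y, frac=0.5):
    """Fitted y at each x from a tricube-weighted local linear regression."""
    n = len(x)
    k = max(2, math.ceil(frac * n))          # number of points in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12            # bandwidth: k-th nearest distance
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(cy5, cy3, frac=0.5):
    """Return normalized M values: M minus the lowess curve of M on A."""
    m, a = ma_values(cy5, cy3)
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

After normalization the M values should scatter around zero across the whole intensity range, which is what the MA-plot is used to check.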
Normalization (between arrays)
Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples
Adjust using spike control
Multiply each array by a constant to make the mean (median) intensity the same for each individual array (Global normalization)
Match the percentiles of each array (Quantile normalization)
[Figure: no normalization vs. global normalization vs. quantile normalization]
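Two of the between-array methods listed above can be sketched as follows. This is a minimal illustration assuming each array is a list of intensities in the same gene order; the function names are hypothetical, global normalization is shown with the median and with the average of the array medians as the common target (one possible choice), and ties in quantile normalization are broken by position rather than averaged.

```python
from statistics import median

def global_normalize(arrays):
    """Scale each array by a constant so that all arrays share one median."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)      # assumed common target
    return [[v * target / med for v in a] for a, med in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Give every array the same distribution (the mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value across arrays.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda idx: a[idx])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        result.append(normalized)
    return result

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(global_normalize(arrays))
print(quantile_normalize(arrays))
```

Global normalization only shifts the centre of each array, while quantile normalization forces the whole intensity distribution to match, which is a stronger assumption about the data.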
Get to know your data matrix
Rows: genes; columns: samples
Gene      Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6
TNNC1        14.82     14.46     14.76     11.22     11.55     11.18
DKK4         10.71     10.37     11.23     19.74     19.73     18.78
ZNF185       15.20     14.96     15.07     12.57     12.37     12.10
CHST3        13.40     13.18     13.15     11.18     10.99     11.03
FABP3        15.87     15.80     15.85     13.16     12.99     13.05
MGST1        12.76     12.80     12.67     14.92     15.02     15.32
DEFA5        10.63     10.47     10.54     15.52     15.52     14.37
VIL1         11.47     11.69     11.87     13.94     14.01     13.72
AKAP12       18.26     18.10     18.50     15.60     15.69     15.62
HS3ST1       10.61     10.67     10.50     12.44     12.23     12.61
……              ……        ……        ……        ……        ……        ……
Differential expression
The same data matrix, with the samples divided into case and control groups.
Fold change
n-fold change
Arbitrarily selected fold change cut-offs
Usually ≥ 2 fold
Pros
Intuitive and easily visualised
Simple and rapid
Cons
Statistically inefficient
Magnitude does not necessarily indicate importance
Often too restrictive
MA-plot
M: log ratio ( log2(A/B) )
A: average log intensity ( log2(A*B)/2 )
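As a small worked example of the fold change rule, the sketch below computes M and A for one gene from two intensities A and B (one per condition, on the raw, non-log scale) and applies a symmetric 2-fold cut-off. The function name `fold_change_call` and the `cutoff` parameter are hypothetical.

```python
import math

def fold_change_call(a, b, cutoff=2.0):
    """Return (M, A, changed) for raw intensities a and b of one gene."""
    m = math.log2(a / b)                # M: log ratio
    avg = math.log2(a * b) / 2          # A: average log intensity
    changed = abs(m) >= math.log2(cutoff)   # at least `cutoff`-fold up or down
    return m, avg, changed

print(fold_change_call(8.0, 2.0))   # 4-fold up
print(fold_change_call(3.0, 2.0))   # 1.5-fold up, below the cut-off
```

Working with |M| treats up- and down-regulation symmetrically: a 2-fold decrease gives M = -1 just as a 2-fold increase gives M = +1.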
Statistical analysis
For each gene in the data matrix, the case and control samples are compared.
Null hypothesis: $H_0: \mu_1 = \mu_2$
Alternative hypothesis: $H_1: \mu_1 \neq \mu_2$
Planning experiments for case-control studies
Simulation of the dependency of fold change detection on the sample size. Experimental error is assumed to be 20%, i.e., CV of replicated control and treatment series equals 0.2. Samples are drawn from Gaussian distributions with mean equal to 1 for the control series and mean equal to 1.5 (black), 2 (red), 2.5 (green), 3 (blue), 5 (yellow), and 10 (magenta) for the treatment samples, respectively, in order to simulate the fold changes. Sampling is repeated 1000 times and the proportion of true positive test results (P < 0.05) is plotted (Y-axis) over the sample size (X-axis).
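A simulation of this kind can be sketched as below. The caption does not say which test was used, so this sketch assumes a two-sided Student's t-test with pooled variance at the 5% level; the critical values in `T_CRIT` are the standard t-table values for df = 2n - 2 and only the three sample sizes listed are supported. The function name `power` and the seed are hypothetical.

```python
import random
from statistics import mean, variance

# Two-sided 5% critical values of Student's t for df = 2n - 2 (t-table values).
T_CRIT = {3: 2.776, 5: 2.306, 10: 2.101}

def power(fold_change, n, cv=0.2, n_sim=1000, seed=0):
    """Proportion of simulated experiments in which the change is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sim):
        # CV = 0.2 in both series, so the standard deviation scales with the mean.
        control = [rng.gauss(1.0, cv) for _ in range(n)]
        treated = [rng.gauss(fold_change, cv * fold_change) for _ in range(n)]
        pooled = (variance(control) + variance(treated)) / 2
        t = (mean(treated) - mean(control)) / (pooled * 2 / n) ** 0.5
        if abs(t) > T_CRIT[n]:
            detected += 1
    return detected / n_sim

for fc in (1.5, 2, 2.5, 3, 5, 10):
    print(fc, [power(fc, n) for n in (3, 5, 10)])
```

The printed rows correspond to the coloured curves in the figure: for a fixed fold change, the detected proportion rises with the number of replicates.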
Differential Gene Expression: DNA arrays (continuous data)
Statistical tests
Student’s t-test: a two sample location test of the null hypothesis that the means of two normally distributed populations are equal (equal variance).
Welch’s t-test: does not assume equal variance
Mann–Whitney U test (also called Wilcoxon rank-sum test): nonparametric
t-test vs U-test
Robustness: U-test is more robust to outliers
Efficiency: When normality holds, the efficiency of the U-test is about 0.95 when compared to the t-test. For distributions sufficiently far from normal and for sufficiently large sample sizes, the U-test can be considerably more efficient than the t-test.
Example: effect of a single outlier (first three values vs. last three values)
GeneX 9.61 11.03 10.50 11.44 12.23 13.61 (t-test: p=0.06; U test: p=0.1)
GeneX 9.61 11.03 10.50 11.44 12.23 25.61 (t-test: p=0.32; U test: p=0.1)
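The robustness point can be checked on the GeneX values with a short sketch. It assumes the first three values form one group and the last three the other, computes Welch's t statistic and its approximate degrees of freedom (turning t into a p-value needs the t distribution, which the Python standard library does not provide), and computes the exact two-sided rank-sum p-value by enumerating all group assignments. The function names are hypothetical and the rank-sum code assumes no tied values.

```python
from itertools import combinations
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def exact_rank_sum_p(x, y):
    """Exact two-sided Wilcoxon rank-sum (Mann-Whitney U) p-value, no ties."""
    pooled = sorted(x + y)
    rank = {v: r for r, v in enumerate(pooled, start=1)}
    n, total = len(x), len(pooled)
    expected = n * (total + 1) / 2            # mean rank sum under H0
    observed = abs(sum(rank[v] for v in x) - expected)
    sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    extreme = sum(1 for s in sums if abs(s - expected) >= observed)
    return extreme / len(sums)

group_a = [9.61, 11.03, 10.50]
for group_b in ([11.44, 12.23, 13.61], [11.44, 12.23, 25.61]):
    t, df = welch_t(group_a, group_b)
    p_u = exact_rank_sum_p(group_a, group_b)
    print(f"t = {t:.2f}, df = {df:.2f}, exact U-test p = {p_u:.2f}")
```

The outlier inflates the variance of the second group, so the t statistic shrinks even though the group difference grew, while the ranks, and therefore the U-test result, are unchanged.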
Differential Gene Expression: sequencing-based technologies (count data)
2 x 2 contingency table
                             Counts in case  Counts in control  Total
Counts for gene X            a               b                  a+b
Counts for all other genes   c               d                  c+d
Total                        a+c             b+d                a+b+c+d
Statistical tests
Chi-square test
Fisher’s exact test
Poisson regression
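Fisher's exact test on the 2 x 2 table can be written directly from the hypergeometric probabilities. The sketch below is a minimal two-sided version that sums the probabilities of all tables (with the same margins) that are no more likely than the observed one; the function name and the toy counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability of x counts in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Toy table: a=3, b=1 (gene X), c=1, d=3 (all other genes).
print(fisher_exact_two_sided(3, 1, 1, 3))
```

For real sequencing data the counts are large, where the chi-square test becomes a reasonable approximation; the exact test matters most when some cell counts are small.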
Correction for multiple testing
Why?
In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, about 10,000 x 0.05 = 500 genes would be expected to be called significant even if none is differentially expressed
The probability of drawing the wrong conclusion in at least one of the n different tests (assuming independent tests) is
$P(\text{wrong}) = 1 - (1 - \alpha_s)^n = \alpha_g$
where $\alpha_s$ is the significance level at the single-gene level and $\alpha_g$ is the global significance level.
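Both numbers on this slide are quick to reproduce. The sketch below evaluates the expected number of false positives and the global significance level for a given single-gene level, under the same independence assumption as the formula; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha_s):
    """Expected number of significant calls when no gene truly changes."""
    return n_tests * alpha_s

def global_alpha(alpha_s, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha_s) ** n_tests

print(expected_false_positives(10_000, 0.05))
for n in (1, 10, 100, 10_000):
    print(n, global_alpha(0.05, n))
```

Already at 10 tests the global level is about 0.40, which is why the single-gene threshold has to be adjusted.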
Correction for multiple testing
Methods
Control the family-wise error rate (FWER), the probability that there is at least one type I error in the entire set (family) of hypotheses tested. e.g. Standard Bonferroni correction: corrected p value = uncorrected p value x no. of genes tested
Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses. e.g. Benjamini and Hochberg correction:
Rank all genes according to their p value
Pick a desired FDR level, q (e.g. 5%)
Find the largest rank i with $p_{(i)} \le \frac{i}{m} q$ and accept all genes ranked 1 through i, where $p_{(i)}$ is the i-th smallest p value and m is the total number of genes tested.
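The two corrections can be sketched in a few lines. The Benjamini and Hochberg function below uses the step-up form of the rule: it looks for the largest rank i whose p value satisfies $p \le \frac{i}{m} q$ and accepts every gene up to that rank, even if some earlier gene missed its own threshold. The function names and the example p values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-corrected p values: p x number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the gene is accepted at FDR q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])   # rank by p value
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            cutoff_rank = rank            # remember the largest passing rank
    accepted = set(order[:cutoff_rank])
    return [idx in accepted for idx in range(m)]

pvals = [0.001, 0.025, 0.028, 0.039, 0.5]
print([p <= 0.05 for p in bonferroni(pvals)])
print(benjamini_hochberg(pvals, q=0.05))
```

On these five p values Bonferroni keeps only the first gene, while the FDR procedure keeps four, which illustrates why FDR control is usually preferred for exploratory gene lists.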
Gene list interpretation
92546_r_at
92545_f_at
96055_at
102105_f_at
102700_at
161361_s_at
92202_g_at
103548_at
100947_at
101869_s_at
102727_at
160708_at
…...
Microarray data → Normalization → Differential expression, Clustering → Lists of genes with potential biological interest
Over-representation analysis
Compare an input gene list (152 genes) with a predefined functional category (339 genes)
HSPA1A HSPA1B HSPA1L HSPA8 HSPB1 HSPB2 HSPB8 HSPC138 HSPD1 HSPE1 HSPH1 HYPB HYPK IBRDC2 ID4 IGFBP5 IL1F5 IL6ST ……
PNRC1 GADD45B RRAGC DDIT3 ASNS FOSB UBE2H EPC1 HDAC9 JMJD1C RRAGC RIT1 PURA …...
Observed: 16
Expected: 2.6
Enrichment ratio: 6.08
p value: 9.34E-9
                        Significant genes  Non-significant genes  Total
Genes in the category   k                  j-k                    j
Other genes             n-k                m-n-j+k                m-j
Total                   n                  m-n                    m
Hypergeometric distribution: given a total of m genes where j genes are in the functional category, if we pick n genes randomly, what is the probability of having k or more genes from the category?
$p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}$
Zhang et al. Nucleic Acids Res. 33:W741, 2005
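The hypergeometric p value translates directly into code with exact binomial coefficients. The sketch below uses the same symbols as the table (m, j, n, k); the enrichment ratio is computed as observed over expected with the expected count taken as n x j / m, which is an assumption about how the ratio on the earlier slide was obtained. The function names and the toy numbers are hypothetical.

```python
from math import comb

def ora_p_value(m, j, n, k):
    """P(k or more of the n picked genes are in a category of j genes, out of m)."""
    upper = min(n, j)
    favourable = sum(comb(m - j, n - i) * comb(j, i) for i in range(k, upper + 1))
    return favourable / comb(m, n)

def enrichment_ratio(m, j, n, k):
    """Observed overlap divided by the overlap expected by chance (n * j / m)."""
    return k / (n * j / m)

# Toy numbers: 10 genes in total, 4 in the category, 3 significant, 2 of them in the category.
print(ora_p_value(10, 4, 3, 2), enrichment_ratio(10, 4, 3, 2))
```

Because `math.comb` works on exact integers, the sum stays accurate for genome-sized values of m; only the final division produces a floating-point number.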
Commonly used functional categories
Gene Ontology (http://www.geneontology.org)
Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
Three organizing principles: molecular function, biological process, and cellular component
Pathways
KEGG (http://www.genome.jp/kegg/pathway.html)
Pathway Commons (http://www.pathwaycommons.org/pc/)
WikiPathways (http://www.wikipathways.org)
Common targets of transcription factors
TRANSFAC (http://www.gene-regulation.com)
Cytogenetic bands
WebGestalt: Web-based Gene Set Analysis Toolkit
http://bioinfo.vanderbilt.edu/webgestalt
8 organisms
132 ID types
73,986 functional categories
Zhang et al. Nucleic Acids Res. 33:W741, 2005
Duncan et al. BMC Bioinformatics. 11:P10, 2010
WebGestalt: over-represented GO biological processes
WebGestalt: over-represented pathway
Limitation of the over-representation analysis
Does not account for the order of genes in the significant gene list
Arbitrary thresholding leads to the loss of information
Assumes genes are independent
Gene Set Enrichment Analysis (GSEA)
Test whether the members of a predefined functional category are randomly distributed throughout the ranked gene list
Calculation of an Enrichment Score (ES): modified Kolmogorov-Smirnov test
Estimation of the significance level of the ES: permutation test
Adjustment for multiple hypothesis testing: control the false discovery rate
Leading edge subset: the genes that contribute to the significance
Subramanian et al. PNAS 102:15545, 2005
http://www.broad.mit.edu/gsea/
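The running-sum idea behind the enrichment score can be sketched as follows. This is a simplified illustration rather than the published GSEA procedure: it uses the unweighted Kolmogorov-Smirnov running sum (every hit counts equally, whereas GSEA weights hits by the ranking statistic) and estimates significance by drawing random gene sets of the same size instead of permuting sample labels. The function names are hypothetical, and the gene set is assumed to contain at least one, but not all, of the ranked genes.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of the hit/miss running sum."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at category members, step down at all other genes.
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """ES and an empirical p value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    as_extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)

ranked = [f"g{i}" for i in range(1, 21)]
print(permutation_p(ranked, {"g1", "g2", "g4"}))
```

A positive score means the category members cluster near the top of the ranked list and a negative score means they cluster near the bottom; the genes seen before the running sum reaches its extreme correspond to the leading edge subset.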
Summary
Microarray experiment
Data storage
Data integration
Data visualization
Biological question
Experiment design
Image analysis
Pre-processing
Data Mining
Differential expression
Clustering
Classification
Hypothesis
Experimental verification
Biological interpretation