LIMMA
Linear Models for Microarray Data
Difficulties with microarray data
• Variability of the expression values differs between genes
• Non-identical and dependent distribution between genes
• Multiple testing of tens of thousands of genes
Correct for multiple comparisons
• Multiple testing corrections: family-wise error rate, false discovery rate, etc.
• Parallel nature of the inference allows for compensating possibilities
• Borrowing information from the ensemble of genes to assist in inference from individual genes
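As an illustration of a false discovery rate correction, the sketch below applies the Benjamini-Hochberg adjustment to a handful of raw p-values. The slides do not say which correction was used, so the choice of Benjamini-Hochberg, the function name and the example p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw={p:.3f}  adjusted={q:.5f}")
```

With tens of thousands of genes the same procedure applies unchanged; only the number of tests `m` grows.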
Empirical Bayes
• In frequentist methods, a hypothesis is typically rejected or not rejected without directly assigning it a probability.
• Bayesian methods specify a prior probability, which is then updated in the light of new data.
• In Bayesian techniques, the prior distribution is assigned independently of the data and fixed before any data are observed.
Empirical Bayes
• Superficially similar to Bayesian methods in that a prior distribution is assigned.
• However, the prior distribution is estimated from the data
• Empirical Bayes can therefore be regarded as a frequentist technique
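The idea of estimating the prior from the data can be sketched with gene-wise variances. The example below is a simplified illustration, not the estimator used by any particular package: it assumes the prior variance is simply the mean of the observed gene-wise variances and treats the prior degrees of freedom `d0` as a fixed, assumed value. Each gene's variance is then shrunk toward the prior as a weighted average, (d0 * s0² + d * s²) / (d0 + d).

```python
def shrink_variances(sample_vars, df_resid, d0):
    """Shrink gene-wise variances toward a prior estimated from all genes.

    sample_vars: residual variance of each gene
    df_resid:    residual degrees of freedom per gene
    d0:          prior degrees of freedom (assumed here, not estimated)
    """
    # Prior variance taken as the ensemble mean (simplifying assumption).
    s0_sq = sum(sample_vars) / len(sample_vars)
    return [(d0 * s0_sq + df_resid * s2) / (d0 + df_resid)
            for s2 in sample_vars]


# Hypothetical variances for four genes; 8 arrays minus 4 parameters
# leaves 4 residual degrees of freedom.
variances = [0.02, 0.10, 0.50, 0.18]
shrunk = shrink_variances(variances, df_resid=4, d0=4)
for before, after in zip(variances, shrunk):
    print(f"{before:.2f} -> {after:.2f}")
```

Very small variances are pulled up and very large ones pulled down, which is how information borrowed from the ensemble of genes stabilises inference for each individual gene.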
LIMMA
• Empirical Bayes techniques have previously been applied to microarray data
• Those analyses were specific to the experiment and very difficult to implement
• LIMMA uses a simple model with a simple expression for the posterior odds
• Allows linear modelling to be applied to microarray data
Estrogen Data
• 2x2 factorial experiment on MCF7 breast cancer cells using Affymetrix HGU95av2 arrays
• Factors: estrogen (presence/absence) and length of exposure (10 hr/48 hr)
• The idea of the study is to identify genes that respond to estrogen treatment
Read in the Data
• Load in the estrogen data
• Normalise the data
• Define the targets (factors) for the linear model
Design Matrix
• Eight arrays
• Four pairs of replicates
• Four parameters in the linear model
Array  File          Estrogen  Time (hr)
1      low10-1.cel   absent    10
2      low10-2.cel   absent    10
3      high10-1.cel  present   10
4      high10-2.cel  present   10
5      low48-1.cel   absent    48
6      low48-2.cel   absent    48
7      high48-1.cel  present   48
8      high48-2.cel  present   48
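A minimal sketch of how the eight arrays map onto four parameters, assuming a group-means parameterisation (one coefficient per estrogen-by-time group), which is one common choice but is not stated on the slide. The group labels are hypothetical names introduced for illustration.

```python
# Targets as listed on the slide: (file, estrogen, time in hours).
targets = [
    ("low10-1.cel", "absent", 10),
    ("low10-2.cel", "absent", 10),
    ("high10-1.cel", "present", 10),
    ("high10-2.cel", "present", 10),
    ("low48-1.cel", "absent", 48),
    ("low48-2.cel", "absent", 48),
    ("high48-1.cel", "present", 48),
    ("high48-2.cel", "present", 48),
]

# Hypothetical labels for the four estrogen-by-time groups.
groups = ["absent10", "present10", "absent48", "present48"]

# One row per array, one indicator column per group.
design = [
    [1 if f"{estrogen}{hours}" == g else 0 for g in groups]
    for _, estrogen, hours in targets
]

print(groups)
for (name, _, _), row in zip(targets, design):
    print(f"{name:13s} {row}")
```

Each column contains exactly two ones, reflecting the four pairs of replicates.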
Contrast Matrix
• Estrogen effect at 10 hours
• Estrogen effect at 48 hours
• Time effect without estrogen
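The three contrasts can be written as weight vectors over the four group means. The sketch below assumes the groups are ordered absent/10 hr, present/10 hr, absent/48 hr, present/48 hr; the group labels, contrast names and expression values are hypothetical.

```python
# Assumed ordering of the four group-mean coefficients.
groups = ["absent10", "present10", "absent48", "present48"]

# Each contrast is a set of weights applied to the group means.
contrasts = {
    "estrogen_at_10h": [-1, 1, 0, 0],        # present10 - absent10
    "estrogen_at_48h": [0, 0, -1, 1],        # present48 - absent48
    "time_without_estrogen": [-1, 0, 1, 0],  # absent48 - absent10
}

# Hypothetical mean log2 expression of one gene in each group.
group_means = [7.0, 9.5, 7.5, 10.5]

effects = {
    name: sum(w * m for w, m in zip(weights, group_means))
    for name, weights in contrasts.items()
}

for name, value in effects.items():
    print(f"{name}: {value:+.2f}")
```

Because the expression values are on the log2 scale, each contrast value is a log2 fold change.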
Differential Expression
• Extract linear model fit for contrasts
• Obtain list of differentially expressed genes for contrasts
• Look for overlap among differentially expressed genes
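Once a list of differentially expressed genes exists for each contrast, the overlap is a set operation. The probe identifiers below are placeholders, not real results from the estrogen data.

```python
# Hypothetical lists of differentially expressed probes per contrast.
de_10h = {"probe_A", "probe_B", "probe_C", "probe_D"}
de_48h = {"probe_B", "probe_D", "probe_E"}

both = de_10h & de_48h      # respond to estrogen at both time points
only_10h = de_10h - de_48h  # respond at 10 hours only
only_48h = de_48h - de_10h  # respond at 48 hours only

print("both:", sorted(both))
print("10 hr only:", sorted(only_10h))
print("48 hr only:", sorted(only_48h))
```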
Linear Model Fit
• logFC - Estimate of the log2-fold-change corresponding to the effect or contrast
• AveExpr - Average log2-expression for the probe over all arrays/channels
• t - Moderated t-statistic
• P.Value - Raw p-value
• adj.P.Val - Adjusted p-value
• B - Log odds that the gene is differentially expressed
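Two of these columns are easier to read after a simple transformation. The sketch below converts a log2 fold change into a fold change and, assuming B is on the natural-log odds scale, converts it into a probability of differential expression. The function names are introduced here for illustration.

```python
import math


def fold_change(log_fc):
    """Convert a log2 fold change into a fold change."""
    return 2.0 ** log_fc


def prob_from_log_odds(b):
    """Convert natural-log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-b))


print(fold_change(1.0))                    # a logFC of 1 is a doubling
print(fold_change(-2.0))                   # a logFC of -2 is a four-fold decrease
print(round(prob_from_log_odds(0.0), 3))   # even odds
print(round(prob_from_log_odds(1.5), 3))
```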
Annotating Data
• Probe arrays can be annotated with external data
• Multiple sources of gene annotations
Gene Set Enrichment
• All biochemical pathways are determined by sets of genes
• Gene sets are determined by prior biological knowledge relating to co-expression, function, location or known biochemical pathways.
• If a pathway is in any way related to a biological trait then the co-functioning genes should display a higher degree of enrichment compared to the rest of the transcriptome.
• Gene Set Enrichment (GSE) is a computational technique that determines whether an a priori defined set of genes shows statistically significant overlap with the genes found to be differentially expressed
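One simple way to test for significant overlap is a hypergeometric (over-representation) test. The slides do not state which enrichment method was used, so this is only a sketch of one common approach; the function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_set belong to the gene set."""
    max_k = min(n_set, n_selected)
    total = comb(n_universe, n_selected)
    # Count the ways to obtain each overlap size from n_overlap upward.
    hits = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return hits / total


# Hypothetical counts: 20 genes measured, 5 in the gene set,
# 4 called differentially expressed, 3 of those in the gene set.
p = hypergeom_upper_tail(n_universe=20, n_set=5, n_selected=4, n_overlap=3)
print(f"p = {p:.4f}")
```

A small p-value suggests the gene set is over-represented among the differentially expressed genes compared with what random selection would give.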
Estrogen receptor (ER) gene set
• If estrogen is present, the estrogen receptor (ER) binds the estrogen and becomes activated
• The activated receptor gains the ability to regulate gene expression, resulting in differential expression between the cells with and without estrogen
• This should lead to up-regulation of the genes in the ER gene set