22/01/2013
Research Advanced Course. Liverpool, January 2013
Principles of Statistical Data Processing
Marta Milo University of Sheffield
Department of Biomedical Science
Outline
09:00-10:30 (Marta Milo)
Part I – Introduction to data analysis:
• General principles
• Defining pipelines
• Selection of an appropriate statistical model
Part II – Low-level analysis of the data:
• Data normalization and removal of artifacts
• Diagnostics and initial visualization
• Gene expression estimation and data structure
• Differential expression analysis and clustering methods
10:30-11:00 Coffee break
11:00-11:45
Part III – High-level analysis of microarray data:
• Multiple sampling and determination of significance
• FDR rather than p-value as an approach to significance
• Confounding factors and uncertainty on gene expression
Morning – Marta Milo / Nicolò Fusi
Outline (cont.)
11:45-12:30 (Steve Paterson) Introduction to high-throughput sequencing data analysis
12:30-13:30 Lunch - Madisons, Sherrington Building
13:30-14:30 Hands-on session: analysis of either microarray data or high-throughput sequencing data
14:30-15:00 Coffee break
15:00-16:30 Hands-on session (cont.)
Practicals (Steve Paterson)
1. The importance of defining your research questions, keeping in mind the limitations and the effective use of the data.
2. Consistency in sample preparation, optimisation of the samples, extensive QC of the data. LOOK at the data generated and QC before processing.
3. Choose the correct model to analyse your data and define appropriate parameters (RNA-Seq analysis) to get the maximum information out of your data.
4. Use the best tool to visualise your data, to discriminate, cluster and rank your significant targets.
5. Use pathway analysis to define novel hypotheses that can be investigated with "specific tools", mathematical and experimental.
Importance of sample preparation
It is certainly important to pause here and THINK about the nature of our samples.
Why is it so important for the data analysis?
RNA sample preparation for high-throughput assays
Experimental design: your RNA collection needs to reflect the biological questions you are asking
Optimise protocols:
• minimise technical errors
• minimise batch effects
• clean and pure sample to avoid contamination
Estimate the correct quantities:
• samples for validation
• optimisation of the protocols – avoid saturation or low quantification
Technical and biological replicates:
• ensure you have SOPs in place
RNA sample preparation for microarray assays
• RNA needs to be pure and intact
• Clean from genomic DNA and solvents
• Use RNA extraction protocols that are suitable for microarray assays
• Extensive QC at each step
• Consistent technical execution of the protocols
• If samples are extracted from cells/tissue that have a large component of "redundant RNA", make sure you deplete the sample of it (e.g. hemoglobin in whole blood)
• Make sure that the cleaning process DOES NOT compromise the quality of the RNA
RNA sample preparation for microarray assays (cont.)
Make sure that the sample you can collect is sufficient to cover your experimental design:
• technical replication: ensure you have enough RNA
• biological replication: ensure random extraction to minimise batch effects
Complex protocol: make sure you have SOPs in place.
Consistency and minimisation of technical variation
RNA sample preparation for RNA-Seq
Be very specific about the conditions and the samples you are collecting
Identify the two main technical preparations:
• poly(A) enrichment
• ribosomal depletion
Extensive QC – preparation of the library: see John Kenny's presentation
Key questions we need to consider
How much do I know about my system and the species I am studying?
How much do the available technologies know about my system?
What is my reference genome? How do I make it "sensitive" to my system?
Are the signals (reads) specific to my questions? If not, how do I adjust my experimental design so as to increase sensitivity in my predictions?
Is a high-throughput approach the correct approach for my research question?
Questions that we can leave open to DISCUSSION
A pearl of wisdom
It is worth spending some extra time and a few more control experiments before embarking on gene expression studies with microarrays and NGS.
The data is more interpretable and the predictions are possibly more robust, even if the number of significant targets appears to be small.
Predictions are often validated when they are robust.
Pipelines
Controlling all the variation, and ensuring that the analysis of the data is as reproducible as the experimental collection of samples, generates the need to define pipelines for the analysis of the data.
PIPELINES: Reproducible and robust protocols for numerical experimentation. In the case of biological data they are tailored to the system/organism under study.
HOW DO WE GET THEM?
Example of Pipelines
Microarray pipeline
http://www.liv.ac.uk/genomic-research/bioinformatics/
PART II
Choose the appropriate Statistical Model
Different platforms that generate gene expression data:
• Two- or one-color spotted cDNA arrays
• Affymetrix – new HJAY
• Illumina arrays
• Reads – RNA-Seq and other NGS assays
Interpret and analyse the data by first understanding where the data is coming from
Wikipedia – DNA microarrays
How do I quantify the gene expression?
Image processing, noise, sensitivity
[Figure: GeneChip® Array (1.28 cm). Each 20 µm oligonucleotide element holds millions of copies of a specific oligonucleotide sequence; the array carries ~1,000,000 different complementary oligonucleotides, to which the single-stranded, labeled RNA sample hybridises. Inset: image of a hybridised array. Slide courtesy of Affymetrix.]
J Biomol Tech. 2004 December; 15(4): 276–284.
• Relative expression
  • Important to choose the reference
  • Important to choose the experimental design
• Estimation of absolute expression
  • No reference
  • Not specific to the question you are asking
  • Important: the design – data analysis
Probe Set Notation
_at: probe sets are predicted to perfectly match only a single transcript
_s_at: probe sets are predicted to perfectly match multiple transcripts, which may be from different genes
_x_at: probe sets will contain some probes that are identical or highly similar to other sequences
Hybridize uniformly across probe pairs to the intended target
HG_U133 Plus v2 Affymetrix GeneChip
The sequences from which these probe sets were derived were selected from GenBank®, dbEST, and RefSeq. A single array contains more than 54,000 probe sets representing approximately 38,500 genes (estimated by UniGene coverage). 70 percent of the probe sets represent subcluster assemblies containing one or more non-EST sequences. Of the 16,737 EST-based probe sets, approximately 9,000 can now be associated with an mRNA or other non-EST sequence.
MM specific binding affinity
Now with new arrays: HJAY…
Affymetrix GeneChip HJAY
The HJAY array platform was designed using content from ExonWalk (C. Sugnet), Ensembl, and RefSeq databases (National Center for Biotechnology Information build 36). It interrogates ∼315,000 human transcripts from ∼35,000 genes and contains ∼260,000 junction (JUC) and ∼315,000 exonic (PSR) probe sets. A fraction of the probe sets had non-unique locations in the human genome and were likely to give cross-hybridization signal. Lapuk et al. retained only 501,557 probe sets from 23,546 transcript clusters.
STILL LOADS TO DO…
What's in the data?
How do we define a measure that best represents the absolute expression level of each gene on the chip?
[Figure: two plots of PM and MM intensities across probe pairs 1-16]
1. Summarise the probe intensities for each probe set into a single expression level
2. Estimate the variation introduced by:
   • the background effect
   • the probe affinity effect
3. Some PM/MM pairs are more reliable than others
4. The signal needs to be scaled before comparing data from different arrays
The approaches
Use single-point statistics
Make use of the information we have to define values that estimate gene expression
• MAS 5
• RMA – GCRMA
• PLIER
Use a probabilistic approach
Make use of the observed data to estimate the functions that have generated that data. Estimates of gene expression will be the most probable value that summarises the probe set
• PUMA
Microarray Suite (MAS5.0)
Signal ~ TukeyBiweight(log2(PMj – IMj))
• Signal = smoothed average over the PM/MM pairs representing a gene
• Signal is always positive: Absent - Present call
• Correction for global background, based on 16 sectors on each array
• Ideal mismatch (IM) intensity is calculated from the MM value and subtracted from PM:
  – if MM < PM then IM = MM
  – if MM ≥ PM then IM = PM – correction value
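As a rough illustration of the two ingredients above, the sketch below computes an ideal mismatch and then a one-step Tukey biweight of the log2(PM – IM) values. It is a simplification and not the Affymetrix implementation: the `correction` argument is a hypothetical stand-in for the correction value, and the tuning constants `c=5` and `eps=1e-4` are assumed defaults.

```python
from math import log2
from statistics import median


def ideal_mismatch(pm, mm, correction):
    """IM = MM when MM < PM, otherwise PM minus a correction value.

    `correction` is a hypothetical positive constant used for illustration.
    """
    return mm if mm < pm else pm - correction


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers.

    c and eps are assumed tuning constants, not values taken from the slides.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    weighted_sum = weight_total = 0.0
    for v in values:
        u = (v - centre) / (c * mad + eps)
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # zero weight far from the centre
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


def mas5_like_signal(pm, mm, correction=1.0):
    """Log2-scale signal for one probe set from its PM and MM intensities."""
    logs = [log2(p - ideal_mismatch(p, q, correction)) for p, q in zip(pm, mm)]
    return tukey_biweight(logs)
```

Because PM – IM is positive by construction (as long as the correction is positive), the logarithm is always defined, which is why the signal never goes negative on the raw scale.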
MAS5: p-value and calls
• First calculate the discriminant for each probe pair: R = (PM – MM)/(PM + MM)
• A one-sided Wilcoxon signed-rank test is used to compare R against the tau value and determine the p-value
• Present/Marginal/Absent calls are thresholded from the p-value above:
  – Present: p ≤ alpha1
  – Marginal: alpha1 < p < alpha2
  – Absent: p ≥ alpha2
• Default: alpha1 = 0.04, alpha2 = 0.06, tau = 0.015
Not very precise; accurate only when many replicates are available. Strongly dependent on MM. Uses linear scaling normalisation.
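The thresholding step can be written down directly. The sketch below covers only the discriminant and the mapping from a p-value to a call, using the default thresholds quoted above; the Wilcoxon test that produces the p-value is deliberately left out, and the function names are illustrative.

```python
def discriminant(pm, mm):
    """Discriminant score R = (PM - MM) / (PM + MM) for one probe pair."""
    return (pm - mm) / (pm + mm)


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a Present / Marginal / Absent call."""
    if p_value <= alpha1:
        return "Present"
    if p_value < alpha2:
        return "Marginal"
    return "Absent"
```

R lies between -1 and 1: values near 1 mean PM dominates MM, values near or below tau (0.015 by default) carry no evidence that the transcript is present.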
Robust Multi-array Average (RMA)
Signal ~ Tukey(log2(PMj – bkgdj))
• Subtract the background for each array from PM
• Intensity-dependent normalisation of PM – Bkgd
• Log transform
• Robust multichip analysis of all PM reporters in the set, using Tukey's median polish procedure
• Quantile normalisation: fit all the chips to the same distribution. Scale the chips so that they have the same mean.
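The median polish step can be sketched in a few lines. The code below assumes the input is a table of background-corrected, normalised, log2 PM values for one probe set (rows are probes, columns are chips) and fits an additive probe + chip model by alternately sweeping out row and column medians. The function names and the fixed iteration count are illustrative choices, not those of any particular package.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Fit value = overall + row effect + column effect + residual by medians."""
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the median out of every row (probe effect).
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median out of every column (chip effect).
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


def rma_like_summary(log_pm):
    """One expression estimate per chip: overall level plus chip effect."""
    overall, _, col_eff, _ = median_polish(log_pm)
    return [overall + c for c in col_eff]
```

Using medians rather than means is what makes the summary robust: a single aberrant probe on one chip ends up in the residuals instead of shifting the chip estimate.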
RMA assumptions:
1. Log-transformed, background-corrected expression values follow a linear model
2. The linear model is estimated using a "median polish" algorithm
3. Needs replicates: used with groups of chips (>3); more chips are better
4. Assumes all chips have the same background and distribution of values
5. Does not use the MM probes, as (PM – MM*) leads to high variance
6. Ignoring MM decreases accuracy but increases precision
Robust Multi-array Average (GCRMA)
MM specific binding affinity
Need to model that and include it in the estimation of the signal, using the GC content of the probes
Background adjustment: based on the sequence-specific brightness of the PM probes.
Although it is a model-based approach (it defines a model, then tries to fit the experimental data to the model), it DOES need multiple samples!
Assumption: the input is a group of samples that have the same distribution of intensities.
Nature Biotechnology 22, 656-658 (2004) doi:10.1038/nbt0604-656b
Probe Logarithmic Intensity Error (PLIER)
The PLIER algorithm was developed by Affymetrix and released in 2004. It is part of several commercially available packages. It incorporates experimental observations of feature behavior. It uses a probe affinity parameter, which represents the strength of a signal produced at a specific concentration for a given probe. The probe affinities are calculated using data across arrays. The error model employed by PLIER assumes error is proportional to observed intensity, rather than to background-subtracted intensity. It assumes that the error of the mismatch probe is the reciprocal of the error of the perfect match probe.
Improved precision over MAS 5
puma: gMOS and multi-mgMOS
Models {PM, MM} distributions for each probe set
[Figure: probability against values for the PM and MM distributions, with the signal marked]
Gamma functions are used to model the positive probe intensities.
mgMOS: MM and PM are drawn from a joint probability distribution:
$$y_{gjc} \sim \mathrm{Ga}(a_{gc} + \alpha_{gc},\, b_{gj}), \qquad m_{gjc} \sim \mathrm{Ga}(a_{gc} + \phi\,\alpha_{gc},\, b_{gj})$$
where $y_{gjc}$ and $m_{gjc}$ represent, respectively, the PM and MM intensities of the j-th probe pair in the g-th probe set on the c-th chip. Both follow a gamma distribution with the same inverse scale parameter $b_{gj}$, which is probe-pair specific. The shape parameters of the two gamma distributions are the sum of the background term $a_{gc}$ and the true specific hybridization signal term $\alpha_{gc}$ (scaled by $\phi$ for MM), which are probe-set and chip specific.
The $b_{gj}$ are latent variables reflecting the different binding affinity of probes within the probe set:
$$p(y_{gj}, m_{gj}) = \int db_{gj}\, p(b_{gj})\, p(y_{gj}, m_{gj} \mid a_g, \alpha_g, b_{gj})$$
mmgMOS: specific MM binding and multiple information across chips
A computationally efficient method is used for estimating the posterior distribution of the signal: the posterior is unimodal and approximated with a truncated Gaussian.
Normalisation
Why do we need to normalise the data?
1. We want to compare across chips
2. We need to ensure that all the data is compared equally against a baseline within the chip
Most methods have a normalisation step incorporated; others need it to be performed after gene expression estimation
• Scaling – mean and median
• Quantile
• Loess
Normalisation: scaling
Simply linearly scale the gene expression so that the overall mean / median are the same. The median is more scale-invariant, but for the most part there is little practical difference.
The assumption that mapping using quantiles or scaling is reasonable rests on the assumption that "most genes don't change", and quantiles use this more extensively than scaling. If this underlying assumption is doubtful, then using the above methods is not advisable.
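Scaling normalisation amounts to one multiplication per array. A minimal sketch, assuming each array is a plain list of positive intensities and that the common target is the average of the array medians (the choice of target is an assumption; any shared value would do):

```python
from statistics import median


def scale_to_common_median(arrays, target=None):
    """Linearly rescale every array so that all arrays share the same median."""
    medians = [median(a) for a in arrays]
    if target is None:
        target = sum(medians) / len(medians)  # assumed default target
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]
```

For example, `scale_to_common_median([[1, 2, 3], [2, 4, 6]])` brings both arrays to a median of 3; replacing `median` with a mean gives the mean-scaling variant.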
Normalisation: quantile
Assume that the distributions of probe intensities should be completely the same across chips. Start with n arrays and p probes, and form a [p, n] matrix X. Sort the columns of X, so that the entries in a given row correspond to a fixed quantile. Replace all entries in that row with their mean value. Finally, return each column to its original probe order.
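The recipe above translates almost line by line into code. This is a minimal sketch that takes the arrays as a list of columns; ties within a column are broken by position, which is a simplification (some implementations average tied ranks instead).

```python
def quantile_normalise(columns):
    """Quantile-normalise n arrays, each given as a list of p probe intensities."""
    p = len(columns[0])
    n = len(columns)
    # Sort every column, then average across columns at each rank (quantile).
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[k] for col in sorted_cols) / n for k in range(p)]
    result = []
    for col in columns:
        # order[rank] = index of the probe holding that rank in this column
        order = sorted(range(p), key=lambda i: col[i])
        new_col = [0.0] * p
        for rank, i in enumerate(order):
            new_col[i] = rank_means[rank]  # put the mean back in original order
        result.append(new_col)
    return result
```

After this step every array contains exactly the same set of values, so any remaining difference between arrays lies in which probes hold which ranks.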
Two-color array normalisation
Global normalisation methods assume the two dyes are related by a constant factor
Local normalisation methods assume that the dye factor is dependent on:
  • Spot intensity (defined as A = log √(RG))
  • Location on the array
Most common methods are:
• print-tip effect correction
• intensity-dependent Loess
• Visualise the effect: M-A plot
• Correction of the intensity-dependent variations:
LOESS normalisation
LOcally (W)Eighted polynomial regreSSion.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74, 829-836.
[Figure: M-A plot]
M = Adjusted Log Red – Adjusted Log Green
A = (Adjusted Log Green + Adjusted Log Red) / 2
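The M and A values are simple to compute from the two channels. The sketch below assumes base-2 logarithms and background-adjusted, strictly positive red and green intensities. It then applies only a global correction (subtracting the median of M), which corresponds to the constant-factor assumption; it is not a loess fit, which would instead subtract a smooth curve fitted to M as a function of A.

```python
from math import log2
from statistics import median


def ma_values(red, green):
    """Return (M, A): log-ratio and average log-intensity for every spot."""
    m = [log2(r) - log2(g) for r, g in zip(red, green)]
    a = [(log2(r) + log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def global_median_centre(m):
    """Global normalisation: shift all log-ratios so that their median is zero."""
    shift = median(m)
    return [v - shift for v in m]
```

Plotting M against A before and after the correction shows whether a constant shift is enough or whether the dye bias changes with intensity.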
Benchmarks and comparisons
http://affycomp.biostat.jhsph.edu/
Affycomp III: A Benchmark for Affymetrix GeneChip Expression Measures
The MicroArray Quality Control (MAQC)-II study, Nat. Biotech (2010)
Alternatives to microarrays:
• NGS gene expression quantification (e.g. RNA-seq)
• Affymetrix GeneChip Junction Arrays
There are no established sites and platforms for this yet. Benchmark datasets are available.
SEQanswers - the NGS community: http://seqanswers.com/forums/showthread.php?t=10797
For NGS gene expression quantification…
A lot changes… but
QUANTIFICATION
• Estimation of gene expression
• Alternative splicing identification
• Alternative isoform detection
• Transcript abundance
NORMALISATION
• Normalise read counts
• Normalise reads between lanes
• Normalise reads against transcript abundance and gene length
• Varying sequencing depth
• Other technical effects
HIGH LEVEL ANALYSIS
• Visualise and interpret
• Differential expression
• Clustering
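One common way to normalise read counts for both gene length and sequencing depth is RPKM (reads per kilobase of transcript per million mapped reads). The slides do not name a specific measure, so RPKM is used here only as a familiar example; the gene names and dictionary layout are illustrative.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads, for every gene.

    counts:     dict gene -> number of reads mapped to that gene
    lengths_bp: dict gene -> gene (or transcript) length in base pairs
    """
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_reads)
        for gene in counts
    }
```

Two genes with the same read count but different lengths get different values: with 500 reads each out of 1,000 in total, a 1,000 bp gene scores twice as high as a 2,000 bp gene.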
Data Visualisation
• Scatter plot [Figure: Cy3 against Cy5 for Slide 1 and Slide 2]
• Box plot [Figure: minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, maximum]
• MA plot [Figure: log fold change against log abundance]
Principal Component Analysis
It is one of the most commonly used techniques to visualise and interpret high-dimensional data.
It identifies the maximum spread of the data, maximising the variance by rotating the space where the data lives.
It uses a set of variables that are hidden from the user and are implicitly explained by the data (latent variables).
Every direction found that extracts informative features from the "noisy" cloud of data points is called a principal component.
Dimensionality reduction
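To make "the direction of maximum variance" concrete, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix. This is one simple way to obtain the leading component, chosen here because it needs no external library; production code would normally use an SVD routine. It assumes the data are not constant and that the arbitrary starting vector is not orthogonal to the first component.

```python
from math import sqrt


def first_principal_component(points, n_iter=200):
    """Return (direction, variance) of the first principal component.

    points: list of observations, each a list of d numbers.
    """
    n, d = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(d)]
    centred = [[p[k] - means[k] for k in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] + [0.5] * (d - 1)  # arbitrary starting direction (assumption)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance
```

Projecting each centred observation onto the returned direction gives its PC1 score; repeating the procedure on what is left after removing that projection would give PC2.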
Principal Component Analysis (cont…)
[Figure: PC1 against PC2. Y Haile-Selassie et al. Nature 483, 565-569 (2012) doi:10.1038/nature10922]
PCA is usually reasonable, but it assumes that the uncertainty associated with each gene is constant. Alternatives: non-linear transformation of gene expression (Huber et al. 2002); PUMA PCA (Sanguinetti et al., 2005)
Clustering
• Basic idea: group together genes that have a similar pattern of expression across conditions or across time
• What do we mean by similar?
• Different measures of similarity: Euclidean distance, angle between vectors, correlation coefficient, …
• A shared pattern of expression might be associated with similar functions
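The three similarity measures mentioned can each be written in a few lines. The sketch below treats every gene as a plain list of expression values and assumes that no profile is all zeros (for the angle) or constant (for the correlation).

```python
from math import acos, sqrt


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angle(x, y):
    """Angle between two profiles in radians (0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / norms))  # guard against rounding error
    return acos(cosine)


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

The profiles `[1, 2, 3]` and `[2, 4, 6]` illustrate why the choice matters: they are far apart in Euclidean distance but have zero angle and perfect correlation, because they differ only in scale.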
Hierarchical Clustering
• Builds a hierarchy of clusters
• Bottom up (merging clusters) or top down (splitting clusters)
• Eisen et al. (1998): the genes that are most correlated are joined together, and the expression value for the resulting node is the average expression of the two (or more) genes. The similarity matrix is then updated with the new node.
• Different similarity measures lead to different interpretations
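The bottom-up procedure described above can be sketched as a short loop: find the most correlated pair of nodes, merge them into a node whose profile is the average of the genes it contains, and repeat. This is an illustrative simplification rather than the original software; it recomputes all correlations at every step instead of updating a similarity matrix, and it assumes no profile is constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_profile_clustering(profiles):
    """Agglomerative clustering; returns the merges as (node_a, node_b, correlation).

    profiles: dict gene name -> list of expression values.
    """
    # Each node stores its average profile and the number of genes it contains.
    nodes = {name: (list(values), 1) for name, values in profiles.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        r, a, b = max(
            ((pearson(nodes[p][0], nodes[q][0]), p, q)
             for i, p in enumerate(names) for q in names[i + 1:]),
            key=lambda t: t[0],
        )
        (prof_a, n_a), (prof_b, n_b) = nodes.pop(a), nodes.pop(b)
        # Average over all genes in the new node (weighted by node size).
        merged = [(n_a * u + n_b * v) / (n_a + n_b) for u, v in zip(prof_a, prof_b)]
        nodes[f"({a},{b})"] = (merged, n_a + n_b)
        merges.append((a, b, r))
    return merges
```

The list of merges, read in order, is the dendrogram: early merges with high correlation are tight clusters, late merges with low correlation join unrelated groups.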
The Fold Change
Given two gene expression values x and y, the fold change is defined as FC = x/y
Given two vectors xj and yj of gene expression measurements for controls and cases for GENE j, the fold change is defined as
$$FC_j = \frac{\mu_{x_j}}{\mu_{y_j}}$$
where $\mu$ denotes the mean of the corresponding vector. It can also appear as a difference, when expression is measured on the log scale.
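Both forms of the fold change can be computed directly from the replicate measurements. The sketch below follows the definition given here (mean of x over mean of y) and assumes strictly positive, non-log expression values; the base-2 logarithm in the second function is a conventional choice, not something fixed by the slides.

```python
from math import log2
from statistics import mean


def fold_change(x, y):
    """Ratio of the mean expression in x to the mean expression in y."""
    return mean(x) / mean(y)


def log2_fold_change(x, y):
    """The same comparison expressed as a difference on the log2 scale."""
    return log2(mean(x)) - log2(mean(y))
```

On the log scale the measure is symmetric: a doubling gives +1 and a halving gives -1, whereas the raw ratio gives 2 and 0.5.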
Differential expression analysis
GOAL: Identify the most differentially expressed genes across different conditions, or between cases and controls.
HOW: identify a threshold that "defines" differential expression
• FC values
• p-values
• we need to quantify the False Discovery Rate
What happens if the sample size is small?
• The fold change becomes very sensitive to outliers
• The t-test becomes very sensitive to small variances
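The sensitivity of the t-test to small variances is easy to see numerically. The sketch below computes a two-sample t statistic using the unequal-variance (Welch) form, which is one common variant chosen here for illustration; it assumes at least two replicates per group and a non-zero variance in at least one group.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# A tiny difference with very small variances...
small_effect = welch_t([5.00, 5.01, 5.02], [5.10, 5.11, 5.12])
# ...versus a much larger difference with ordinary variances.
large_effect = welch_t([4, 5, 6], [6, 7, 8])
```

With only three replicates per group, the gene with a difference of 0.1 produces a far larger |t| than the gene with a difference of 2, purely because its variance happened to be estimated as very small.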
Part III
The problem of Multiple sampling