
22/01/2013


Research Advanced Course. Liverpool, January 2013

Principles of Statistical Data Processing

Marta Milo University of Sheffield

Department of Biomedical Science


Outline

09:00-10:30 (Marta Milo)

Part I – Introduction to data analysis:
•  General principles
•  Defining pipelines
•  Selection of an appropriate statistical model

Part II – Low-level analysis of the data
•  Data normalization and removal of artifacts
•  Diagnostics and initial visualization
•  Gene expression estimation and data structure
•  Differential Expression analysis and Clustering methods

10:30-11:00 Coffee break

11:00-11:45

Part III – High-level analysis of microarray data:
•  Multiple sampling and determination of significance
•  FDR rather than p value as an approach to significance
•  Confounding factors and uncertainty on gene expression

Morning – Marta Milo / Nicolò Fusi


Outline (cont.)

11:45-12:30 (Steve Paterson) Introduction to high-throughput sequencing data analysis

12:30-13:30 Lunch - Madisons, Sherrington Building

13:30-14:30 Hands on session: Analysis of either microarray data or high-throughput sequencing data

14:30-15:00 Coffee break

15:00-16:30 Hands on session cont.

Practicals (Steve Paterson)



1. Importance of defining your research questions, keeping in mind limitations and effective use of the data

2. Consistency in sample preparation, optimisation of the samples, extensive QC of the data. LOOK at the data generated and QC before processing

3. Choose the correct model to analyse your data, define appropriate parameters (RNA-Seq analysis) to get the maximum information out of your data

4. Use the best tool to visualise your data, to discriminate, cluster and rank your significant targets

5. Use pathway analysis to define novel hypotheses that can be investigated with “specific tools”, mathematical and experimental


Importance of Sample preparation

It is certainly important to pause here and THINK about the nature of our samples

Why is it so important for the data analysis?


RNA sample preparation for high-throughput assays

Experimental design: your RNA collection needs to reflect the biological questions you are asking

Optimise protocols:

•  minimise technical errors
•  minimise batch effects
•  clean and pure sample to avoid contamination

Estimate the correct quantities:

•  samples for validation
•  optimisation of the protocols – avoid saturation or low quantification

Technical and biological replicates:

ensure you have SOP in place



RNA sample preparation for microarray assays

•  RNA needs to be pure and intact

•  Clean from genomic DNA and solvents

•  Use RNA extraction protocols that are suitable for microarray assays

•  Extensive QC at each step

•  Consistent technical execution of the protocols

•  If samples are extracted from cells/tissue that have a large component of “redundant RNA”, make sure you deplete the sample of it (e.g. whole blood: hemoglobin).

•  Make sure that the process of cleaning DOES NOT compromise the quality of the RNA


RNA sample preparation for microarray assays (cont.)

Make sure that the sample you can collect is sufficient for covering your experimental design

technical replication: ensure you have enough RNA

biological replication: ensure random extraction to minimise batch effects

Complex protocol, make sure you have SOPs in place.

Consistency and minimisation of technical variation


RNA sample preparation for RNA-Seq

Be very specific on the conditions and on the samples you are collecting

Identify the two main technical preparations:

•  poly(A) enrichment
•  ribosomal depletion

Extensive QC – Preparation of the library (see John Kenny’s presentation)



Key questions we need to consider

How much do I know about my system and the species I am studying?

How much do the available technologies know about my system?

What is my reference genome? How do I make it “sensitive” to my system?

Are the signals (reads) specific to my questions? If not, how do I adjust my experimental design so as to increase sensitivity in my predictions?

Is a high-throughput approach the correct approach for my research question?

Questions that we can leave open to DISCUSSION


A pearl of wisdom

It is worth spending some extra time and a few more control experiments before embarking on gene expression studies with microarrays and NGS.

The data are more interpretable and the predictions are possibly more robust, even if the number of significant targets appears to be small.

Predictions are often validated – when robust.


Pipelines

All this variation, and the need to ensure that the analysis of the data is as reproducible as the experimental collection of samples, generate the need to define pipelines for the analysis of the data.

PIPELINES: Reproducible and robust protocols for numerical experimentation. In the case of biological data they are tailored to the system/organism under study.

HOW DO WE GET THEM?



Example of Pipelines

http://www.liv.ac.uk/genomic-research/bioinformatics/

Microarray pipeline


PART II


Choose the appropriate Statistical Model

Different platforms that generate gene expression:
•  Two or One color spotted cDNA arrays
•  Affymetrix - new HJAY
•  Illumina Arrays
•  Reads – RNA-Seq and other NGS assays

Interpret and analyse the data by first understanding where the data is coming from


Wikipedia – DNA microarrays

How do I quantify the gene expression?

Image processing noise Sensi5vity


[Figure: GeneChip® Array (1.28 cm) and image of a hybridised array. The array carries ~1,000,000 different complementary oligonucleotides; each 20 µm oligonucleotide element holds millions of copies of a specific oligonucleotide sequence; the sample is single-stranded, labeled RNA. Slide courtesy of Affymetrix.]

J Biomol Tech. 2004 December; 15(4): 276–284.

•  Relative expression
–  Important to choose the reference
–  Important to choose the experimental design

•  Estimation of Absolute expression
–  No reference
–  Not specific to the question you are asking
–  Important the design – data analysis



Probe Set Notation

_at: probe sets are predicted to perfectly match only a single transcript

_s_at: probe sets are predicted to perfectly match multiple transcripts, which may be from different genes

_x_at: probe sets will contain some probes that are identical or highly similar to other sequences.

Hybridize uniformly across probe pairs to the intended target


HG_U133 Plus v2 Affymetrix GeneChip

The sequences from which these probe sets were derived were selected from GenBank®, dbEST, and RefSeq. A single array contains more than 54,000 probe sets representing approximately 38,500 genes (estimated by UniGene coverage). 70 percent of the probe sets represent subcluster assemblies containing one or more non-EST sequences. Of the 16,737 EST-based probe sets, approximately 9,000 probe sets can now be associated with an mRNA or other non-EST sequence.

MM specific binding affinity Now with new arrays HJAY…..


The HJAY array platform was designed using content from ExonWalk (C. Sugnet), Ensembl, and RefSeq databases (National Center for Biotechnology Information build 36). It interrogates ∼ 315,000 human transcripts from ∼ 35,000 genes and contains ∼ 260,000 junction (JUC) and ∼ 315,000 exonic (PSR) probe sets. A fraction of probe sets had non-unique locations in the human genome and were likely to give cross-hybridization signal. In Lapuk et al., only 501,557 probe sets from 23,546 transcript clusters were retained.

STILL LOADS TO DO….

Affymetrix GeneChip HJAY



What’s in the data?

How do we define a measure that best represents the absolute expression level of each gene on the chip?

[Figure: two plots of PM and MM intensities for probe pairs 1 to 16]

1. Summarise to a single expression level the probe intensities for each probe set

2. Estimate the variations introduced by:
–  background effect
–  probe affinity effect

3. Some PM/MM pairs are more reliable than others

4. The signal needs to be scaled before comparing data from different arrays


The approaches

Use single point statistics: make use of the information we have to define values that estimate gene expression
•  MAS 5
•  RMA – GCRMA
•  PLIER

Use a probabilistic approach: make use of the observed data to estimate the function that has generated that data. Estimates of gene expression will be the most probable value that summarises the probe set
•  PUMA


Microarray Suite (MAS5.0)

Signal ~ TukeyBiweight(log2(PMj – IMj))

•  Signal = Smoothed average over PM/MM pairs representing a gene

•  Signal is always positive: Absent - Present Call

Correction for global background - based on 16 sectors on each array

Ideal mismatch (IM) intensity calculated from MM value and subtracted from PM.

- if MM < PM then IM = MM
- if MM > PM then IM = PM – correction value



MAS5: p-value and calls

•  First calculate discriminant for each probe pair: R = (PM - MM)/(PM + MM)

•  One-sided Wilcoxon signed rank test used to compare R vs tau value and determine p-value

•  Present/Marginal/Absent calls are thresholded from the p-value above:
–  Present: p <= alpha1
–  Marginal: alpha1 < p < alpha2
–  Absent: p >= alpha2

•  Default: alpha1=0.04, alpha2=0.06, tau=0.015

Not very precise, accurate only when many replicates are available. Strongly dependent on MM. Uses linear scaling normalisation.
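The detection call above can be sketched in a few lines. This is a minimal illustration, not the Affymetrix implementation: the function names (`discriminants`, `signed_rank_pvalue`, `detection_call`) are made up for this example, and the exact signed rank p-value is computed by a small counting routine that is only practical for a handful of probe pairs.

```python
def discriminants(pm, mm):
    """Discriminant R = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_pvalue(values, tau):
    """Exact one-sided Wilcoxon signed rank p-value for H1: median(values) > tau.

    Illustrative helper: ranks are doubled so that midranks for ties stay integers.
    """
    diffs = [v - tau for v in values if v != tau]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled midranks
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mid2 = (i + 1) + (j + 1)  # twice the average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks2[order[k]] = mid2
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Count, for every possible rank sum, how many sign assignments produce it.
    total2 = sum(ranks2)
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus2:]) / 2 ** n


def detection_call(p, alpha1=0.04, alpha2=0.06):
    """Threshold the p-value into Present / Marginal / Absent."""
    if p <= alpha1:
        return "P"
    if p < alpha2:
        return "M"
    return "A"


pm = [1000 + 10 * i for i in range(16)]  # made-up intensities
mm = [100] * 16
p = signed_rank_pvalue(discriminants(pm, mm), tau=0.015)
print(detection_call(p))
```

With PM well above MM for every pair, all discriminants exceed tau and the probe set is called Present; when PM and MM are equal the p-value is 1 and the call is Absent.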


Robust Multi-array Average (RMA)

Signal ~ Tukey (log2(PMj – bkgdj))

•  Subtract background for each array from PM
•  Intensity-dependent normalisation of PM-Bkgd
•  Log transform
•  Robust multichip analysis of all PM reporters in the set using Tukey median polishing procedure
•  Quantile normalisation: Fit all the chips to the same distribution. Scale the chips so that they have the same mean.


RMA Assumptions:

1.  log transformed, background corrected expression values follow a linear model,

2.  linear model is estimated by using a “median polish” algorithm

3.  needs replicates. Used with groups of chips (>3), more chips are better

4.  assumes all chips have same background, distribution of values.

5.  does not use the MM probes as (PM-MM*) leads to high variance

6.  ignoring MM decreases accuracy, increases precision
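The median polish step can be illustrated on a single probe set. The sketch below assumes a small table of log2, background-corrected PM values (rows are probes, columns are chips); the function name `median_polish` and the fixed number of sweeps are choices made for this example, not part of any RMA package.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Fit value = overall + probe effect + chip effect + residual by median polish.

    table: list of rows (probes), each a list of log2 PM values (one per chip).
    Returns the per-chip expression estimates, the probe effects and the residuals.
    """
    rows, cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column (chip) medians.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + c for c in col_eff]
    return expression, row_eff, resid


# Toy probe set: 3 probes with affinities -1, 0, +1 on 4 chips with levels 8..11.
table = [[a + c for c in (8.0, 9.0, 10.0, 11.0)] for a in (-1.0, 0.0, 1.0)]
expression, probe_effects, residuals = median_polish(table)
print(expression)
```

Because the toy table is exactly additive, the chip-level estimates recover the chip levels and the residuals vanish; on real data the residuals absorb outlying probes, which is what makes the summary robust.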



Robust Multi-array Average (GCRMA)

MM specific binding affinity

Need to model that and include it in the estimation of the signal - GC content of the probes

Background adjustment: based on sequence specificity brightness in the PM probes.

Although it is a model-based approach (it defines a model, then tries to fit the experimental data to the model), it DOES need multiple samples!

Assumption: The input is a group of samples that have the same distribution of intensities.

Nature Biotechnology 22, 656 - 658 (2004) doi:10.1038/nbt0604-656b


Probe Logarithmic Intensity Error (PLIER)

The PLIER algorithm was developed by Affymetrix and released in 2004. It is part of several commercially available … It incorporates experimental observations of feature behavior. It uses a probe affinity parameter, which represents the strength of a signal produced at a specific concentration for a given probe. The probe affinities are calculated using data across arrays. The error model employed by PLIER assumes error is proportional to observed intensity, rather than to background-subtracted intensity. It assumes that the error of the mismatch probe is the reciprocal of the error of the perfect match probe.

Improved precision over MAS 5


Models {PM,MM} distributions for each probe-set

[Figure: probability distributions of the PM and MM signal values]

puma: gMOS and multi-mgMOS

Gamma functions are used to model the positive probe intensities.

mgMOS: MM and PM are drawn from a joint probability distribution:

$$p(y_{gj}, m_{gj}) = \int db_{gj}\, p(b_{gj})\, p(y_{gj}, m_{gj} \mid a_g, \alpha_g, b_{gj})$$

The $b_{gj}$ are latent variables reflecting the different binding affinity of probes within the probe-set.

mmgMOS: specific MM binding and multiple information across chips:

$$y_{gjc} \sim \mathrm{Ga}(a_{gc} + \alpha_{gc},\, b_{gj}), \qquad m_{gjc} \sim \mathrm{Ga}(a_{gc} + \phi\,\alpha_{gc},\, b_{gj})$$

where $y_{gjc}$ and $m_{gjc}$ represent, respectively, the PM and MM intensities of the j-th probe pair in the g-th probe set on the c-th chip (gamma distributions with the same inverse scale parameter $b_{gj}$, which is probe-pair specific). The shape parameters of the two gamma distributions are the sum of the background term $a_{gc}$ and the true specific hybridization signal term $\alpha_{gc}$, which are probe-set and chip specific.

A computationally efficient method is used for estimating the posterior distribution of the signal: the posterior is unimodal and approximated with a truncated Gaussian.



Normalisation

Why do we need to normalise the data?

1. we want to compare across chips

2. we need to ensure that all the data is equally compared across baseline within the chip

Most methods will have a normalisation step incorporated; some others will need to perform it after gene expression estimation.

•  Scaling – Mean and Median
•  Quantile
•  Loess


Normalisation: scaling

The assumption that mapping using quantiles or scaling is reasonable is based on the assumption that “most genes don’t change”, and quantiles use this more extensively than scaling. If this underlying assumption is doubtful, then using the above methods is not advisable.

Simply linearly scale the gene expression so that the overall mean / median are the same. The median is more scale-invariant, but for the most part there is little practical difference.
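As a rough illustration of linear scaling, the sketch below rescales each array so that all arrays share the same median. The function name `scale_to_common_median` and the choice of the average median as the default target are assumptions made for this example.

```python
from statistics import median


def scale_to_common_median(arrays, target=None):
    """Linearly scale each array so that its median equals a common target.

    arrays: list of arrays, each a list of (non-log) expression values.
    target: common median; defaults to the average of the per-array medians.
    """
    medians = [median(a) for a in arrays]
    if target is None:
        target = sum(medians) / len(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]


chips = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # second chip is twice as bright
scaled = scale_to_common_median(chips)
print(scaled)
```

Replacing `median` with the mean gives mean scaling; only a single multiplicative factor per array is estimated in either case.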


Normalisation: quantile

Assume that the distributions of probe intensities should be completely the same across chips. Start with n arrays, and p probes, and form a [p,n] matrix X. Sort the columns of X, so that the entries in a given row correspond to a fixed quantile. Replace all entries in that row with their mean value. Finally, undo the sort so that each value returns to the position of its original probe.
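The procedure just described can be sketched directly on a small matrix. This is an illustrative version only: the function name `quantile_normalise` is made up, and ties within a column are broken by original order rather than averaged.

```python
def quantile_normalise(X):
    """Quantile-normalise a [p, n] matrix given as p rows (probes) of n values (arrays)."""
    p, n = len(X), len(X[0])
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # Sort each column so that row r holds the r-th smallest value of every array.
    sorted_cols = [sorted(col) for col in columns]
    # Mean of each quantile across arrays.
    row_means = [sum(sc[r] for sc in sorted_cols) / n for r in range(p)]
    # Put the quantile means back in the original probe order of each array.
    out = [[0.0] * n for _ in range(p)]
    for j, col in enumerate(columns):
        order = sorted(range(p), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = row_means[rank]
    return out


X = [[2.0, 10.0],
     [4.0, 30.0],
     [6.0, 20.0]]
print(quantile_normalise(X))
```

After normalisation every column contains exactly the same set of values (here 6, 12 and 18), only in a different probe order.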



Two-color array normalisation

Global Normalisation methods assume the two dyes are related by a constant factor

Local normalisation methods assume that the dye factor is dependent on:

― Spot intensity (defined as A = log2 √(RG)).
― Location on the array.

Most common methods are:
•  print-tip effect correction
•  intensity dependent Loess


•  Visualise the effect: M-A plot

•  Correction of the intensity dependent variations:


LOESS normalisation: LOcally (W)Eighted polynomial regreSSion.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots.

JASA 74 829-836.

M = Adjusted Log Red – Adjusted Log Green

A = (Adjusted Log Green + Adjusted Log Red) / 2
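The M and A values can be computed spot by spot from the two channels. The sketch below assumes the inputs are already-adjusted, strictly positive red and green intensities and uses log base 2; the function name `ma_values` is illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities on the log2 scale."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


red = [8.0, 4.0]
green = [2.0, 4.0]
m, a = ma_values(red, green)
print(m, a)
```

Plotting M against A shows any intensity-dependent dye bias as a curved trend; a global normalisation subtracts a constant from M, while loess subtracts a smooth function of A.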



Benchmarks and comparisons

http://affycomp.biostat.jhsph.edu/ Affycomp III: A Benchmark for Affymetrix GeneChip Expression Measures

Alternatives to microarrays:

•  NGS gene expression quantification (e.g. RNA-seq)
•  Affymetrix GeneChip Junction Arrays

There are no established sites and platforms for this yet. Benchmark datasets are available.

SEQanswers - the NGS community http://seqanswers.com/forums/showthread.php?t=10797

The MicroArray Quality Control (MAQC)-‐II study Nat. Biotech (2010)


For NGS gene expression quantification….

A lot changes … but

QUANTIFICATION
•  Estimation of gene expression
•  Alternative splicing identification
•  Alternative isoform detection
•  Transcripts abundance

NORMALISATION
•  Normalise read counts
•  Normalise reads between lanes
•  Normalise reads against transcripts abundance and gene length
•  Varying sequencing depth
•  Other technical effects

HIGH LEVEL ANALYSIS
•  Visualise and interpret
•  Differential expression
•  Clustering


Data Visualisation

[Figure panels: Scatter Plot; Box Plot (Cy3 and Cy5 channels for Slide 1 and Slide 2, marking minimum, Q1 = 25th percentile, median, Q3 = 75th percentile and maximum); MA Plot (axes: Log Abundance, Log Fold Change)]



Principal Component Analysis

It is one of the most commonly used techniques to visualise and interpret high dimensional data.

It identifies the maximum spread of the data, maximising the variance by rotating the space where the data lives.

It uses a set of variables that are hidden to the user and are implicitly explained by the data (latent variables).

Every direction found that extracts informative features from the “noisy” cloud of data points is called a principal component.

Dimensionality reduction
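To make "the direction of maximum variance" concrete, the sketch below finds the first principal component of a tiny data set. It uses power iteration on the sample covariance matrix, which is a simple stand-in chosen for this example rather than what a statistics package would do, and the function name `first_principal_component` is illustrative.

```python
import math


def first_principal_component(data, n_iter=200):
    """Return (direction, variance, scores) for the first principal component.

    data: list of samples, each a list of d measurements.
    Assumes the data are not all identical (non-zero variance).
    """
    n, d = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(d)]
    centred = [[row[k] - means[k] for k in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    scores = [sum(r[k] * v[k] for k in range(d)) for r in centred]
    return v, variance, scores


points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # all on the line y = 2x
direction, variance, scores = first_principal_component(points)
print(direction, variance)
```

For points lying on a line, the first component points along that line and carries all of the variance, so a single score per sample describes the data: this is the dimensionality reduction.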


Principal Component Analysis (cont…)

Y Haile-Selassie et al. Nature 483, 565-569 (2012) doi:10.1038/nature10922

[Figure: PCA plot, axes PC1 and PC2]

Usually reasonable, but it assumes that the uncertainty associated with each gene is constant.

Alternatives: non-linear transformation of gene expression (Huber et al. 2002), PUMA PCA (Sanguinetti et al., 2005)


Clustering

•  basic idea: group together genes that have a similar pattern of expression across conditions or across time

•  what do we mean by similar?

•  different measures of similarity: Euclidean distance, angle between vectors, correlation coefficient, . . .

•  Shared pattern of expression might be associated with similar functions
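The three similarity measures listed can be compared on a pair of expression profiles. This is a small sketch with illustrative function names; it assumes profiles of equal length that are neither all zero nor constant.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angle_between(x, y):
    """Angle (radians) between two profiles viewed as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.acos(cos)


def correlation(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]  # same shape, twice the magnitude
print(euclidean_distance(gene_a, gene_b), angle_between(gene_a, gene_b),
      correlation(gene_a, gene_b))
```

The two toy genes are far apart in Euclidean distance yet have zero angle and correlation 1, which shows why the choice of measure changes what "similar" means.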



Hierarchical Clustering

•  builds a hierarchy of clusters

•  bottom up (merging clusters) or top down (splitting clusters)

•  Eisen et al. (1998). The genes that are most correlated are joined together, the expression value for the resulting node is the average expression of the two (or more) genes. The similarity matrix is then updated with the new node.

•  Different similarity measures lead to different interpretations
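A bottom-up scheme in the spirit of the description above can be sketched as follows. It is not the Eisen et al. software: the function names are illustrative, correlation is recomputed at every step instead of updating a stored similarity matrix, and profiles are assumed to be non-constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hierarchical_cluster(profiles):
    """Repeatedly join the two most correlated nodes.

    profiles: dict mapping gene name -> list of expression values.
    Returns the list of merges as (node_a, node_b, correlation).
    """
    nodes = {name: (list(vals), 1) for name, vals in profiles.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                r = pearson(nodes[names[i]][0], nodes[names[j]][0])
                if best is None or r > best[0]:
                    best = (r, names[i], names[j])
        r, a, b = best
        (va, na), (vb, nb) = nodes.pop(a), nodes.pop(b)
        # New node: average expression of all genes it contains.
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(va, vb)]
        nodes[f"({a},{b})"] = (merged, na + nb)
        merges.append((a, b, r))
    return merges


genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
for step in hierarchical_cluster(genes):
    print(step)
```

The two co-varying genes are joined first, and the anti-correlated gene joins last with a negative correlation; swapping `pearson` for a distance-based measure would generally change the tree.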


The Fold Change

Given two gene expression values x and y the fold change is defined as FC = x/y

Given two vectors xj and yj of gene expression measurements for controls and cases for GENE j, the fold change is defined as

FCj = μ(xj) / μ(yj), where μ(·) denotes the mean over samples.

It can also appear as a difference (of log-expression values).
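Both forms of the fold change can be computed in a couple of lines. This sketch assumes strictly positive, non-log expression values; the function names are illustrative.

```python
import math
from statistics import mean


def fold_change(x, y):
    """Ratio of mean expression between the two groups, FC = mean(x) / mean(y)."""
    return mean(x) / mean(y)


def log2_fold_change(x, y):
    """The same comparison written as a difference on the log2 scale."""
    return math.log2(mean(x)) - math.log2(mean(y))


group_x = [8.0, 8.0]
group_y = [2.0, 2.0]
print(fold_change(group_x, group_y), log2_fold_change(group_x, group_y))
```

A four-fold ratio corresponds to a log2 difference of 2; with only a few samples per group, a single outlier shifts the mean and therefore the fold change considerably.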


Differential expression Analysis

GOAL: Identify the most differentially expressed genes across different conditions or cases and controls.

HOW: identify a threshold that “defines” differential expression
•  FC values
•  p-values
•  we need to quantify the False Discovery Rate

What happens if the sample size is small?
•  The fold-change becomes very sensitive to outliers
•  The t-test becomes very sensitive to small variances
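One common way to quantify the False Discovery Rate from a list of per-gene p-values is the Benjamini-Hochberg adjustment. The slides do not name a specific procedure, so its use here is an assumption made for illustration, and the function name is made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]  # made-up p-values for four genes
print(benjamini_hochberg(pvals))
```

Genes whose adjusted value falls below a chosen FDR level (for example 0.05) are reported as differentially expressed; here only the first gene would pass at 0.05, although three raw p-values are below 0.05.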



The problem of Multiple sampling

Part III
