Normalization of Large-Scale Metabolomic Studies 2014

Data Normalization Approaches for Large-Scale Metabolomic Studies Dmitry Grapov, PhD


Post on 10-May-2015

DESCRIPTION

Overview of common data normalization approaches, with applications to two large-scale metabolomic studies.

TRANSCRIPT

Page 1: Normalization of Large-Scale Metabolomic Studies 2014

Data Normalization Approaches for Large-Scale Metabolomic Studies

Dmitry Grapov, PhD

Page 2: Normalization of Large-Scale Metabolomic Studies 2014

Analytical Variance

Variation in sample measurements due to sample handling, data acquisition, processing, etc.:
• Masks true biological variability
• Calculated based on replicated measurements
• Can be accounted for using data normalization approaches

Common approaches to minimize analytical variance
• Analytical Standards
• Quality Control Based Normalizations
• Scaling or variance stabilizing normalizations

Case Study: The Environmental Determinants of Diabetes in the Young (TEDDY, https://teddy.epi.usf.edu/)

• >1,000 analytes (GC-TOF and LC-Q-TOF)
• ~12,000 samples collected over 3 yrs (currently >5,500 samples)

Page 3: Normalization of Large-Scale Metabolomic Studies 2014

Need for Normalization

Remove non-biological (e.g. analytical) drift/variance/artifacts in measurements

Panels: Acquisition order; Processing/acquisition batches

Page 4: Normalization of Large-Scale Metabolomic Studies 2014

Principal Component Analysis (PCA) of all analytes, showing QC sample scores

Batch Effects

Drift in >400 replicated measurements across >100 analytical batches for a single analyte

Figure: abundance vs. acquisition batch for QCs embedded among >5,500 samples (1:10) collected over 1.5 yrs

If the biological effect size is less than the analytical variance, then the experiment will incorrectly yield insignificant results (a false negative).

Page 5: Normalization of Large-Scale Metabolomic Studies 2014

Estimation of Analytical Variance

Replicated measurements’ median inter- and intra-batch %RSD

Analyte specific performance across the whole study (inter-batch)

Within batch (intra-batch) performance
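The two %RSD summaries named on this slide can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the QC values are hypothetical and are not taken from the TEDDY data.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def inter_batch_rsd(qc_values):
    """%RSD of one analyte over all QC injections in the study."""
    return percent_rsd(qc_values)


def median_intra_batch_rsd(qc_values, batches):
    """Median of the per-batch %RSDs for one analyte."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, batches):
        by_batch[batch].append(value)
    # A batch needs at least two QC injections to have a standard deviation.
    rsds = [percent_rsd(v) for v in by_batch.values() if len(v) >= 2]
    return median(rsds)


# Hypothetical QC abundances for one analyte: batch 2 reads about 20% high.
qc = [100.0, 102.0, 98.0, 120.0, 122.0, 118.0]
batch = [1, 1, 1, 2, 2, 2]

inter = inter_batch_rsd(qc)
intra = median_intra_batch_rsd(qc, batch)
print(f"inter-batch %RSD: {inter:.1f}")
print(f"median intra-batch %RSD: {intra:.1f}")
```

In this toy example the replicates are tight within each batch, so the intra-batch value is small, while the offset between batches inflates the inter-batch value. That gap is the batch effect the normalizations below try to remove.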

Page 6: Normalization of Large-Scale Metabolomic Studies 2014

Common Normalization Approaches

Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.

Quality control (QC) or reference sample
• Loess (Dunn et al., 2011; locally estimated scatterplot smoothing)
• Batch ratio (mean, median)
• Hierarchical mixed effects (Jauhiainen et al., 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)

Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

Variance Based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalizations (Huber et al., 2002)

Page 7: Normalization of Large-Scale Metabolomic Studies 2014

Scalar Normalization

Assumption: equal X signal per sample where X can be sample sum, mean, median, etc.

• Can correct for batch effects when the assumption is valid
• Simple to apply
• Can hide true biological trends or create false ones when the assumption is violated
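A sample-wise sum normalization can be sketched as below. This is a minimal illustration that assumes X is the sample sum and that every sample is rescaled to the mean total signal; the function name and the data are hypothetical.

```python
from statistics import mean


def sum_normalize(samples):
    """Scale each sample so its total signal equals the mean total signal.

    samples: list of rows, one row of analyte abundances per sample.
    """
    totals = [sum(row) for row in samples]
    target = mean(totals)
    return [[value * target / total for value in row]
            for row, total in zip(samples, totals)]


# Hypothetical data: sample 2 is sample 1 measured at twice the response.
raw = [
    [10.0, 30.0, 60.0],
    [20.0, 60.0, 120.0],
]
normalized = sum_normalize(raw)
print(normalized)
```

Here the two rows become identical after scaling, which is the desired outcome when the difference is purely analytical. If one analyte had truly increased in sample 2, the same scaling would push the other analytes down, which is how this approach can create false trends.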

Page 8: Normalization of Large-Scale Metabolomic Studies 2014

LOESS Optimization, locally weighted non-linear model

LOESS span has a large effect on model fit

span (α) controls the degree of smoothing

Page 9: Normalization of Large-Scale Metabolomic Studies 2014

LOESS Normalization

Panels: raw, span = 0.75, span = 0.005

• Training Set
• Test Set

Page 10: Normalization of Large-Scale Metabolomic Studies 2014

LOESS Normalization

Replicated measurements are used to optimize a locally weighted non-linear model:

1. Double cross-validation (33% test set)
• Span optimized with k-fold cross-validation on the training set
2. Model validated on the test set
3. Use QC derived model to remove analytical variance from samples
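The steps above can be sketched in pure Python. This is a simplified illustration, not the implementation used in the study: the smoother is a tricube-weighted local linear fit (one common LOESS variant), the candidate spans, fold count, and simulated drift are all assumptions, and names such as `loess_predict` and `cv_error` are hypothetical.

```python
import math
import random
from statistics import mean, median, stdev


def loess_predict(x_train, y_train, x_new, span):
    """Tricube-weighted local linear fit evaluated at x_new.

    span is the fraction of training points used in each local fit.
    """
    n = len(x_train)
    k = min(n, max(3, math.ceil(span * n)))
    h = sorted(abs(x - x_new) for x in x_train)[k - 1]  # local bandwidth
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(x_train, y_train):
        dx = x - x_new  # centering makes the intercept the prediction
        u = abs(dx) / h if h > 0 else 2.0
        if u >= 1.0:
            continue
        w = (1.0 - u ** 3) ** 3
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    if sw == 0.0:
        return mean(y_train)
    denom = sw * swxx - swx * swx
    if denom <= 1e-9 * (sw * h) ** 2:
        return swy / sw  # degenerate neighborhood: use the weighted mean
    return (swxx * swy - swx * swxy) / denom


def cv_error(x, y, span, folds=5):
    """Mean squared prediction error from k-fold cross-validation."""
    errors = []
    for f in range(folds):
        fit_x = [xi for i, xi in enumerate(x) if i % folds != f]
        fit_y = [yi for i, yi in enumerate(y) if i % folds != f]
        for i, (xi, yi) in enumerate(zip(x, y)):
            if i % folds == f:
                errors.append((loess_predict(fit_x, fit_y, xi, span) - yi) ** 2)
    return mean(errors)


# Simulated QCs (hypothetical): one every 10 injections, slow drift plus noise.
rng = random.Random(0)
orders = list(range(10, 601, 10))
qc = [100.0 * (1.0 + 0.3 * math.sin(o / 60.0)) + rng.gauss(0.0, 2.0)
      for o in orders]

# Step 1: hold out about 33% of the QCs as a test set.
test_idx = set(range(2, len(orders), 3))
train_x = [o for i, o in enumerate(orders) if i not in test_idx]
train_y = [v for i, v in enumerate(qc) if i not in test_idx]

# Span optimized with k-fold cross-validation on the training set.
spans = [0.1, 0.25, 0.5, 0.75, 1.0]
best_span = min(spans, key=lambda s: cv_error(train_x, train_y, s))

# Step 3: divide out the fitted drift and rescale to the training QC median.
target = median(train_y)


def normalize(order, value):
    return value * target / loess_predict(train_x, train_y, order, best_span)


# Step 2: validate on the held-out QCs.
test_raw = [qc[i] for i in sorted(test_idx)]
test_norm = [normalize(orders[i], qc[i]) for i in sorted(test_idx)]
raw_rsd = 100.0 * stdev(test_raw) / mean(test_raw)
norm_rsd = 100.0 * stdev(test_norm) / mean(test_norm)
print(f"span={best_span}, test %RSD raw={raw_rsd:.1f}, normalized={norm_rsd:.1f}")
```

The same `normalize` function would be applied to study samples by their acquisition order. Judging the model on QCs it never saw is what guards against an overly small span that simply chases noise.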

Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r

Page 11: Normalization of Large-Scale Metabolomic Studies 2014

LOESS Validation

Avoiding overfitting is critical

Legend: Training data, Test data

Page 12: Normalization of Large-Scale Metabolomic Studies 2014

Batch Ratio (BR) Normalization

Legend: Training Set, Test Set

Calculate:
• Batch/analyte specific correction factor = (batch median / global median)
• Apply ratio to samples
• Simple to implement
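A batch ratio correction for a single analyte can be sketched as below. This minimal example assumes the correction factor is estimated from QC injections and that samples are divided by their batch's factor; the function names and values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_factors(qc_values, qc_batches):
    """Per-batch correction factor = batch QC median / global QC median."""
    global_median = median(qc_values)
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, qc_batches):
        by_batch[batch].append(value)
    return {b: median(v) / global_median for b, v in by_batch.items()}


def apply_batch_ratio(values, batches, factors):
    """Divide each measurement by the correction factor of its batch."""
    return [value / factors[batch] for value, batch in zip(values, batches)]


# Hypothetical QC abundances: batch B reads 50% higher than batch A.
qc = [100.0, 104.0, 96.0, 150.0, 156.0, 144.0]
qc_batch = ["A", "A", "A", "B", "B", "B"]
factors = batch_ratio_factors(qc, qc_batch)

# Two study samples with the same true level, measured in different batches.
samples = [50.0, 75.0]
sample_batch = ["A", "B"]
corrected = apply_batch_ratio(samples, sample_batch, factors)
print(factors)
print(corrected)
```

Because each factor rests on a median of only a few QCs per batch, a single aberrant QC can shift a whole batch, which is worth keeping in mind when reading the limitations listed later in the deck.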

Page 13: Normalization of Large-Scale Metabolomic Studies 2014

Case Study I: TEDDY GC-TOF

• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No Internal Standards (ISTDs)

Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization

Page 14: Normalization of Large-Scale Metabolomic Studies 2014

Normalization Performance

Median RSD (%)   count   cumulative %
0-10             17      6
10-20            103     39
20-30            112     75
30-40            57      93
40-50            12      97
50-60            5       99
60-70            3       100
70-80            1       100

Median RSD (%)   count   cumulative %
10-20            75      57
20-30            51      96
30-40            4       99
40-50            1       99
50-60            1       100

LOESS normalization showed optimal performance

Intra-batch

Inter-batch

Page 15: Normalization of Large-Scale Metabolomic Studies 2014

PCA of Normalization Methods

Panels: Raw, LOESS, Batch Ratio, Sum Normalized

Colors = ~3 months

Page 16: Normalization of Large-Scale Metabolomic Studies 2014

BR Normalization Limitations

Legend: Training Set, Test Set

• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match

Page 17: Normalization of Large-Scale Metabolomic Studies 2014

LOESS Limitations

Legend: Training Set, Test Set

LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set

Page 18: Normalization of Large-Scale Metabolomic Studies 2014

Case Study II: TEDDY LC-Q-TOF

• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)

• NOMIS (IS = ISTD)
• qcISTD

Page 19: Normalization of Large-Scale Metabolomic Studies 2014

Internal Standards Normalization

Methods
• qcISTD (QC optimized metabolite/ISTD)

• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)

• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

NOMIS

Page 20: Normalization of Large-Scale Metabolomic Studies 2014

qcISTD Normalization

Use replicated measurements to define optimal internal standard/analyte ratio pairs

1. Double cross-validation (33% test set)
2. k-fold cross-validation on the training set
3. Validate on the test set
4. Apply QC defined surrogate/analyte pairs to samples
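One way to read the steps above is sketched below. The slides do not give the exact selection criterion, so this example assumes the best internal standard is the one whose analyte/ISTD ratio has the lowest %RSD on the training QCs, and that it is kept only if it also lowers %RSD on the held-out QCs. The inner k-fold step is omitted for brevity, and all names and values are hypothetical.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def select_istd(analyte_qc, istd_qc, train_idx, test_idx):
    """Return the ISTD name to pair with the analyte, or None to keep raw."""

    def rsd_on(values, idx):
        return percent_rsd([values[i] for i in idx])

    def ratio_to(name):
        return [a / s for a, s in zip(analyte_qc, istd_qc[name])]

    # Start from the uncorrected analyte and look for a ratio that beats it.
    best_name, best_rsd = None, rsd_on(analyte_qc, train_idx)
    for name in istd_qc:
        candidate = rsd_on(ratio_to(name), train_idx)
        if candidate < best_rsd:
            best_name, best_rsd = name, candidate

    # Accept the pair only if it also helps on QCs not used for selection.
    if best_name is not None:
        if rsd_on(ratio_to(best_name), test_idx) < rsd_on(analyte_qc, test_idx):
            return best_name
    return None


# Hypothetical QC injections sharing one injection-to-injection response factor.
response = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3]
analyte = [100.0 * r for r in response]
istds = {
    "istd_tracking": [50.0 * r for r in response],  # follows the same drift
    "istd_unrelated": [50.0, 40.0, 60.0, 45.0, 55.0, 42.0],
}

selected = select_istd(analyte, istds, train_idx=[0, 1, 3, 4], test_idx=[2, 5])
print(selected)
```

Study samples would then be corrected by dividing each analyte by its selected internal standard, and analytes with no accepted pair would be left uncorrected.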

Number of corrected analytes

Page 21: Normalization of Large-Scale Metabolomic Studies 2014

Comparison of Normalizations

• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision

Panels: Intra-batch, Inter-batch

Page 22: Normalization of Large-Scale Metabolomic Studies 2014

PCA of Normalization Methods

Panels: Raw (RSD = 13%), qcISTD (9%), LOESS (12%), qcISTD + LOESS (8%)

Only normalizations that include LOESS effectively remove analytical batch effects

Page 23: Normalization of Large-Scale Metabolomic Studies 2014

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154