alignment and preprocessing for data analysisdepts.washington.edu/cpac/activities/meetings/... ·...

Alignment and Preprocessing for Data Analysis

• Preprocessing tools for chromatography

• Basics of alignment

• GC‐FID (1D) data and issues

– PCA

– F‐Ratios

• GC‐MS (2D) data and issues

– PCA

– F‐Ratios

– PARAFAC

• Piecewise Alignment GUI (available online)

– synoveclab.chem.washington.edu/Downloads.htm

– Email for username/password

Tools for Analysis: Classification

• Principal Component Analysis (PCA)

• Degree of Class Separation (DCS)

DCS is based on D_{A,B}, the distance between classes A and B in the scores plot:

$D_{A,B} = \sqrt{(X_A - X_B)^2 + (Y_A - Y_B)^2}$

[Figure: PCA of chromatographic data. Data = Scores × Loadings + Error; loadings are plotted against retention time, and scores are plotted on PC1.]
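As a rough illustration of the two classification tools named above, the sketch below computes PCA scores and loadings with an SVD and then a degree of class separation from the scores. The data are synthetic, the function names are hypothetical, and the normalization of the centroid distance by sqrt(s_A^2 + s_B^2) is an assumption about how DCS is scaled, not something taken from the slides.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Mean-center X (samples x retention-time points); return scores and loadings via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings

def degree_of_class_separation(scores_a, scores_b):
    """Centroid distance D_AB divided by sqrt(s_A^2 + s_B^2).

    ASSUMPTION: s_A^2 and s_B^2 are taken as the summed score variances of each class.
    """
    d_ab = np.linalg.norm(scores_a.mean(axis=0) - scores_b.mean(axis=0))
    s_a2 = scores_a.var(axis=0, ddof=1).sum()
    s_b2 = scores_b.var(axis=0, ddof=1).sum()
    return d_ab / np.sqrt(s_a2 + s_b2)

# Synthetic chromatograms: class B has an extra peak at 6 min (illustrative only).
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)

def peak(center):
    return np.exp(-0.5 * ((t - center) / 0.1) ** 2)

class_a = np.array([peak(4.0) + 0.01 * rng.standard_normal(t.size) for _ in range(10)])
class_b = np.array([peak(4.0) + 0.5 * peak(6.0) + 0.01 * rng.standard_normal(t.size)
                    for _ in range(10)])
X = np.vstack([class_a, class_b])

scores, loadings = pca_scores(X)
dcs = degree_of_class_separation(scores[:10], scores[10:])
print(f"DCS = {dcs:.1f}")
```

A larger DCS means the class centroids are far apart relative to the scatter within each class.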

Why Align?

• Misalignment degrades classification

– PCA

• Misalignment increases uncertainty in quantification

– PARAFAC

• Misalignment occurs frequently

– Daily instrument variation causes misalignment

– Correction is necessary to apply these methods

Retention Time Precision & PCA

[Figure: PCA loadings (vs. retention time) and scores (PC1) for 10 replicates with limited shifting and for 10 replicates with large shifting. Annotation: DCS = 32.]

Basics of Alignment

• Types of alignment algorithms

– Cross correlation coefficient

– Correlation Optimized Warping (COW)

– *Piecewise alignment

• Alignment Parameters

– Window Size

– Shift

• Target Selection

– PCA

– Correlation Coefficient

– Windowed Target

Alignment Algorithms

• Cross correlation coefficient

– Move the entire chromatogram to maximize correlation

• Correlation Optimized Warping (COW)

– Separate the chromatogram into windows

– Warp and move the windows to optimize the correlation

– Find the best alignment path to correct the data

• *Piecewise alignment

– Separate the chromatogram into windows

– Shift the windows to optimize the correlation

– Find the best alignment path to correct the data
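The piecewise steps listed above can be sketched as follows. This is a minimal illustration, not the algorithm behind the Piecewise Alignment GUI: the function name, the use of the Pearson correlation as the criterion, and the linear interpolation of window shifts into an alignment path are all assumptions.

```python
import numpy as np

def piecewise_align(sample, target, window, max_shift):
    """Shift each window of `sample` (by up to +/- max_shift points) to best match `target`."""
    n = sample.size
    padded = np.pad(sample, max_shift, mode="edge")  # lets windows slide past the ends
    centers, shifts = [], []
    for start in range(0, n - window + 1, window):
        ref = target[start:start + window]
        best_shift, best_r = 0, -np.inf
        for s in range(-max_shift, max_shift + 1):
            lo = start + max_shift + s
            seg = padded[lo:lo + window]
            if seg.std() == 0 or ref.std() == 0:
                continue  # flat segment: correlation undefined, keep zero shift
            r = np.corrcoef(seg, ref)[0, 1]
            if r > best_r:
                best_shift, best_r = s, r
        centers.append(start + window / 2)
        shifts.append(best_shift)
    idx = np.arange(n)
    # ASSUMPTION: the "alignment path" is the window shifts interpolated to every point.
    shift_path = np.interp(idx, centers, shifts)
    aligned = np.interp(idx + shift_path, idx, sample)
    return aligned, np.array(shifts)

# Synthetic example: the sample is the target delayed by 5 points.
idx = np.arange(600)

def chromatogram(offset):
    return sum(np.exp(-0.5 * ((idx - c - offset) / 8.0) ** 2) for c in range(50, 600, 100))

target = chromatogram(0)
sample = chromatogram(5)
aligned, shifts = piecewise_align(sample, target, window=100, max_shift=10)
corr_before = np.corrcoef(sample, target)[0, 1]
corr_after = np.corrcoef(aligned, target)[0, 1]
print(shifts, round(corr_before, 3), round(corr_after, 3))
```

Setting `window` to the full chromatogram length reduces this to the single global shift of the cross correlation approach.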

Alignment Parameters

• Window Size, W

– Too small: relative movement is high, so the quality of alignment is difficult to determine

– Too large: insufficient flexibility to correct peak-to-peak shifting

– Window size ~1 peak (pk)

• Shift, L

– Too small: insufficient movement of segments

– Too large: increases computation time

– 0 < shift ≤ maximum shift

• Determining correct values

– Alignment metric

Target Selection

• *Global Approach

– All chromatograms are initially collected

– *User chosen target

– PCA optimized target

• Scores are produced for every sample

• The sample with the minimum distance from the center is the target

– *Maximum correlation target

• Calculate the product of each chromatogram's correlations to the others

• The chromatogram with the maximum product is the target

• Online Approach

– An initial target is set

– Chromatograms are aligned as they are collected

– The target changes as new chromatograms are collected
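The two automatic target choices in the global approach can be sketched as below. The function names are hypothetical, and two details are assumptions: the "center" is taken to be the origin of the mean-centered scores, and two principal components are used.

```python
import numpy as np

def max_correlation_target(X):
    """Index of the chromatogram with the largest product of correlations to the others."""
    R = np.corrcoef(X)                 # samples x samples correlation matrix
    np.fill_diagonal(R, 1.0)           # self-correlation does not change the product
    products = np.prod(R, axis=1)
    return int(np.argmax(products)), products

def pca_center_target(X, n_components=2):
    """Index of the sample whose PCA scores lie closest to the center of the scores."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    dist = np.linalg.norm(scores, axis=1)   # scores are mean-centered, so the center is 0
    return int(np.argmin(dist)), dist

# Synthetic example: one peak shifted by -2..+2 points across five chromatograms.
idx = np.arange(200)
X = np.array([np.exp(-0.5 * ((idx - 100 - s) / 10.0) ** 2) for s in (-2, -1, 0, 1, 2)])

target_corr, products = max_correlation_target(X)
target_pca, dist = pca_center_target(X)
print(target_corr, target_pca)
```

In this example both criteria pick the unshifted chromatogram in the middle of the set, which minimizes the largest shift any other sample needs.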

Target Selection (Maximum Correlation)

[Figure: correlation metric vs. sample #, with the annotation "Sample # 9 is the target"; chromatograms of sample types 1, 2, and 3 vs. retention time (min), shown with the target.]

Alignment of GC‐FID (1D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA

– F‐Ratio

Preprocessing Tools for Chromatography

• Noise filtering

– Median filter

• Baseline correction

• Normalization

• Alignment
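The first three tools can be sketched on a single chromatogram as follows. The function names are hypothetical, the baseline is assumed to be a straight offset line estimated from the first and last few points, and normalization is division by the total signal; real data may need a different baseline model.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(y, width=5):
    """Running median with an odd window width; suppresses narrow spikes."""
    half = width // 2
    padded = np.pad(y, half, mode="edge")
    return np.median(sliding_window_view(padded, width), axis=1)

def subtract_offset_line(y, n_edge=20):
    """Subtract a straight line through the mean of the first and last n_edge points.

    ASSUMPTION: both ends of the chromatogram are free of peaks.
    """
    x = np.arange(y.size)
    x0, x1 = x[:n_edge].mean(), x[-n_edge:].mean()
    y0, y1 = y[:n_edge].mean(), y[-n_edge:].mean()
    slope = (y1 - y0) / (x1 - x0)
    return y - (y0 + slope * (x - x0))

def normalize_total_signal(y):
    """Divide by the total signal so that the chromatogram sums to 1."""
    return y / y.sum()

# Synthetic chromatogram: one peak on a sloping baseline, plus a one-point noise spike.
x = np.arange(300)
raw = np.exp(-0.5 * ((x - 150) / 10.0) ** 2) + 2.0 + 0.01 * x
raw[50] += 100.0

filtered = median_filter(raw, width=5)
corrected = subtract_offset_line(filtered)
normalized = normalize_total_signal(corrected)
print(raw[50], filtered[50], normalized.sum())
```

The filter width should stay well below the peak width, otherwise the median starts to flatten real peaks.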

Experimental

• GC‐FID Separation

• 4 diesel sample types

• 9 minute separation

• 17 replicate injections over 5 days

Noise Filtering (Solvent, Spikes, and Outliers)

[Figure: GC-FID separation vs. retention time (min) with a noise spike marked, and the median filtered data, in which the spike is minimized.]

Solvent Dominated Info

[Figure: GC-FID separation vs. retention time (min) with the solvent marked, and the scores plot with a low-signal outlier.]

Outlier (Low Signal)

[Figure: GC-FID separation vs. retention time (min) and scores plot, with the low-signal outlier marked.]

No Outlier (Baseline / Normalization on PC1)

[Figure: GC-FID separation vs. retention time (min) and scores plot, with no outlier.]

Baseline Corrected and Normalized Data

Subtract offset line from data, then divide by total signal.

[Figure: baseline corrected and normalized data vs. retention time (min), and scores plot showing tight clustering in PCA.]

Experimental

• GC Separation

• 3 gasoline sample types

• 6 replicate injections over 2 days

• Misalignment is due to day-to-day instrument variation

Alignment GUI

[Screenshot: the GUI is initially blank to guide the user through alignment. Data can be loaded from the workspace or from a file.]

Alignment GUI

[Screenshot: controls for zoom & pan, choosing or estimating the target, choosing or estimating the parameters, loading and viewing data sizes, and aligning data.]

Alignment GUI

[Screenshot: data before alignment, data after alignment, and the estimated target and parameters. The aligned data can now be saved to the workspace or a .mat file.]

PCA Classification of Gasoline

[Figure: chromatograms of Samples 1, 2, and 3 vs. retention time (min), and the scores plot. Annotation: we want better separation of groups.]

Fisher Ratio Method (F‐Ratio)

Works well for samples that have large amounts of within-class variance.

Works best when comparing a small number of sample classes.

For each mass channel, calculate the Fisher ratio at each point in 2D space:

$\text{Fisher Ratio} = \dfrac{\sigma^2_{\text{btw class}}}{\sigma^2_{\text{within class}}}$

[Figure: GC‐MS separations.]
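As a sketch of the ratio above for 1D chromatograms, the code below computes the between-class variance over the within-class variance at every retention-time point and then picks the ten points with the largest ratios. The function name, the one-way ANOVA degrees of freedom, and the small `floor` added to the denominator are assumptions; the data are synthetic.

```python
import numpy as np

def fisher_ratio(X, labels, floor=1e-12):
    """Between-class / within-class variance at each point of X (samples x points)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    var_between = ss_between / (n_classes - 1)
    var_within = ss_within / (n_samples - n_classes)
    return var_between / (var_within + floor)   # floor avoids division by zero

# Synthetic data: 3 classes x 6 replicates; only the peak at point 140 differs by class.
rng = np.random.default_rng(0)
idx = np.arange(200)
labels = np.repeat([0, 1, 2], 6)
amplitude = np.array([0.5, 1.0, 1.5])[labels]
common = np.exp(-0.5 * ((idx - 60) / 5.0) ** 2)
varying = np.exp(-0.5 * ((idx - 140) / 5.0) ** 2)
X = common + amplitude[:, None] * varying + 0.05 * rng.standard_normal((18, 200))

f = fisher_ratio(X, labels)
top_ten = np.argsort(f)[-10:]          # indices of the ten largest Fisher ratios
print(int(np.argmax(f)), np.sort(top_ten))
```

The peak shared by all classes at point 60 gets a low ratio even though it is large, which is why selecting on the Fisher ratio can sharpen a later PCA.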

Fisher Ratios

[Figure: Fisher ratios vs. retention time (min) with no alignment and after alignment; the top ten are selected.]

Selected Chromatographic Region

[Figure: regions of interest vs. retention time (min), and the scores plot showing tight clustering of sample types.]

Summary

• Removal of solvents or artifacts is essential

• Baseline correction is an important step

• Alignment is essential for improving classification

• The F‐Ratio algorithm can further improve classification

Alignment of GC‐MS (2D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA

– F‐Ratio

• PARAFAC

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• Misalignment is due to day-to-day instrument variation


Alignment

[Figure: workflow panels labeled Sample and Target (vs. retention time), Total Ion Current Shift Function (TIC‐SF; shift vs. retention time), GC‐MS Data, and Alignment.]
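One way to read the TIC‐SF workflow is: sum each GC-MS data set over m/z to get the total ion current, estimate a shift function from the sample and target TICs, then apply that same shift to every mass channel. The sketch below follows that reading. It is an interpretation rather than the documented method, and the function names, windowed correlation search, and linear interpolation are assumptions.

```python
import numpy as np

def tic_shift_function(sample, target, window, max_shift):
    """Per-point shift estimated from the TICs of two (time x m/z) arrays."""
    tic_t = target.sum(axis=1)
    tic_s = np.pad(sample.sum(axis=1), max_shift, mode="edge")
    n = tic_t.size
    centers, shifts = [], []
    for start in range(0, n - window + 1, window):
        ref = tic_t[start:start + window]
        best_shift, best_r = 0, -np.inf
        for s in range(-max_shift, max_shift + 1):
            lo = start + max_shift + s
            seg = tic_s[lo:lo + window]
            if seg.std() == 0 or ref.std() == 0:
                continue  # flat window: leave the shift at zero
            r = np.corrcoef(seg, ref)[0, 1]
            if r > best_r:
                best_shift, best_r = s, r
        centers.append(start + window / 2)
        shifts.append(best_shift)
    return np.interp(np.arange(n), centers, shifts)

def apply_shift_function(sample, shift_fn):
    """Resample every mass channel along retention time using the same shift function."""
    n = sample.shape[0]
    idx = np.arange(n)
    aligned = np.empty(sample.shape, dtype=float)
    for j in range(sample.shape[1]):
        aligned[:, j] = np.interp(idx + shift_fn, idx, sample[:, j])
    return aligned

# Synthetic GC-MS data: two analytes with different spectra; the sample is delayed 4 points.
t = np.arange(400)[:, None]
spec_a = np.array([1.0, 0.5, 0.0])
spec_b = np.array([0.0, 0.8, 1.0])

def gcms(offset):
    return (np.exp(-0.5 * ((t - 100 - offset) / 6.0) ** 2) * spec_a
            + np.exp(-0.5 * ((t - 300 - offset) / 6.0) ** 2) * spec_b)

target = gcms(0)
sample = gcms(4)
shift_fn = tic_shift_function(sample, target, window=200, max_shift=8)
aligned = apply_shift_function(sample, shift_fn)
print(shift_fn[:3], np.abs(aligned - target).max())
```

Estimating the shift once on the TIC keeps the mass spectrum at each time point intact, since all channels move together.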

Alignment

[Figure: alignment results vs. retention time (min), shown for the full chromatogram and for an expanded region.]

Classification of Full MS Data

[Figure: PCA with full MS. GC‐MS data (retention time, min, by m/z) for Samples 1, 2, and 3, and the scores plot.]

Fisher Ratio Method

Works well for samples that have large amounts of within-class variance.

Works best when comparing a small number of sample classes.

For each mass channel, calculate the Fisher ratio at each point in 2D space:

$\text{Fisher Ratio} = \dfrac{\sigma^2_{\text{btw class}}}{\sigma^2_{\text{within class}}}$

Sum Fisher ratios along the mass channel dimension.

[Figure: GC‐MS separations are converted to Fisher ratios for samples.]
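For GC-MS data the same ratio can be computed at every (retention time, m/z) point and then summed along the mass channel dimension, as in the sketch below. The array layout (samples x time x m/z), the function name, and the degrees of freedom are assumptions; the data are synthetic.

```python
import numpy as np

def fisher_ratio_map(data, labels, floor=1e-12):
    """Fisher ratio at every (time, m/z) point of data shaped (samples, time, m/z)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_samples, n_classes = data.shape[0], classes.size
    grand_mean = data.mean(axis=0)
    ss_between = np.zeros(data.shape[1:])
    ss_within = np.zeros(data.shape[1:])
    for c in classes:
        block = data[labels == c]
        class_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((block - class_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes) + floor)

# Synthetic data: 3 classes x 4 replicates, 100 time points, 5 mass channels.
# Only the peak at time point 60 on mass channel 2 differs between classes.
rng = np.random.default_rng(1)
n_time, n_mz = 100, 5
t = np.arange(n_time)[:, None]
labels = np.repeat([0, 1, 2], 4)
amplitude = np.array([0.5, 1.0, 1.5])[labels]
peak = np.exp(-0.5 * ((t - 60) / 4.0) ** 2)            # shape (time, 1)
spectrum = np.array([0.0, 0.0, 1.0, 0.0, 0.0])         # signal on channel 2 only
data = (amplitude[:, None, None] * (peak * spectrum)[None]
        + 0.05 * rng.standard_normal((12, n_time, n_mz)))

f_map = fisher_ratio_map(data, labels)    # time x m/z
f_time = f_map.sum(axis=1)                # summed along the mass channel dimension
print(int(np.argmax(f_time)), int(np.argmax(f_map[60])))
```

The summed trace `f_time` can be used to select time regions, while `f_map` still shows which mass channels carry the class differences.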

Simulated Example Using F‐Ratios for PCA

[Figure: simulated data aligned with W = 400 and L = 1000. Panels: TIC vs. retention time, 2D F‐ratios, and scores. The F‐ratios show differences between all samples.]

F‐Ratios

[Figure: GC‐MS separation and GC‐MS Fisher ratios vs. retention time (min), with the top Fisher ratios marked.]

Procedure

• Select masses using mass spectral information

• Select time regions using F‐Ratios

• Combine to reduce the data set

• Improve results of PCA

Classification of Full MS Data

[Figure: regions of interest (with MS values) vs. retention time (min), and scores plot of nearest samples. Annotation: "Clear Separation on 1st".]

Quantification of GC‐MS Data

• Use aligned, baseline corrected and normalized data

• Use PARAFAC of small regions for analysis

– Match values to mass spectra

– Peak Sums

– Peak Profiles

Quantification

• Target analyte Parallel Factor Analysis (PARAFAC) isolates the pure component peak and mass spectral information from overlapping peaks and background, for both identification and quantification.

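Data Matrix = Analyte 1 + Analyte 2 + ... + Noise

$\text{Data} = \mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{z}_1 + \mathbf{x}_2 \otimes \mathbf{y}_2 \otimes \mathbf{z}_2 + \dots + \text{Noise}$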
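The decomposition above can be illustrated with a small alternating least squares (ALS) PARAFAC in NumPy. This is a bare-bones sketch with no constraints, no convergence check, and a fixed number of iterations, so it is not a substitute for a PARAFAC toolbox. The array layout (retention time x m/z x sample) and all names are assumptions, and the data are synthetic and noise-free.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R), shape (J*K) x R."""
    return (B[:, None, :] * C[None, :, :]).reshape(-1, B.shape[1])

def parafac_als(X, rank, n_iter=500, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = rng.random((I, rank)), rng.random((J, rank)), rng.random((K, rank))
    X1 = X.reshape(I, J * K)                        # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)     # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)     # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Synthetic example: two overlapping analytes measured in four samples.
rng = np.random.default_rng(0)
t = np.arange(40)[:, None]
profiles = np.exp(-0.5 * ((t - np.array([17, 23])) / 4.0) ** 2)      # peak profiles, 40 x 2
spectra = rng.random((30, 2))                                        # mass spectra, 30 x 2
conc = np.array([[1.0, 0.2], [0.5, 1.0], [2.0, 0.7], [0.3, 1.5]])    # concentrations, 4 x 2
X = np.einsum("ir,jr,kr->ijk", profiles, spectra, conc)

A, B, C = parafac_als(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Here the columns of `A`, `B`, and `C` play the roles of peak profile, mass spectrum, and relative analyte amount per sample. Each factor is only determined up to scale and ordering, so quantification needs a reference or standard.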

PARAFAC Results

Compound: malate, TMS

# of factors: 3

Chosen comp.: #3

Match value: 948

Peak sum: 4413882

Peak height: 376188

[Figure: PARAFAC results; intensity vs. column 1 time and column 2 time.]

PARAFAC for GC‐MS Data

[Figure: GC‐MS data for Sample 1 and Sample 2 are stacked along the sample dimension; running PARAFAC yields the peak profile, the mass spectrum, and the analyte concentration in Sample 1 and Sample 2.]

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• Misalignment is due to day-to-day instrument variation

• Looking for isobutyl benzene in the gasoline

Isobutyl Benzene Spectrum

[Figure: mass spectrum of isobutyl benzene.]

PARAFAC Results for GC‐MS Data

[Figure: selected analysis region for PARAFAC vs. retention time (min). Qualitative info: mass spectrum. Quantitative info: peak profile and peak sum.]

PARAFAC Results for GC‐MS Data

[Figure: mass spectra of Components 1, 2, and 3, with match values to isobutyl benzene of MV_IBB = 403 and MV_IBB = 788 shown; the best match to isobutyl benzene is marked.]

Summary

• Mass spectral chromatographic data should be aligned

• An available alignment algorithm is able to align 1‐D and 2‐D data

• F‐Ratio methods can be used to improve classification

• PARAFAC can be effectively used to separate overlapped analytes
