alignment and preprocessing for data analysisdepts.washington.edu/cpac/activities/meetings/... ·...
Post on 22-Mar-2020
40 Views
Preview:
TRANSCRIPT
Alignment and Preprocessing for Data Analysis
• Preprocessing tools for chromatography
• Basics of alignment
• GC‐FID (1D) data and issues– PCA– F‐Ratios
• GC‐MS (2D) data and issues– PCA– F‐Ratios– PARAFAC
• Piecewise Alignment GUI (available online)– synoveclab.chem.washington.edu/Downloads.htm
– Email for username/password
Tools for Analysis: Classification
• Principal Component Analysis (PCA)
• Degree of Class Separation (DCS)
,
22
A B
A B
DCSD
s s
2 2
, ( ) ( )A B A B A BX X Y YD DCS
Scor
e PC
2
Score PC1
DCS
PCA Loadings`
Retention Time
Retention Time
Sign
al
Scor
e PC
2
Score PC1
ScoresX + Error
Why Align?
• Reduction in classification– PCA
• Increase in uncertainty for quantification– PARAFAC
• Misalignment occurs frequently– Daily instrument variation causes misalignment
– Correction is necessary to apply these methods
Retention Time Precision & PCA
DCS=32
DCS=3
PCA
PCA
10 Replicates withlimited shifting
10 Replicates withlarge shifting
Loadings
Retention Time
Retention Time
Retention Time
Retention Time
Sign
alSi
gnal
__
Scor
e PC
2
Score PC1
Score PC1
Scor
e PC
2
Loadings
Scores
Scores
Basics of Alignment
• Types of alignment algorithms– Cross correlation coefficient– Correlation Optimized Warping (COW)
– *Piecewise alignment
• Alignment Parameters– Window Size
– Shift
• Target Selection– PCA– Correlation Coefficient– Windowed Target
Alignment Algorithms
• Cross correlation coefficient– Move the entire chromatogram to maximize correlation
• Correlation Optimized Warping (COW)– Separate the chromatogram into windows
– Warp and move the windows to optimize the correlation
– Find the best alignment path to correct the data
• *Piecewise alignment– Separate the chromatogram into windows
– Shift the windows to optimize the correlation
– Find the best alignment path to correct the data
Alignment Parameters
Parameter
Window Size, W
Shift, L
Description EffectToo Small
Relative movement
high, difficult to determine
quality of alignment
Insufficient movement of
segments
EffectToo Large
Insufficient flexibility to correct peak
to peak shifting
Increases time
Window Size ~1 pk
width
0<shift<=maximum shift
Determining correct values
Alignment Metric
Alignment Metric
L tR
Target Selection
• *Global Approach– All chromatograms are initially collected
– *User chosen target– PCA optimized target
• Scores are produced for every sample
• The sample with the minimum distance from the center is the target
– *Maximum correlation target• Calculate the product of each chromatograms correlation to the others
• The maximum correlation is the target
• Online Approach– An initial target is set– Chromatograms are aligned as they are collected
– The target changes as new chromatograms are collected
Target Selection (Maximum Correlation)
0 2 4 6 8 10 120
10
20
30
40
50
60
8.05 8.1 8.15 8.2
0
1
2
3
4
5
6
7
8
9
10
0 5 10 15 200.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6Sample # 9 is the target
8.05 8.1 8.15 8.2
0
2
4
6
8
10
Type 1Type 2Type 3
TargetType 1Type 2Type 3
Retention Time, min
Sign
al
Retention Time, min
Sign
al
Sample #
Pro
duct
Cor
rela
tion
Alignment of GC‐FID (1D) Data and Issues
• Removal of artifacts and solvent peaks
• Baseline correction and normalization
• Alignment
• Improving PCA– F‐Ratio
Preprocessing Tools for Chromatography
• Noise filtering– Median filter
• Baseline correction• Normalization
• Alignment
Experimental
• GC‐FID Separation• 4 diesel sample types
• 9 minute separation
• 17 replicate injections over 5 days
7 7.2 7.4 7.6 7.8
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Noise Filtering (Solvent, Spikes, and Outliers)
2 4 6 8 10
0
2
4
6
8
10
12
14
7 7.2 7.4 7.6 7.80.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Median Filtered Data
(minimize spike)
Retention Time, min
Sign
al
GC-FID Separation
Sign
alSi
gnal
Retention Time, min
Noise Spike
Noise Spike
Noise Filtering (Solvent, Spikes, and Outliers)
0 2 4 6 80
2
4
6
8
10
12
0.8 1 1.2 1.4 1.6 1.8 2 2.2x 10
9
-2.5
-2
-1.5
-1
-0.5
0
0.5
1
1.5 x 108
Outlier (low signal)
Solvent Dominated Info
Retention Time, min
Sign
al
GC-FID Separation
PC1PC
2
Scores Plot
Solvent
Noise Filtering (Solvent, Spikes, and Outliers)
0 2 4 6 8 100
2
4
6
8
10
12
14
1.5 2 2.5 3 3.5 4 4.5 5x 10
7
-3
-2
-1
0
1
2
3
4
5 x 106
Outlier (Low Signal)
Outlier Low Signal
Retention Time, min
Sign
al
GC-FID Separation
PC1PC
2
Scores Plot
0 2 4 6 8 100
2
4
6
8
10
12
14
Noise Filtering (Solvent, Spikes, and Outliers)
No Outlier
3.4 3.6 3.8 4 4.2 4.4 4.6x 10
7
-3
-2
-1
0
1
2
3
4
5 x 106
No Outlier (Baseline /
Normalization on PC1)
Retention Time, min
Sign
al
GC-FID Separation
PC1PC
2
Scores Plot
Noise Filtering (Solvent, Spikes, and Outliers)
6.2 6.3 6.4 6.5 6.6 6.7 6.8x 10
-3
-4
-2
0
2
4
6
8 x 10-4
0 2 4 6 8 10-0.5
0
0.5
1
1.5
2
2.5 x 10-9
Tight Clustering in PCA
Baseline Corrected and Normalized Data
Subtract offset line from data
Then divide by total signal
Retention Time, min
Nor
mal
ized
Sig
nal
PC1PC
2
Scores Plot
Experimental
• GC Separation• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days• Misalignment is due to day to day instrument
variation
Alignment GUI
Initially blank to guide the user through alignment
Load from workspace
Load from file
Alignment GUI
Zoom & Pan
Choose or Estimate TargetChoose or Estimate Parameters
Load & View Data Sizes
Align Data
Alignment GUI
Now save data to
workspace or .mat file
Data after alignment
Data before
alignment
Estimated
target and
parameters
Alignment GUI
PCA Classification of Gasoline
0.038 0.0385 0.039 0.0395 0.04 0.0405 0.041-5
-4
-3
-2
-1
0
1
2
3 x 10-3
8.05 8.1 8.15 8.2 8.25 8.3 8.35
0
0.5
1
1.5
2
2.5
3
x 10-8
We want better
separation of groups
Retention Time, min
Sign
al
PC1PC
2
Scores Plot
Sample 1
Sample 2
Col 1
Sample 3
GC‐MS
Separations
Fisher Ratio Method (F‐Ratio)
Works well for samples that have large amounts of within
class variance
Works best when comparing a small number of sample
classes
For each mass channel calculate Fisher Ratio at each point in 2D
space,
Fisher Ratio =2
2b tw c la s s
w i th in c la s s
Fisher Ratios
Col 1
0 5 10 150
20
40
60
80
100
120
140
160
180
200
0 5 10 150
100
200
300
400
500
600
700
800
Fisher Ratio Method (F‐Ratio)
8.05 8.1 8.15 8.2 8.25 8.3 8.35
0
0.5
1
1.5
2
2.5
3
x 10-8
No alignment
After alignment
Select Top Ten
Retention Time, min
Sign
al
Selected Chromatographic Region
Retention Time, min
F-R
atio
Retention Time, min
F-R
atio
0 5 10 15-1
0
1
2
3
4
5
6
7 x 10-8
Fisher Ratio Method (F‐Ratio)
5 6 7 8 9 10x 10
-3
-1
-0.5
0
0.5
1
1.5
2
2.5 x 10-3
Retention Time, min
Sign
al
PC1PC
2
Scores Plot
Regions of interest
Tight clustering of sample types
Summary
• Removal of solvents or artifacts is essential
• Baseline correction is an important step
• Alignment is essential for improving classification
• The F‐Ratio algorithm can further improve classification
Alignment of GC‐MS (2D) Data and Issues
• Removal of artifacts and solvent peaks
• Baseline correction and normalization
• Alignment
• Improving PCA– F‐Ratio
• PARAFAC
Experimental
• GC‐MS Separation
• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days• Misalignment is due to day to day instrument
variation
Alignment GUI
Initially blank to guide the user through alignment
Load from workspace
Load from file
Alignment GUI
Zoom & Pan
Choose or Estimate TargetChoose or Estimate Parameters
Load & View Data Sizes
Align Data
Alignment GUI
Now save data to
workspace or .mat file
Data after alignment
Data before
alignment
Estimated
target and
parameters
Alignment GUI
0 2 4
5
10
15
20
Alignment
TIC
Retention Time
Retention Time
ShiftAlignment
Total Ion Current Shift Function (TIC‐SF)SampleTarget
0 2 4
2
4
6
8
10
12
Retention Time
Shift
Total Ion Current Shift Function (TIC‐SF)
GC‐MS Data
Retention Time
Apply
alignment
0 2 4
2
4
6
8
10
12
Retention Time
SampleTarget
GC‐MS Data
Alignment
7.1 7.15 7.2 7.25 7.30
0.5
1
1.5
2
2.5
x 10-8
7.1 7.15 7.2 7.25 7.30
0.5
1
1.5
2
2.5
x 10-8
0 5 10 15-1
0
1
2
3
4
5
6
7 x 10-8
Align
Alignment
Retention Time, min
TIC
Sig
nal
TIC
Sig
nal
Retention Time, min
TIC
Sig
nal
7.05 7.1 7.15 7.2 7.250
0.5
1
1.5
2
2.5
3x 10
-8
0.013 0.0135 0.014 0.0145 0.015-2
-1.5
-1
-0.5
0
0.5
1
1.5x 10-3
PCA w/ full MS
Classification of Full MS Data
Retention Time, min
TIC
Sig
nal
PC1PC
2
Scores Plot
Sample 1m/z
Col 1
m/z
Col 1
Sample 2
m/z
Col 1
Sample 3
Fisher Ratios
for Samples
m/z
Col 1
GC‐MS
Separations
Fisher Ratio Method
Works well for samples that have large amounts of within
class variance
Works best when comparing a small number of sample
classes
For each mass channel calculate Fisher Ratio at each point in 2D
space,
Fisher Ratio =2
2b tw c la s s
w i th in c la s s
Sum Fisher Ratios along mass
channel dimension
Col 1
Simulated Example Using F‐Ratios for PCA
PCA
W = 400L = 1000 Align
Retention Time
m/z
Scores 2D F‐Ratios
Shows differences
between all samples
TICTIC
F-Ratios
GC-M
S Da
ta
4.8 5 5.2 5.4 5.640
50
60
70
80
90
100
100
200
300
400
500
600
700
0 5 10 1540
60
80
100
120
140
100
200
300
400
500
600
700
0 5 10 15-1
0
1
2
3
4
5
6
7 x 10-8
F‐Ratios
Top Fisher Ratios
GC‐MS Fisher Ratios
GC‐MS Fisher Ratios
GC‐MS Separation
Retention Time, min
TIC
Sig
nal
Procedure
• Select masses using mass spectral information
• Select time regions using F‐Ratios
• Combine to reduce the data set
• Improve results of PCA
0 5 10 15-1
0
1
2
3
4
5
6
7 x 10-8
Classification of Full MS Data
1.8 1.9 2 2.1 2.2 2.3 2.4x 10
-3
-2.5
-2
-1.5
-1
-0.5
0
0.5
1
1.5x 10-4
Clear Separation on 1st
PC
PC1P
C2
Scores Plot of Nearest Samples
Retention Time, min
Sign
al
Regions of interest (with MS values)
Quantification of GC‐MS Data
• Use aligned, baseline corrected and normalized data
• Use PARAFAC of small regions for analysis– Match values to mass spectra
– Peak Sums
– Peak Profiles
Quantification
• Target analyte Parallel Factor Analysis (PARAFAC) isolates the
pure component peak and mass spectral information from
overlapping peaks and background for both identification
and
quantification.
Data Matrix = Analyte 1 + Analyte 2 + ... Noisex x
+ +…+=
1 2
y1 y2
z1 z 2
R E
Compound : malate, TMS# of factors: 3Chosen comp.: #3Match value : 948Peak sum : 4413882Peak height : 376188
PARAFAC Results
Col. 1 Time Col. 2 Time
Intensity
Intensity
1.0 1.2 1.4 1.6
0
2
4
6
8
PARAFAC for GC‐MS Data
0 2 4
2
4
6
8
10
12
14
0 2 4
2
4
6
8
10
12
Stack Samples
Retention Time
Sample
Run PARAFAC
Peak Profile
Mass Spectrum
Analyte Concentration
Sample 1
Sample 2
Sample 1
Sample 2
Experimental
• GC‐MS Separation
• 3 gasoline sample types
• 15 minute separation
• 6 replicate injections over 2 days• Misalignment is due to day to day instrument
variation
• Looking for isobutyl benzene in the gasoline
Isobutyl Benzene Spectrum
20 40 60 80 100 120 140 1600
10
20
30
40
50
60
70
80
90
100
Mass
Sign
al
7.04 7.06 7.08 7.1 7.12 7.14 7.16 7.180
1
2
3
4
5
6 x 10-9
0 5 10 15 20 25 30 350
1
2
3
4
5
6
7
8
9 x 10-3
0 5 10 15 200
0.5
1
1.5
2
2.5
3
3.5 x 10-3
20 40 60 80 100 120 140 1600
1
2
3
4
5
6
7 x 10-5
PARAFAC Results for GC‐MS Data
Retention Time, min
TIC
Sig
nal
Quantitative Info
Qualitative Info
Qualitative Info
Mass Spectrum
Peak Profile
Peak Sum
Selected Analysis Region for PARAFAC 1
2
3
1
2
3
1
20 40 60 80 100 120 140 1600
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
20 40 60 80 100 120 140 1600
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
20 40 60 80 100 120 140 1600
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
MVIBB
=403 MVIBB
=788 MVIBB
=910
PARAFAC Results for GC‐MS Data
Component 1 Component 2 Component 3
Best match to isobutyl benzene
Summary
• Mass spectral chromatographic data should be aligned
• An available alignment algorithm is able to align 1‐D and 2‐D data
• F‐Ratio methods can be used to improve classification
• PARAFAC can be effectively used to separate overlapped analytes
top related