prediction and simulation of mass spectrafiehnlab.ucdavis.edu/downloads/staff/...course-7.pdf5...

27
1 Welcome! Mass Spectrometry meets Cheminformatics WCMC Metabolomics Course 2013 Tobias Kind Course 7: Prediction and simulation of mass spectra Biology Chemistry Informatics http://fiehnlab.ucdavis.edu/staff/kind CC-BY License

Upload: others

Post on 24-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

1

Welcome!

Mass Spectrometry meets Cheminformatics

WCMC Metabolomics Course 2013

Tobias Kind

Course 7: Prediction and simulation

of mass spectra

Biology

Chemistry

Informatics

http://fiehnlab.ucdavis.edu/staff/kind CC-BY License

2

History of artificial intelligence and mass spectrometry

Dendral project at Stanford University (USA)

Started in 1960s

Pioneered approaches in artificial intelligence (AI)

Aim:

Prediction of isomer structures from mass spectra

Idea: Self-learning or intelligent algorithm

Participants:

Lederberg, Sutherland, Buchanan, Feigenbaum,

Duffield, Djerassi, Smith, Rindfleisch, many others…

Status:

Failed; but inspirational until today (50 years later)

[Dendral PDF]

Figure: Heuristic DENDRAL:

A Program for Generating Explanatory Hypotheses in Organic Chemistry

Overview: DENDRAL: a case study of the first expert system for scientific

hypothesis formation; Artificial Intelligence 61 (1993) 209-261

3

Prediction and simulation of mass spectra

A) Prediction of the isomer structure or substructures from a given mass spectrum

The structure is directly deduced from the mass spectrum or generated by

a molecular isomer generator or existing structures can be found in a structure database

B) Simulation of a mass spectrum from a given isomer structure

The mass spectral peaks and abundances are generated by a machine learning algorithm

The structures can be obtained from a isomer database (PubChem, LipidMaps)

or a sequence database (Swiss-Prot, NCBI) in case of proteins

(mainlib) Coronene

40 60 80 100 120 140 160 180 200 220 240 260 280 3000

50

100

100 122 136

150

168 222 246 268

300

(mainlib) Coronene

40 60 80 100 120 140 160 180 200 220 240 260 280 3000

50

100

100 122 136

150

168 222 246 268

300

4

Prediction of substructures from mass spectra

Picture source: amdis.net

Working examples for EI mass spectra:

Varmuza classifiers in AMDIS and MOLGEN-MS

Substructure algorithm (Stein S.E.)

Implemented in NIST-MS search program

Mass spectral classifiers for supporting systematic structure elucidation

Varmuza K., Werther W., J. Chem. Inf. Comput. Sci., 36, 323-333 (1996).

Chemical Substructure Identification by Mass Spectral Library Searching

S.E. Stein, J. Am. Soc. Mass Spectrom., 1995, 6, (644-655)

5

Substructures deduced from mass spectra for

generation of isomer structures

Picture source: amdis.net

1) Molecular formula must be known - can be detected from molecular ion and isotopic pattern

2) Good-list (substructure exists) and bad-list (substructure not existent) approach

3) Sub-structures are combined in deterministic or stochastic (random) manner

4) Database or molecular isomer generator (combinatorial, graph theory) approach for

generating or finding possible structure candidates

Example:

Molecular formula C6ClH5O;

calculated from molecular ion

Goodlist:

Badlist:

Database (Chemspider): 25 hits

(including all possible existing structures)

MOLGEN Demo:

All constructed isomers: 8372

-benzene

-hydroxy

-chlorine

Total: 3 possible results

6

Aristo – ontology classification of EI mass spectra

http://www.ionspectra.org/aristo/

7

Application: Decision tree supported substructure

prediction of metabolites from GC-MS profiles

Source: Metabolomics. 2010 Jun;6(2):322-333. Epub 2010 Feb 16.

Decision tree supported substructure prediction of metabolites from GC-MS profiles.

Hummel J, Strehmel N, Selbig J, Walther D, Kopka J.

Decision tree Spectrum Compound structure

http://gmd.mpimp-golm.mpg.de/

8

Submit: mass spectral information.

Result: Ranked hit list of molecules

http://msbi.ipb-halle.de/MetFrag/

METFRAG: In silico fragmentation for mass spectral identification

9

Simulation of mass spectra

Why is simulation of mass spectral fragmentation important?

Imagine – you have a structure database of all molecules

Imagine – you can simulate mass spectra for all these molecules

Imagine – you can match your experimental spectra against a database of calculated spectra

Isomer DB Heuristic or de-novo

algorithm to generate

MS or MS/MS spectra MS DB

of theoretical spectra

10 30 50 70 90 110 130 150 170 1900

50

10031

43

60

73

119

131144

10 30 50 70 90 110 130 150 170 1900

50

10031

43

60

73

119

131144

Theoretical spectrum is depiction only, not truly simulated.

(*)

(*)

10

General methods for simulation of mass spectra

Heuristics or

rule based algorithms

Ab-initio or de-novo

first principle methods

Superior, because they solve

the problem at the root;

Slower to implement

Solve practical problems;

Faster to implement

EI-MS

CID-MS

CID-MS/MS

LipidBlast

QCEIMS

QCEIMS: Towards first principles calculation of electron impact mass spectra of molecules.; Grimme S.; Angew Chem Int Ed Engl. 2013 Jun

10;52(24):6306-12. doi: 10.1002/anie.201300158;

LipidBlast: LipidBlast in silico tandem mass spectrometry database for lipid identification. Kind T, Liu KH, Lee do Y, Defelice B, Meissen JK, Fiehn O.

Nature Methods. 2013 Aug;10(8):755-8. doi: 10.1038/nmeth.2551.

Approach Domain Example

11

Simulation of alkane mass spectra (I)

Approach

Use of artificial neural networks (ANN) (machine learning)

Electron impact spectra 70 eV

Substructure descriptors were used for calculation

Selection of 44 m/z positions – training was performed for correct intensity

117 noncyclic alkanes and 145 noncyclic alkenes

training set: 236 molecules

prediction set: 26 compounds

Problems

Prediction or validation set very small (should be 30%)

Prediction of molecular ion (usually very low abundant)

Overfitting possible, works only for selected substance classes

Source: WIKI

Source: Jalali-Heravi M. and Fatemi M. H.; Simulation of mass spectra of noncyclic alkanes and alkenes using artificial neural network

12

Simulation of alkane mass spectra (II)

Source: Jalali-Heravi M. and Fatemi M. H.; Simulation of mass spectra of noncyclic alkanes and alkenes using artificial neural network

Analytica Chimica Acta; Elsevier permission use for coursepack/classroom material

2,3,3-trimethylpentane (a and b) and 2,3,4-trimethylpentane (c and d). OKVWYBALHQFVFP-UHFFFAOYAT RLPGDEORIPLBNF-UHFFFAOYAR

Structures: Chemspider

13

Simulation or prediction of oligosaccharide spectra

(carbohydrate sequencing)

See Oscar and FragLib

See GlySpy

Source: Congruent Strategies for Carbohydrate Sequencing.

3. OSCAR: An Algorithm for Assigning Oligosaccharide Topology from MSn Data

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1435829

Consistent building blocks (sugars)

Consistent fragmentation allows in-silico fragment prediction

Pre-calculated fragments from known structures can be stored in database (use NIST-MS-Search)

Algorithm works also on-the-fly without database

De-novo algorithms work for truly unknown structures

14

Simulation of peptide fragmentations

(De-novo sequencing of peptides)

Principle:

De-novo sequencing of peptides (determine amino acid sequences)

De-novo algorithms can perform permutations and combinatorial calculations

from all 20 amino acids (superior if the sequence is not found in a database)

Highly dependent on good mass accuracy (less than 1 ppm) of precursor ion and MS/MS fragments

Generate match score by matching in-silico fragments against experimental MS/MS spectrum

Problems:

Leucine and isoleucine have same mass

Post translational modifications (PMTs)

Missing fragment peaks

Picture source: MWTWIN help file2 (Monroe/PNNL)

Picture 2 source: Tandem mass spectrometry data quality assessment by self-convolution

Keng Wah Choo and Wai Mun Tham http://www.biomedcentral.com/1471-2105/8/352

In-silico fragmentation with MassFrontier

using fragmentation library of 20,000 mechanisms from literature

Result fragmentations are represented as bar code spectra (same abundance)

16

Simulation of lipid tandem mass spectra (I)

Picture: Thanks to Yetukuri et al. BMC Systems Biology 2007 1:12 doi:10.1186/1752-0509-1-12

Single examples

Similar structures; plus CH2 in side chains sn1 and sn2; double bonds possible

Similar and almost constant fragmentation rules

Loss of head group (diagnostic ion in MS and MS/MS spectrum)

Loss of rest one (R1) and rest two (R2) can be observed in MS/MS spectrum

17

Combinatorial scaffold library design

sn1 = alkyl or acyl rest

sn2 = alkyl or acyl rest

head group

C10

C12

C14

C16

Scaffold (conserved) Linker Functional group (variable)

+ LipidMaps nomenclature name generation

+ accurate isotopic fragment calculation

+ mass spectral peak annotation

+ heuristic peak abundance modeling (CID voltage dependent)

+ conversion into mass spectral library format

choline

inositol

glycerol

Source: LipidBlast

18

Simulation of lipid tandem mass spectra (II)

Spectrum Source:Lipidmaps.org

C45H82NO8PGPCho269.2481303.2324526.3297544.3403492.3453510.355920:4(5Z,8Z,11Z,14Z)/17:0437796.5856

C45H82NO8PGPCho303.2324269.2481492.3453510.3559526.3297544.340317:0/20:4(5Z,8Z,11Z,14Z)437796.5856

C43H74NO10PGPSer269.2481301.2168526.2569544.2675494.2882512.298820:5(5Z,8Z,11Z,14Z,17Z)/17:0537796.5128

C43H74NO10PGPSer301.2168269.2481494.2882512.2988526.2569544.267517:0/20:5(5Z,8Z,11Z,14Z,17Z)537796.5128

C40H77O13PGPIns227.2011269.2481569.309587.3196527.2621545.272717:0/14:0031797.5180

C40H77O13PGPIns269.2481227.2011527.2621545.2727569.309587.319614:0/17:0031797.5180

FormulaHGsn2 acid(-)sn1 acid(-)M-sn2-H2O+HM-sn2+HM-sn1-H2O+HM-sn1+HAbbrev.DBCMass

C45H82NO8PGPCho269.2481303.2324526.3297544.3403492.3453510.355920:4(5Z,8Z,11Z,14Z)/17:0437796.5856

C45H82NO8PGPCho303.2324269.2481492.3453510.3559526.3297544.340317:0/20:4(5Z,8Z,11Z,14Z)437796.5856

C43H74NO10PGPSer269.2481301.2168526.2569544.2675494.2882512.298820:5(5Z,8Z,11Z,14Z,17Z)/17:0537796.5128

C43H74NO10PGPSer301.2168269.2481494.2882512.2988526.2569544.267517:0/20:5(5Z,8Z,11Z,14Z,17Z)537796.5128

C40H77O13PGPIns227.2011269.2481569.309587.3196527.2621545.272717:0/14:0031797.5180

C40H77O13PGPIns269.2481227.2011527.2621545.2727569.309587.319614:0/17:0031797.5180

FormulaHGsn2 acid(-)sn1 acid(-)M-sn2-H2O+HM-sn2+HM-sn1-H2O+HM-sn1+HAbbrev.DBCMass

Experimental

Mass spectrum

In-silico prediction

of MS/MS mass spectral fragments

Simulation of tandem mass spectra

or MS/MS fragment data from

LipidMaps

19

LipidBlast MS/MS mass spectral modeling

In-silico mass spectra:

• m/z fragments and abundance calculation required

• statistical (computer derived) and heuristic rules (experience of a human expert) 19

732.555 [Da].dta PC 32:1; [M+H]+; GPCho(16:0/ 16:1(7Z))Head to Tail MF=437 RMF=666

360 380 400 420 440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740

0

50

100

50

100

393.40846996748957

496.3987288063696

549.406572225188 613.2630013581689

673.3751991201706

713.4246597050723

673.48083

Experimental MS/MS

precursor m/z = 732.55

in-silico MS/MS match

precursor m/z = 732.55

[M+H]-H2O (-18)

[M+H]-C3H9N (-59)

[M+H]-C5H14NO4P (-183)

[M+H]-sn2

[M+H]-sn1

[M+H]-sn2-H2O

[M+H]-sn1-H2O

20

LipidBlast in-silico modeling of lipid tandem mass spectra

Number MS/MS spectra Number

Number LipidClass Short Number compounds with different adducts MS/MS LIBS

1 Phosphatidylcholines PC 5476 10952 2

2 Lysophosphatidylcholines lysoPC 80 160 2

3 Plasmenylphosphatidylcholines plasmenyl-PC 222 444 2

4 Phosphatidylethanolamines PE 5476 16428 3

5 Lysophosphatidylethanolamines lysoPE 80 240 3

6 Plasmenylphosphatidylethanolamines plasmenyl-PE 222 666 3

7 Phosphatidylserines PS 5123 15369 3

8 Sphingomyelines SM 168 336 2

9 Phosphatidic acids PA 5476 16428 3

10 Phosphatidylinositols PI 5476 5476 1

11 Phosphatidylglycerols PG 5476 5476 1

12 Cardiolipins CL 25426 50852 2

13 Ceramide-1-phosphates CerP 168 336 2

14 Sulfatides ST 168 168 1

15 Gangliosides [glycan]-Cer 880 880 1

16 Monoacylglycerols MG 74 148 2

17 Diacylglycerols DG 1764 3528 2

18 Triacylglycerols TG 2640 7920 3

19 Monogalactosyldiacylglycerols MGDG 5476 21904 4

20 Digalactosyldiacylglycerols DGDG 5476 10952 2

21 Sulfoquinovosyldiacylglycerols SQDG 5476 5476 1

22

Diacylated phosphatidylinositol

monomannoside Ac2PIM1 144 144 1

23 Diacylated phosphatidylinositol dimannoside Ac2PIM2 144 144 1

24 Triacylated phosphatidylinositol dimannoside Ac3PIM2 1728 1728 1

25 Tetraacylated phosphatidylinositol dimannoside Ac4PIM2 20736 20736 1

26 Diphosphorylated hexaacyl Lipid A LipidA-PP 15625 15625 1

Total All libraries 119200 212516 50

Covered adduct ions:

[M+H]+; [M+Na]+; [M+NH4]+; [M-H]-; [M-2H](2-); [M+NH4-CO]+; [M+Na2-H]+; [M]+; [M-H+Na]+; [M+Li]+

LipidBlast in silico tandem mass spectrometry database for lipid identification. Kind T, Liu KH, Lee do Y, Defelice B, Meissen JK, Fiehn O.

Nature Methods. 2013 Aug;10(8):755-8. doi: 10.1038/nmeth.2551.

21

LipidBlast MS/MS search with NIST MS search program

using precursor search and dot-product match can be used by practitioners

Experimental MS/MS list Experimental MS/MS list

Library hit scores Library hit scores

exp. MS/MS exp. MS/MS

in-silico MS/MS in-silico MS/MS

in-silico MS/MS in-silico MS/MS

exp. MS/MS exp. MS/MS

Search speed ~ 100 MS/MS spectra per second (without GUI)

22

LipidBlast example: ion trap mass spectrometer

Agilent Ion Trap SL; [M+Na]+; PC 34:1; [M+Na]+; GPCho(8:0/ 26:1(5Z))Head to Tail MF=593 RMF=917

480 510 540 570 600 630 660 690 720 750 780

0

50

100

50

100

478.2 504.2 577.5

599.4

651.3

723.3

723.49409

Experimental

Library

LC/MS Analysis of Bronchoalveolar Lavage Fluid Phospholipids as Biomarkers for Chronic Lung Inflammation;

Agilent application note; 5989-1491EN; Barroso, Bischoff

1st Hit group

PC 34:1

(42 candidates)

Agilent Ion Trap SL/XCT

So

urc

e: A

gile

nt.

com

Name: PC 34:1; [M+Na]+; GPCho(16:0/18:1(11E))

MW: 782 ID#: 42511 DB: lipidblast-pos

Comment: Parent=782.56759 Mz_exact=782.56759 ; PC 34:1; [M+Na]+; GPCho(16:0/18:1(11E));

C42H82NO8P

8 m/z Values and Intensities:

723.49409 999.00 [M+Na]-C3H9N (-59)

599.50155 600.00 [M+Na]-C5H14NO4P (-183)

544.33807 20.00 [M+Na]-sn1

526.32751 20.00 [M+Na]-sn1-H2O

518.32243 20.00 [M+Na]-sn2

500.31187 20.00 [M+Na]-sn2-H2O

467.25401 40.00 [M+Na]-59-sn1

441.23837 40.00 [M+Na]-59-sn2

Fatty acyl side chains (sn1, sn2) best detected in negative ionization mode

23

LipidBlast example : Hybrid Ion-Trap (IT) and Time-of-Flight (TOF)

Source: A Chloroplastic UDP-Glucose Pyrophosphorylase from Arabidopsis Is the Committed Enzyme for the First Step of Sulfolipid Biosynthesis

Y Okazaki, M Shimojima, Y Sawada et al. The Plant Cell 21:892-909 (2009);

Shimadzu's LCMS-IT-TOF

So

urc

e: s

chim

adzu

.co

m

SHIMADZU LCMS-IT-TOF; [M-H]-; SQDG 34:3; [M-H]-; SQDG(16:0/ 18:3(6Z,9Z,12Z))Head to Tail MF=147 RMF=619

250 300 350 400 450 500 550 600 650 700 750 800

0

50

100

50

100

255.009

537.267

Name: SQDG 34:3; [M-H]-; SQDG(16:0/18:3(6Z,9Z,12Z))

MW: 815 ID#: 106150 DB: lipidblast-neg

Comment: Parent=815.49792 Mz_exact=815.49792 ; SQDG 34:3; [M-H]-;

SQDG(16:0/18:3(6Z,9Z,12Z)); C43H76O12S

559.25784 300.00 [M-H]-sn1

537.27348 300.00 [M-H]-sn2

277.21662 100.00 sn2 FA

255.23226 100.00 sn1 FA

225.00690 999.00 fragment C6H9O7S

Experimental

Library

1st Hit

SQDG 34:3

(8 candidates)

24

LipidBlast example : ion trap mass spectrometer

Thermo Finnigan LCQ/LTQ

Finnigan LCQ DECA ion trap mass spectrometer ; [M-H]-; LipidA PP [14/ 14/ 10/ 16/ 3O-(14)/ 3O-(14)]; [M-H]-;Head to Tail MF=719 RMF=916

1320 1360 1400 1440 1480 1520 1560 1600 1640 1680 1720 1760 1800

0

50

100

50

100

1324.0

1454.2

1552.2

1596.1

1698.2

1552.00785

1796.3Experimental

Library

Structural analysis of lipid A from Escherichia coli O157:H7:K- using thin-layer chromatography and ion-trap mass spectrometry;

Chang-Soo Lee, Yun-Gon Kim, Hwang-Soo Joo, Byung-Gee Kim; J Mass Spectrom. 2004 May;39(5):514-25.

2nd Hit

Lipid A (PP)

(16 candidates)

Name: LipidA PP [14/14/14/14/3O-(12)/3O-(14)]; [M-H]-;

MW: 1796 ID#: 64304 DB: lipidblast-neg

Comment: Parent=1796.21157 Mz_exact=1796.21157 ; LipidA PP [14/14/14/14/3O-(12)/3O-(14)]; [M-H]-;

C94H178N2O25P2; LipidA-PP-[R2(14:0)(3-OH)/R3(14:0)(3-OH)/R2'(14:0)/R3('14:0)/R2'-3-O-(12:0)/R3'-

3O-(14:0)]

9 largest peaks:

1552.00785 999.00 | 1698.23467 600.00 |

1796.21157 500.00 | 1498.05715 300.00 |

1470.02587 300.00 |

1596.03405 250.00 | 1568.00277 250.00 |

1454.03095 250.00 | 1714.22959 50.00 |

9 m/z Values and Intensities:

1796.21157 500.00 [M-H]-

1714.22959 50.00 [M-H]-PO3H

1698.23467 600.00 [M-H]-PO4H3

1596.03405 250.00 [M-H]-PO4H3-R2'-O-FA

1568.00277 250.00 [M-H]-PO4H3-R3'-O-FA

1552.00785 999.00 [M-H]-R2 acyl FA || [M-H]-R3 acyl FA

1498.05715 300.00 [M-H]-PO4H3-R2'-O-FA

1470.02587 300.00 [M-H]-PO4H3-R3'-O-FA

1454.03095 250.00 [M-H]-R2-PO4H3 || [M-H]-R3-PO4H3

25

So

urc

e: W

ater

s.co

m

Waters Synapt HDMS; [M+Na]+; PC 32:0; [M+Na]+; GPCho(16:0/ 16:0)Head to Tail MF=393 RMF=635

120 180 240 300 360 420 480 540 600 660 720

0

50

100

50

100

86.0943

146.9827

184.0753478.3405

573.4851

697.47840

697.4801

756.5524

Source: Direct Tissue Imaging and Characterization of Phospholipids Using MALDI SYNAPT HDMS System; Waters 2008; 720002444en

Emmanuelle Claude, Marten Snel, Therese McKenna, James Langridge;

Name: PC 32:0; [M+Na]+; GPCho(16:0/16:0)

MW: 756 ID#: 42167 DB: lipidblast-pos

Comment: Parent=756.55190 Mz_exact=756.55190 ; PC 32:0; [M+Na]+; GPCho(16:0/16:0); C40H80NO8P

5 m/z Values and Intensities:

697.47840 999.00 [M+Na]-C3H9N (-59)

573.48586 600.00 [M+Na]-C5H14NO4P (-183)

518.32238 20.00 [M+Na]-sn1 || [M+Na]-sn2

500.31182 20.00 [M+Na]-sn1-H2O || [M+Na]-sn2-H2O

441.23832 40.00 [M+Na]-59-sn1 || [M+Na]-59-sn2

Experimental

Library

Waters HDMS Synapt

1st Hit

PC 32:0

LipidBlast example : hybrid quadrupole ion mobility

spectrometry time-of-flight

26

The Last Page - What is important to remember:

Fragmentation and rearrangement rules and ion physics can be programmed into algorithms

Abundance calculations are problematic

Prediction of isomer substructures from mass spectra is possible

Works for reproducible mass spectra

A simplified simulation of mass spectra and simulation of fragmentation pattern

is only possible for certain molecule classes

Works only for peptides, lipids, oligosaccharides, alkanes

Does not work for all other molecules

Does not work with complex (side chain) modifications

Validation, Validation, Validation.

Proof must be given that algorithm works for large diverse sets of molecules (n=100..100,000)

27

Literature (236 min):

Mathematical tools in analytical mass spectrometry [DOI]

Metabolomics, modelling and machine learning in systems biology – towards an understanding of the languages of cells [DOI]

Heuristic DENDRAL: A Program for Generating Explanatory Hypotheses in Organic Chemistry [PDF]

Mass Analysis Peptide Sequence Prediction [LINK]

GlySpy and the Oligosaccharide Subtree Constraint Algorithm (OSCAR)

Mass Frontier for further discussion MOLGEN-MS [LINK]

http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/Structure_Elucidation/

http://fiehnlab.ucdavis.edu/projects/LipidBlast

MetIDB: A Publicly Accessible Database of Predicted and Experimental 1H NMR Spectra of Flavonoids [LINK]

METFRAG: In silico fragmentation for computer assisted identification of metabolite mass spectra [LINK]

Advances in structure elucidation of small molecules using mass spectrometry [Link]

Computational mass spectrometry for small molecules [Link]