simple matching algorithms for library search of organic mass spectra

International Journal of Mass Spectrometry and Ion Physics, 47 (1983) 321-324 Elsevier Scientific Publishing Company, Amsterdam - Printed in The Netherlands


SIMPLE MATCHING ALGORITHMS FOR LIBRARY SEARCH OF ORGANIC MASS SPECTRA

J. HOLLOS

CHINOIN Pharmaceutical Works Ltd., Budapest, Hungary, H-1045

ABSTRACT

The success of retrievals depends mainly on the composition of the database and the quality of the included spectra; however, prefilters and matching algorithms also influence it.

INTRODUCTION

Nowadays commercial mass spectrometer data systems contain spectrum libraries and search facilities (refs. 1-5), thus a systematic study of the relation between matching algorithms and their performance seems to be an important task.

New algorithms (ref. 6) were developed. As preprocessing, peak intensities were replaced by numerical codes. The value of a code is high if the peak is an important one and lower for less important peaks.

ENCODING OF INTENSITY VALUES

In organic mass spectrometry peak intensities show variations owing to instrumental and other factors. Matching algorithms which use abundance data face some problems, e.g. peak intensities of 70% and 80% can be regarded as the same, while a 2% peak and a 12% one differ significantly. A measure of similarity may be the arithmetic mean of the abundances of matching peaks divided by the intensity difference, or it can be obtained by other computations. Choosing a logarithmic scale of abundances was correct for large peaks, but the importance of small peaks was over-estimated.
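One possible reading of the measure mentioned above (mean abundance of a matching peak pair divided by its intensity difference) can be sketched as follows. The function name and the treatment of identical intensities are assumptions and are not taken from the original work.

```python
import math

def pair_similarity(a: float, b: float) -> float:
    """Mean abundance of two matching peaks divided by their intensity difference."""
    diff = abs(a - b)
    if diff == 0:
        return math.inf  # assumption: identical intensities count as a perfect match
    return ((a + b) / 2) / diff

# Same absolute difference (10 %), very different similarity.
print(pair_similarity(70, 80))  # large peaks: regarded as nearly the same
print(pair_similarity(2, 12))   # small peaks: significantly different
```

With such a measure the 70%/80% pair scores roughly ten times higher than the 2%/12% pair, which is the behaviour the text asks for.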

Replacing peak intensities by numerical codes

Peak intensities are considered semi-quantitative, i.e. large, medium-sized and small peaks are distinguished. A particular peak is regarded as large if it is more intense than an upper threshold. Small peaks are those which do not reach a lower threshold. The remainder are medium-sized.

0020-7381/83/0009-0000/$03.00 © 1983 Elsevier Scientific Publishing Company


The most important peaks are the largest ones in each peak group. Relatively small peaks can be significant at high mass numbers, but in the low mass range only some of the large peaks carry useful information. Values of both thresholds were chosen low in the molecular peak range and gradually higher coming down towards the low mass numbers.

Features of spectra vary to a great extent. Sometimes a lot of large peaks can be found in a spectrum and some of them are not important. Higher threshold values were able to select the largest peaks. In other spectra only one or two large peaks exist, but the most abundant peaks of low intensity groups may be significant. In the latter case threshold values were chosen lower.

Furthermore the rank order or relative position of peaks was noted in each peak group, i.e. which peak has the 1st, 2nd, 3rd, etc. largest intensity. In a peak group there could be found, for example, a large peak as the 1st, a medium-sized peak as the 2nd, a small peak as the 3rd, etc.

In this manner peaks were classified and their intensities were replaced by high or gradually smaller code values according to their significance. This encoding had to be done only once, as preprocessing, by a properly developed algorithm.
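The encoding described in this section can be illustrated by a minimal sketch. The original gives no numerical thresholds, code values or peak-group definition, so every number below, the 14-unit group width and the rank bonus are assumptions chosen only to make the example concrete.

```python
from collections import defaultdict

# All numbers below are hypothetical; the paper gives no threshold or code values.
CLASS_CODE = {"large": 4, "medium": 2, "small": 1}

def thresholds(mass, mol_weight, at_mol=(2.0, 10.0), at_zero=(10.0, 40.0)):
    """(lower, upper) thresholds in % of base peak: lowest at the molecular
    peak, rising linearly towards low mass numbers."""
    frac = 1.0 - min(mass, mol_weight) / mol_weight
    lower = at_mol[0] + frac * (at_zero[0] - at_mol[0])
    upper = at_mol[1] + frac * (at_zero[1] - at_mol[1])
    return lower, upper

def size_class(mass, intensity, mol_weight):
    """Large above the upper threshold, small below the lower one, else medium."""
    lower, upper = thresholds(mass, mol_weight)
    if intensity > upper:
        return "large"
    if intensity < lower:
        return "small"
    return "medium"

def encode(spectrum, mol_weight, group_width=14):
    """Replace intensities by codes; spectrum maps mass number -> intensity (%)."""
    groups = defaultdict(list)
    for mass in spectrum:
        groups[mass // group_width].append(mass)  # assumed peak-group definition
    codes = {}
    for masses in groups.values():
        ranked = sorted(masses, key=lambda m: spectrum[m], reverse=True)
        for rank, mass in enumerate(ranked, start=1):
            code = CLASS_CODE[size_class(mass, spectrum[mass], mol_weight)]
            if rank == 1:
                code += 1  # assumed bonus for the largest peak of its group
            codes[mass] = code
    return codes

# Made-up spectrum, not measured data.
example = {15: 20.0, 42: 10.0, 43: 100.0, 59: 5.0, 74: 15.0}
print(encode(example, mol_weight=74))
```

Because the thresholds fall towards the molecular peak, the 15% peak at mass 74 is classed as large here, while the 20% peak at mass 15 is only medium-sized.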

SCORING

A simple forward search took into account every matching peak, summing up the lower of the two intensity codes. Maximum score values could be obtained when each peak in the unknown matched and the peaks in the library spectra had higher or at least equal intensity codes. Reference spectra having a lot of large peaks were preferred, as superfluous peaks in references were not regarded.
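A minimal sketch of the forward search as described above, assuming that encoded spectra are stored as dictionaries mapping mass number to intensity code (a representation chosen for illustration, with made-up code values):

```python
def forward_score(unknown: dict, reference: dict) -> int:
    """Sum the lower intensity code over every peak of the unknown that
    also occurs in the reference; extra reference peaks are ignored."""
    return sum(
        min(code, reference[mass])
        for mass, code in unknown.items()
        if mass in reference
    )

# Hypothetical encoded spectra (mass number -> intensity code).
unknown = {43: 3, 59: 1, 74: 2}
reference = {43: 3, 74: 3, 87: 2}
print(forward_score(unknown, reference))
```

The peak at mass 87, present only in the reference, does not affect the score, which is why references with many large peaks are favoured.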

Reverse search (ref. 7) gave the same result as the forward one, except when a peak was missing from the unknown spectrum. In the latter case the sum of scores was decreased according to the intensity code of the peak present in the particular library spectrum. Reverse search gave good results when the unknown spectrum had more peaks than the relevant library spectra.
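The reverse search can be sketched in the same way. Encoded spectra are again assumed to be dictionaries of mass number to intensity code, and subtracting the full code of a reference peak that is absent from the unknown is an assumption; the text only says the score was decreased "according to" that code.

```python
def reverse_score(unknown: dict, reference: dict) -> int:
    """Walk over the reference peaks: matching peaks add the lower code,
    peaks missing from the unknown subtract the reference code."""
    score = 0
    for mass, ref_code in reference.items():
        if mass in unknown:
            score += min(unknown[mass], ref_code)
        else:
            score -= ref_code  # assumed penalty: the full code of the missing peak
    return score

# Hypothetical encoded spectra (mass number -> intensity code).
unknown = {43: 3, 59: 1, 74: 2}
reference = {43: 3, 74: 3, 87: 2}
print(reverse_score(unknown, reference))
```

Extra peaks of the unknown (mass 59 here) are not penalised, consistent with the remark that reverse search works well when the unknown has more peaks than the library spectrum.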

Combined search considered both "match" and "no match" cases. Score values were decreased when peaks were missing from either the unknown or the library spectrum but were present in the other one. Moreover, any intensity differences between matching peaks were weighted, too. Spectra with minimum differences were retrieved into the first places. However, the method was sensitive to the investigated mass range, because reference spectra of lower molecular weight compounds did not contain all peaks of the unknown and vice versa.
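A sketch of the combined search under stated assumptions: encoded spectra are dictionaries of mass number to intensity code, a missing peak costs its full code, and the code difference of matching peaks is subtracted with a hypothetical weight `diff_weight`. None of these details are specified in the original.

```python
def combined_score(unknown: dict, reference: dict, diff_weight: int = 1) -> int:
    """Reward matching peaks, penalise code differences between them and
    penalise peaks present in only one of the two spectra."""
    score = 0
    for mass in unknown.keys() | reference.keys():
        u = unknown.get(mass)
        r = reference.get(mass)
        if u is not None and r is not None:
            score += min(u, r) - diff_weight * abs(u - r)
        else:
            score -= u if r is None else r  # code of the unmatched peak
    return score

# Hypothetical encoded spectra (mass number -> intensity code).
unknown = {43: 3, 59: 1, 74: 2}
reference = {43: 3, 74: 3, 87: 2}
print(combined_score(unknown, reference))
```

Since unmatched peaks on both sides lower the score, comparing spectra over different mass ranges is penalised heavily, which reflects the sensitivity noted in the text.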

All three methods were able to identify various compounds even from reference spectra which differed from those measured. Retrievals were carried out both on the mass number scale and on the neutral loss scale.
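The neutral loss scale can be illustrated as follows, using the usual definition of a neutral loss as the molecular weight minus the mass number of the fragment. The dictionary representation and the decision to drop peaks above the molecular weight are assumptions of this sketch.

```python
def to_neutral_loss(spectrum: dict, mol_weight: int) -> dict:
    """Re-index an encoded spectrum by neutral loss (MW - mass number)."""
    return {
        mol_weight - mass: code
        for mass, code in spectrum.items()
        if mass <= mol_weight  # assumption: peaks above MW are ignored
    }

# Hypothetical encoded spectrum (mass number -> intensity code), MW = 74.
encoded = {43: 5, 59: 3, 74: 5}
print(to_neutral_loss(encoded, mol_weight=74))
```

After this transformation the same scoring functions can be applied, so that homologous compounds sharing the same losses line up even though their mass numbers differ.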

PREFILTERS

In most investigated cases unwanted coincidences could be observed, that is, references having similar spectra but different structures got into the first places. A number of irrelevant references could be discarded by appropriate prefilters. With one prefilter the hetero atom contents were limited, e.g. when retrieving alkyl esters the references had to contain 2 oxygen atoms and no other hetero atoms. With another prefilter only references having the molecular weight of homologous compounds (MW + n·14) were considered. Best results were obtained by applying both prefilters simultaneously. The effectiveness of retrievals depended on the number of similar spectra of related compounds present in the library.
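The two prefilters can be sketched as simple list filters. The `Reference` record, its field names and the small library below are hypothetical; allowing negative n (lower homologues) in the molecular weight filter is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Reference:
    name: str
    mol_weight: int      # nominal molecular weight
    hetero_atoms: dict   # element symbol -> count, e.g. {"O": 2}

def hetero_filter(refs, required):
    """Keep references whose hetero atom content equals `required` exactly."""
    return [r for r in refs if r.hetero_atoms == required]

def homolog_filter(refs, mol_weight):
    """Keep references whose MW differs from `mol_weight` by a multiple of 14."""
    return [r for r in refs if (r.mol_weight - mol_weight) % 14 == 0]

# Small illustrative library (nominal molecular weights).
library = [
    Reference("methyl formate", 60, {"O": 2}),
    Reference("methyl acetate", 74, {"O": 2}),
    Reference("ethyl acetate", 88, {"O": 2}),
    Reference("diethyl ether", 74, {"O": 1}),
    Reference("n-butylamine", 73, {"N": 1}),
]

# Searching for an alkyl ester of MW 74: apply both prefilters together.
candidates = homolog_filter(hetero_filter(library, {"O": 2}), 74)
print([r.name for r in candidates])
```

In this example the molecular weight filter alone still admits diethyl ether, and only the combination with the hetero atom filter leaves the esters, in line with the observation that both prefilters together worked best.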

PARTITIONED FILE STRUCTURE

Some commercial data systems use a sequential data base divided into files. Library spectra are arranged in order of increasing molecular weight (MW). This type of file partitioning was investigated: the 1st file contained reference spectra of MW less than 150, the 2nd file spectra of MW = 150-220, etc.

All investigated unknowns had a MW less than 120, thus the relevant reference spectra were found in the 1st file. Most of the retrievals showed higher score values in library spectra of the 1st file compared to the 2nd one.
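Selecting the file that holds a given molecular weight can be sketched with a sorted list of boundaries. Only the first two ranges are stated in the text; integer molecular weights and the function name are assumptions of this sketch.

```python
import bisect

# Lowest MW of the 2nd and 3rd files. These follow the text
# (file 1: MW < 150, file 2: MW 150-220); further bounds would be assumptions.
FILE_BOUNDS = [150, 221]

def file_number(mol_weight: int) -> int:
    """1-based number of the library file holding spectra of this MW."""
    return bisect.bisect_right(FILE_BOUNDS, mol_weight) + 1

print(file_number(119), file_number(150), file_number(220))
```

All unknowns investigated here (MW below 120) map to the 1st file, so a search restricted to that file would already cover the relevant references.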

COMPUTER TECHNIQUE

Computations were done on an ODRA-1305 computer of CHINOIN Pharmaceutical Works Ltd. Programs written in FORTRAN were stored on disc, the data base on tape. Spectra of unknowns were fed from cards. Investigated compounds were alkyl esters (methyl, ethyl, propyl esters of formic, acetic, propionic acids), alkyl ethers (ethyl-butyl, n-propyl, isopropyl, methyl-butyl, ethyl-isopropyl ethers) and alkyl amines (n-butyl, isobutyl, tert-butyl, isopropyl-methyl amines). The best 5 or 20 full reference spectra were listed as output. Each of them was separately merged into and compared with the unknown spectrum. All matching and remaining peaks (mass numbers, abundances and code values) were presented for human examination and interpretation of search results.

CONCLUSIONS

Matching algorithms predetermine which reference spectra reach the highest score values and get into the first places of output lists. However, similar spectra do not mean similar structures, because unwanted coincidences may occur. Most of the first places are filled with related compounds only when enough similar spectra of isomeric and homologous compounds are included in the library. Better similarities were found between homologous compounds of close molecular weights than between those having considerable differences in molecular size. Knowing the molecular weight and/or hetero atom contents of unknowns, irrelevant library spectra can be discarded.

Sums of scores are high when a lot of large peaks are present in the investigated mass range. A few missing peaks among a great many large ones have relatively small negative contributions. A single missing large peak means a significant decrease of the score value when the investigated mass range is small or only a few large or medium-sized peaks can be found in it.

REFERENCES

1 R.G. Ridley, in G.R. Waller (Ed.), Biochemical Applications of Mass Spectrometry, John Wiley and Sons, Chichester, 1972, chap. 6.

2 G.M. Pesyna and F.W. McLafferty, in F.C. Nachod, J.J. Zuckerman and E.W. Randall (Eds.), Determination of Organic Structures by Physical Methods, Academic Press, New York, 1976, vol. 6, chap. 2.

3 J.R. Chapman, Computers in Mass Spectrometry, Academic Press, London, 1978.

4 D. Henneberg, Adv. Mass Spectrometry, 8B (1980) 1511-1531.

5 D.P. Martinsen, Appl. Spectroscopy, 35 (1981) 255-266.

6 J. Hollós, Magyar Kémiai Folyóirat, 86 (1980) 321-326.

7 F.P. Abramson, Anal. Chem., 47 (1975) 45-49.
