International Journal of Mass Spectrometry and Ion Physics, 47 (1983) 321-324 Elsevier Scientific Publishing Company, Amsterdam - Printed in The Netherlands
SIMPLE MATCHING ALGORITHMS FOR LIBRARY SEARCH OF ORGANIC MASS SPECTRA
J. HOLLOS
CHINOIN Pharmaceutical Works Ltd., Budapest, Hungary, H-1045
ABSTRACT
The success of retrievals depends mainly on the composition of the database and the quality of the included spectra; however, prefilters and matching algorithms also influence it.
INTRODUCTION
Nowadays commercial mass spectrometer data systems contain spectrum
libraries and search facilities (refs. 1-5), thus a systematic study of the relation
between matching algorithms and their performance seems to be an important task.
New algorithms (ref. 6) were developed. As preprocessing, peak intensities
were replaced by numerical codes. The value of a code is high if the peak is an
important one and smaller for less important peaks.
ENCODING OF INTENSITY VALUES
In organic mass spectrometry, peak intensities vary owing to
instrumental and other factors. Matching algorithms that use abundance data
therefore face some problems, e.g. peak intensities of 70% and 80% can be regarded
as the same, while a 2% peak and a 12% one differ significantly. A measure of
similarity may be the arithmetic mean of the abundances of matching peaks divided
by the intensity differences, or it can be obtained by other computations. Choosing
a logarithmic scale of abundances was correct for large peaks, but the importance
of small peaks was overestimated.
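The ratio measure mentioned above can be illustrated with a minimal sketch. It assumes the measure is applied to one pair of matching peaks at a time; the function name `similarity` and the handling of identical intensities are illustrative choices, not taken from the paper. The two hypothetical pairs (70/80 and 2/12) mirror the kind of example given in the text.

```python
def similarity(a, b):
    """Arithmetic mean of two abundances divided by their difference.

    Assumption: applied per pair of matching peaks (abundances in %).
    """
    diff = abs(a - b)
    if diff == 0:
        # Assumption: identical intensities are treated as a perfect match.
        return float("inf")
    return ((a + b) / 2) / diff


large_pair = similarity(70, 80)  # mean 75, difference 10
small_pair = similarity(2, 12)   # mean 7, difference 10

print(large_pair, small_pair)
```

Both pairs differ by the same 10 percentage points, yet the large pair scores roughly ten times higher, which is the behaviour the text asks of such a measure.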
Replacing peak intensities by numerical codes
Peak intensities are considered semi-quantitative, i.e. large, medium-sized
and small peaks are distinguished. A particular peak is regarded as large if it is
more intense than an upper threshold. Small peaks are those which do not reach a
lower threshold. The remainder are medium-sized.
0020-7381/83/0009-0000/$03.00 © 1983 Elsevier Scientific Publishing Company
The most important peaks are the largest ones in each peak group. Relatively
small peaks can be significant at high mass numbers, but in the low mass range only
some of the large peaks carry useful information. The values of both thresholds were
chosen low in the molecular peak range and gradually higher coming down towards
the low mass numbers.
Features of spectra vary to a great extent. Sometimes a lot of large peaks can
be found in a spectrum and some of them are not important. Higher threshold
values were able to select the largest peaks. In other spectra only one or two
large peaks exist, but the most abundant peaks of low intensity groups may be
significant. In the latter case threshold values were chosen lower.
Furthermore, the rank order or relative position of peaks was noted in each peak
group, i.e. which peak has the 1st, 2nd, 3rd, etc. largest intensity. In a peak
group there could be found, for example, a large peak as the 1st, a medium-sized
peak as the 2nd, a small peak as the 3rd, etc.
In this manner peaks were classified and their intensities were replaced by high
or gradually smaller code values according to their significance. This encoding
had to be done only once, as preprocessing, by a properly developed algorithm.
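The encoding described in this section can be sketched as follows. This is only an illustration under stated assumptions: the paper gives no numerical thresholds, no definition of a peak group and no code values, so the linear threshold ramp, the 14-unit group width and the `CLASS_CODE` table are all hypothetical, as are the function names.

```python
from collections import defaultdict

GROUP_WIDTH = 14  # assumption: a peak group spans 14 mass units
CLASS_CODE = {"large": 6, "medium": 3, "small": 1}  # assumption: base codes


def thresholds(mass, mol_weight):
    """Return (upper, lower) thresholds in % of base peak.

    Assumption: both are lowest at the molecular peak and rise linearly
    towards low mass numbers; the end values are invented for illustration.
    """
    frac = min(max(mass / mol_weight, 0.0), 1.0)
    upper = 50.0 - 30.0 * frac  # 50% at mass 0, 20% at the molecular peak
    lower = 10.0 - 8.0 * frac   # 10% at mass 0, 2% at the molecular peak
    return upper, lower


def classify(intensity, upper, lower):
    """Large above the upper threshold, small below the lower, else medium."""
    if intensity > upper:
        return "large"
    if intensity < lower:
        return "small"
    return "medium"


def encode(spectrum, mol_weight):
    """Replace intensities ({mass: %}) by codes using size class and rank."""
    groups = defaultdict(list)
    for mass, intensity in spectrum.items():
        groups[mass // GROUP_WIDTH].append((mass, intensity))

    codes = {}
    for peaks in groups.values():
        # Rank 1 is the most intense peak of the group.
        peaks.sort(key=lambda p: p[1], reverse=True)
        for rank, (mass, intensity) in enumerate(peaks, start=1):
            upper, lower = thresholds(mass, mol_weight)
            size = classify(intensity, upper, lower)
            # Assumption: each step down in rank lowers the code by one.
            codes[mass] = max(1, CLASS_CODE[size] - (rank - 1))
    return codes


# Hypothetical spectrum of a compound with molecular weight 88.
spectrum = {29: 40.0, 43: 100.0, 45: 15.0, 61: 12.0, 70: 8.0, 88: 5.0}
codes = encode(spectrum, mol_weight=88)
print(codes)
```

Because the codes depend only on the spectrum itself, a library can be encoded once and the stored codes reused for every search, as the text notes.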
SCORING
A simple forward search took into account every matching peak, summing up
the lower of the two intensity codes for each. Maximum score values were obtained
when every peak in the unknown matched and the peaks in the library spectrum had
higher or at least equal intensity codes. Reference spectra having a lot of large
peaks were preferred, as superfluous peaks in references were not taken into account.
Reverse search (ref. 7) gave the same result as the forward one, except when
a peak was missing from the unknown spectrum. In the latter case the sum of
scores was decreased according to the intensity code of the peak present in the
particular library spectrum. Reverse search gave good results when the unknown
spectrum had more peaks than the relevant library spectra.
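A minimal sketch of the forward and reverse scores follows. It assumes each encoded spectrum is a mapping from mass number to intensity code and that "the lower intensity code" means the smaller of the two codes of a matching pair; the function names and the example codes are hypothetical.

```python
def forward_score(unknown, reference):
    """Sum the lower code over all peaks present in both spectra.

    Peaks that occur only in the reference are ignored.
    """
    return sum(
        min(code, reference[mass])
        for mass, code in unknown.items()
        if mass in reference
    )


def reverse_score(unknown, reference):
    """Forward score minus the codes of reference peaks absent from the unknown."""
    penalty = sum(
        code for mass, code in reference.items() if mass not in unknown
    )
    return forward_score(unknown, reference) - penalty


# Hypothetical encoded spectra ({mass number: intensity code}).
unknown = {43: 6, 45: 2, 61: 3}
reference = {43: 5, 45: 3, 61: 3, 88: 2}

print(forward_score(unknown, reference))  # 5 + 2 + 3 = 10
print(reverse_score(unknown, reference))  # 10 - 2 = 8
```

The two scores differ only through the reference peak at mass 88, which has no counterpart in the unknown.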
Combined search considered both "match" and "no match" cases. Score values
were decreased when peaks were missing from either the unknown or the library
spectrum but were present in the other one. Moreover, any intensity differences
between matching peaks were weighted, too. Spectra with minimum differences were
retrieved into the first places. However, the method was sensitive to the
investigated mass range, because reference spectra of lower molecular weight
compounds did not contain all peaks of the unknown and vice versa.
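One way to sketch the combined search is shown below. The paper does not give the weights, so the unit penalty per missing code and the `diff_weight` parameter are assumptions, as are the function name and the example codes.

```python
def combined_score(unknown, reference, diff_weight=1):
    """Score matches, penalise code differences and peaks missing on either side.

    Both spectra are {mass number: intensity code} mappings.
    """
    score = 0
    for mass in unknown.keys() | reference.keys():
        u = unknown.get(mass)
        r = reference.get(mass)
        if u is not None and r is not None:
            # Match: credit the lower code, subtract the weighted difference.
            score += min(u, r) - diff_weight * abs(u - r)
        else:
            # No match: subtract the code of the peak that has no partner.
            score -= u if u is not None else r
    return score


# Hypothetical encoded spectra.
unknown = {43: 6, 45: 2, 61: 3}
full_reference = {43: 5, 45: 3, 61: 3, 88: 2}
low_mw_reference = {43: 5, 45: 3}  # lacks the peak at mass 61

print(combined_score(unknown, full_reference))    # 4 + 1 + 3 - 2 = 6
print(combined_score(unknown, low_mw_reference))  # 4 + 1 - 3 = 2
```

The second call shows the mass-range sensitivity noted in the text: a reference of lower molecular weight cannot contain the higher-mass peaks of the unknown and is penalised for each of them.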
All three methods were able to identify various compounds, even from reference
spectra that differed from the measured ones. Retrievals were carried out both on
the mass number scale and on the neutral loss scale.
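The conversion between the two scales can be sketched as below, assuming the usual definition of a neutral loss as the molecular weight minus the mass number of the peak. The paper does not spell out its conversion, so this is an assumption, and the function name and example values are hypothetical.

```python
def to_neutral_loss(spectrum, mol_weight):
    """Re-index a {mass number: value} spectrum by neutral loss.

    Assumption: neutral loss = molecular weight - mass number; peaks above
    the molecular weight are dropped.
    """
    return {
        mol_weight - mass: value
        for mass, value in spectrum.items()
        if mass <= mol_weight
    }


# Hypothetical encoded spectrum of a compound with molecular weight 88.
encoded = {88: 3, 73: 2, 43: 6}
print(to_neutral_loss(encoded, 88))  # {0: 3, 15: 2, 45: 6}
```

On this scale, homologous compounds that lose the same fragments show peaks at the same positions even though their mass numbers differ.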
PREFILTERS
In most investigated cases unwanted coincidences could be observed, that is,
references having similar spectra but different structures got into the first places.
A number of irrelevant references could be discarded by appropriate prefilters.
With one prefilter the hetero atom contents were limited, e.g. when retrieving alkyl
esters the references had to contain 2 oxygen atoms and no other hetero atoms. With
another prefilter only references having the molecular weight of homologous
compounds (MW + n·14) were considered. The best results were obtained by applying
both prefilters simultaneously.
The effectiveness of retrievals depended on the number of similar spectra of related
compounds present in the library.
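The two prefilters can be sketched as simple predicates over library entries. The `Reference` record, the function names and the five-entry toy library are hypothetical; allowing negative n in the homologue test is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Hypothetical library entry with nominal MW and hetero atom counts."""
    name: str
    mol_weight: int
    hetero_atoms: dict = field(default_factory=dict)


def hetero_filter(ref, required):
    """Keep references whose hetero atom content equals `required` exactly."""
    present = {el: n for el, n in ref.hetero_atoms.items() if n > 0}
    return present == required


def homolog_filter(ref, unknown_mw):
    """Keep references whose MW differs from the unknown by a multiple of 14.

    Assumption: n may be zero or negative as well as positive.
    """
    return (ref.mol_weight - unknown_mw) % 14 == 0


# Toy library for illustration only.
library = [
    Reference("ethyl acetate", 88, {"O": 2}),
    Reference("methyl propionate", 88, {"O": 2}),
    Reference("n-butylamine", 73, {"N": 1}),
    Reference("diethyl ether", 74, {"O": 1}),
    Reference("propyl acetate", 102, {"O": 2}),
]

unknown_mw = 74  # e.g. an ester unknown with nominal MW 74
kept = [
    ref.name
    for ref in library
    if hetero_filter(ref, {"O": 2}) and homolog_filter(ref, unknown_mw)
]
print(kept)
```

In this toy library diethyl ether passes the molecular weight filter alone but is removed by the hetero atom filter, which illustrates why applying both together discards more irrelevant references.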
PARTITIONED FILE STRUCTURE
Some commercial data systems use a sequential database divided into files.
Library spectra are arranged in order of increasing molecular weight (MW). This
type of file partitioning was investigated: the 1st file contained reference spectra
of MW less than 150, the 2nd file spectra of MW = 150-220, etc.
All investigated unknowns had a MW less than 120, thus the relevant reference
spectra were found in the 1st file. Most of the retrievals showed higher score
values for library spectra of the 1st file than for those of the 2nd one.
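The partitioning rule can be sketched as a lookup from molecular weight to file number. Only the first two ranges are given in the text, so the sketch returns `None` for anything heavier rather than guessing further boundaries; treating 220 as belonging to the 2nd file is an assumption, and the function name is hypothetical.

```python
def file_number(mol_weight):
    """Return the library file holding spectra of the given molecular weight."""
    if mol_weight < 150:
        return 1
    if mol_weight <= 220:  # assumption: the upper bound 220 is inclusive
        return 2
    return None  # boundaries of later files are not stated in the text


# All investigated unknowns had a MW below 120, so they map to the 1st file.
print(file_number(88), file_number(150), file_number(300))
```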
COMPUTER TECHNIQUE
Computations were done on an ODRA-1305 computer of CHINOIN Pharmaceutical
Works Ltd. Programs written in FORTRAN were stored on disc, the database on tape.
Spectra of unknowns were fed from cards. The investigated compounds were alkyl esters
(methyl, ethyl, propyl esters of formic, acetic, propionic acids), alkyl ethers
(ethyl-butyl, n-propyl, isopropyl, methyl-butyl, ethyl-isopropyl ethers) and alkyl
amines (n-butyl, isobutyl, tert-butyl, isopropyl-methyl amines). The best 5 or
20 full reference spectra were listed as output. Each of them was separately
merged into and compared with the unknown spectrum. All matching and remaining
peaks (mass numbers, abundances and code values) were presented for human
examination and interpretation of the search results.
CONCLUSIONS
Matching algorithms predetermine which reference spectra reach the highest
score values and get into the first places of output lists. However, similar spectra
do not mean similar structures, because unwanted coincidences may occur. Most
of the first places are filled with related compounds only when enough
similar spectra of isomeric and homologous compounds are included in the library.
Better similarities were found between homologous compounds of close molecular
weights than between those having considerable differences in molecular size.
When the molecular weight and/or hetero atom contents of unknowns are known,
irrelevant library spectra can be discarded.
Sums of scores are high when a lot of large peaks are present in the
investigated mass range. A few missing peaks among a great many large ones make
relatively small negative contributions. A single missing large peak means a
significant decrease of the score value when the investigated mass range is small or
only a few large or medium-sized peaks can be found in it.
REFERENCES
1 R.G. Ridley, in G.R. Waller (Ed.), Biochemical Applications of Mass Spectrometry, John Wiley and Sons, Chichester, 1972, chap. 6.
2 G.M. Pesyna and F.W. McLafferty, in F.C. Nachod, J.J. Zuckerman and E.W. Randall (Eds.), Determination of Organic Structures by Physical Methods, Academic Press, New York, 1976, vol. 6, chap. 2.
3 J.R. Chapman, Computers in Mass Spectrometry, Academic Press, London, 1978.
4 D. Henneberg, Adv. Mass Spectrometry, 8B (1980) 1511-1531.
5 D.P. Martinsen, Appl. Spectroscopy, 35 (1981) 255-266.
6 J. Hollós, Magyar Kémiai Folyóirat, 86 (1980) 321-326.
7 F.P. Abramson, Anal. Chem., 47 (1975) 45-49.