
Peptide Sequence Determination from High-Energy Collision-Induced Dissociation Spectra Using Artificial Neural Networks

Randall E. Scarberry and Zhen Zhang
Department of Biometry and Epidemiology, Medical University of South Carolina, Charleston, South Carolina, USA

Daniel R. Knapp
Department of Pharmacology, Medical University of South Carolina, Charleston, South Carolina, USA

This paper reports a newly developed technique that uses artificial neural networks to aid in the automated interpretation of peptide sequence from high-energy collision-induced dissociation (CID) tandem mass spectra of peptides. Two artificial neural networks classify fragment ions before the commencement of an iterative sequencing algorithm. The first neural network provides an estimation of whether fragment ions belong to 1 of 11 specific categories, whereas the second network attempts to determine to which category each ion belongs. Based upon numerical results from the two networks, the program generates an idealized spectrum that contains only a single ion type. From this simplified spectrum, the program's sequencing module, which incorporates a small rule base of fragmentation knowledge, directly generates sequences in a stepwise fashion through a high-speed iterative process. The results with this prototype algorithm, in which the neural networks were trained on a set of reference spectra, suggest that this method is a viable approach to rapid computer interpretation of peptide CID spectra. (J Am Soc Mass Spectrom 1995, 6, 947-961)

The use of high-energy collision-induced dissociation (CID) tandem mass spectrometry for the determination of peptide sequence provides several advantages over the more traditional method of Edman degradation [1]. A major advantage is the very short data generation time, which is typically only a few minutes. Without some automated means to evaluate the data sets collected, however, the time saving is of little consequence because subsequent manual interpretation of the data can require hours to days of an experienced expert's time. The development of reliable computer software for rapid interpretation of CID spectra is, therefore, essential to exploit the time advantage of the mass spectrometric approach to peptide sequencing.

Computer programs to determine peptide sequence from mass spectral data cannot take the traditional computing approach in which all possible sequences are evaluated because, for all but the spectra of very small peptides, there are too many alternatives for the solution to be tractable. For example, greater than 10^[…] permutations of the 20 commonly occurring amino acids may account for a nominal peptide mass of 1000 u [2]. Therefore, for a program to analyze a peptide of any appreciable size within an acceptable time period, it must employ some heuristic scheme to significantly reduce the avenues of consideration.

Address reprint requests to Daniel R. Knapp, Department of Pharmacology, Medical University of South Carolina, Charleston, SC 29425-2251.

© 1995 American Society for Mass Spectrometry. 1044-0305/95/$9.50. SSDI 1044-0305(95)00477-U

Some early computational approaches limit the possible alternatives nonprogrammatically by requiring additional input from the user before the program begins sequencing. The approach of Hamm et al. [3] requires the identities of all the amino acids present in the sequence. It then generates all possible sequences for evaluation from this reduced set. This approach negates some of the advantages of tandem mass spectrometry, because amino acid analysis must be carried out beforehand to determine the identities of the amino acid constituents. Even without consideration of the added effort and sample consumption entailed, prior knowledge of the amino acid constituents reduces computation time to an acceptable level only if the peptide contains a small number of amino acid residue types.

Received January 23, 1995. Revised May 12, 1995. Accepted May 14, 1995.

More recently, programs have been developed that require no initial input other than the tandem mass spectrometry data. Of these, most limit the number of sequences considered by incremental buildup of the sequences and periodic purging of all but the most promising sequences from the collection of partial sequences. A program of this category, SEQPEP, developed by Johnson and Biemann [4], is probably the most successful peptide sequencing program to date. With this program, subsequences are constructed one residue at a time beginning from the C-terminus. During an iteration, every subsequence is transformed into 20 child subsequences by the addition of the 20 commonly occurring residues. Following extension, the new subsequences are scored by correlation with ions from the observed mass spectrum. This correlation is achieved by using a rule base to determine which ions should be expected given the proposed subsequence. Because the number of subsequences would otherwise expand explosively in such a scheme, all but the 300 highest-scoring subsequences are eliminated at the conclusion of each iteration.

Even though SEQPEP is a relatively fast program, derivation of sequences depends primarily upon knowledge of the fragmentation process being properly encoded into the knowledge base. Because human understanding of the fragmentation process is presently incomplete, SEQPEP must retain a relatively large number (300) of subsequences after each iteration to prevent the inadvertent elimination of the correct partial sequence during purging, which, nevertheless, sometimes occurs.

The program of Hines et al. [5] takes quite a different approach, in which peaks in the spectrum are classified by using a pattern-based algorithm before generation of sequences from the results. Each of the more intense of the original peaks is hypothetically assigned in turn to each of nine sequence-specific ion categories ($a_n$, $b_n$, $c_n$, $d_{n+1}$, $x_m$, $y_m$, $z_{m+1}$, $v_{m+1}$, and $w_{m+1}$) [6] and nine category scores are computed by using a simple function that correlates postulated ion masses with those actually observed. The corresponding y ion mass is computed for each category whose score exceeds a set threshold. These y mass values are accumulated in a table of "y centers" from which sequences are then determined directly. Clearly, the effectiveness of this method hinges upon the ability of the scoring function to correctly classify the original ions. The program attempts to eliminate those peaks that fall into none of the nine classes simply by excluding fragment ions with relative abundances below a constant threshold. A high probability of being in one of these groups, however, may not necessarily follow from a large peak height. Indeed, immonium ions and internal acyl fragment ions are often among the most abundant ions in the spectrum, whereas many of those peaks that belong to one of the nine categories are relatively small.

The sequencing technique presented in this paper incorporates methods similar to both the pattern-based approach of Hines et al. and Johnson and Biemann's iterative technique. As in the program of Hines et al., the ions of the original spectrum are first classified and a new idealized spectrum that consists solely of y-type ions is derived from the results. The idealized spectrum is then used by an iterative sequencing scheme that evaluates partial sequences by using much of the fragmentation knowledge described by Biemann et al. [4, 7-10].

The primary new features of this program are the combined use of the two earlier approaches and the use of two artificial neural networks (ANNs) to classify the original ions. Such networks have been used successfully in many varied applications in recent years [11-14]. ANNs are nonlinear computational models for information processing with structures developed based on certain known properties of biological neural systems. They have built-in mechanisms for self-adaptation in response to the data environment. The type of ANN used by this program maps in a nonlinear fashion an n-element input vector to an m-category classification output vector. The parameters of these two ANNs are determined by exposure to training data to learn the implicit associations between data elements and classification categories. This process does not require the knowledge to be in the form of explicit rules and is, therefore, well suited for pattern recognition tasks in which rules are unknown or are too complicated to be efficiently programmed via a traditional computing approach.

The neural networks used in our program have multilayer perceptron architectures. Such a network is composed of a layer of input nodes followed by one or more layers of processing units (Figure 1). The input nodes simply route the elements of the input vector to each unit of the first layer of processing units through weighted connections. Each processing unit (Figure 2) of the first layer sums its weighted inputs and adds a bias term to compute a net activation signal that is subsequently applied to a nonlinear activation function to compute the unit's output. If another layer exists, the first layer's outputs are similarly routed to its units through more weighted connections. The process continues through the final layer. The outputs of this layer are representations of the network's answers to the classification problem.

Figure 1. An artificial neural network. The neural network receives external values in the form of an n-dimensional input vector. The layer of input nodes (dark circles) distributes the inputs to the processing units (light circles) of the first hidden layer. The outputs of the processing units are distributed to the next layer, and so on, until the output layer. This layer emits an m-dimensional vector whose elements represent class membership. [Diagram labels: inputs $x_1$, $x_2$, ...; input layer; hidden layer; output layer; outputs through $y_m$.]

Figure 2. A neural network processing unit. The unit receives n inputs through weighted links. It adds to the weighted input sum a bias term $w_{0j}$ to compute its net activation $\mathrm{net}_j$. This net activation is used as the variable for the sigmoidal activation function to compute the unit output $y_j$.

Input to unit:

$$\mathrm{net}_j = w_{0j} + \sum_{i=1}^{n} x_i w_{ij}$$

Nonlinear activation function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Output of unit:

$$y_j = f(\mathrm{net}_j)$$

Networks with no intermediate layers of processing units between the input nodes and the output layer can only be used to distinguish linearly separable classes [15]. The addition of one or more "hidden" layers, as they have come to be termed, gives a network the ability to perform nonlinear mappings from input space to output space. In fact, a neural network with a single hidden layer may, in theory, implement any functional that maps between two real-valued spaces, provided that the hidden layer contains a sufficient number of units [16, 17]. (However, not all networks may be trainable in a finite time period.) A second hidden layer is sometimes incorporated in a network to enhance training efficiency.

The knowledge of an ANN is entirely contained in the values of its weights and biases. The training of an ANN is a process of adjusting these parameters until the network yields appropriate results by using a set of representative training pattern pairs. A training pattern consists of an input vector paired with a target vector. The elements of the input vector are actual measurements and attributes of the classification subject coded into numerical form, whereas the target vector is the desired response from the network when it is presented with these values. The target vector elements are the representations of the known class membership. The target vector length is, therefore, equal to the number of possible classification categories and each element is set to a value near 1 to signify positive membership or to a value near 0 to signify nonmembership.

The training method used in our program is based on the generalized delta rule described separately by Werbos [18] and Rumelhart and McClelland [19]. It is essentially a gradient descent algorithm to minimize network mean-squared error. The weights and biases of the network are initialized to small random values (typically between -0.5 and 0.5), and the network then proceeds through a large number of training sessions or epochs. During an epoch, each of the training pairs is individually presented to the network and, by comparison of the actual network outputs to the target outputs, small weight corrections are computed to minimize the square of the difference (error). If at the end of the epoch, the network mean-squared error has not been reduced to an acceptably low threshold, the weights are adjusted by the cumulative weight corrections and another epoch is begun. When training has been successfully concluded, the weight values are fixed and the network may be used to classify unknown patterns.

Methods

The program is structured consecutively into three major sections: (1) neural network classification, (2) construction of an idealized spectrum, and (3) derivation of sequences (Figure 3). The algorithm's foundation is the classification of fragment ion peaks by two artificial neural networks in the first module. These networks provide classification measures that the second module uses to construct an idealized spectrum. The third and final module constructs sequences in a stepwise fashion directly from the idealized spectrum by using the existence of idealized peaks to determine which amino acid residues to add in each incrementation step.

Neural Network Classification

A mass spectrum is read from an ASCII file that contains peak masses paired with associated relative abundances. Typically, the spectra used contained masses recorded with precisions of 0.01 u and accuracies of 0.2 u. Such high quality of mass determination makes it possible to store the peak information as an array of data structures indexed by mass. As the spectrum is read, each peak's mass is transformed into an integer through multiplication by the factor 0.9995 and rounding [5]. The representation of masses as integers, in addition to simplification of data storage, facilitates the mass comparisons between peaks, which are performed later in the sequencing module. The constant 0.9995 approximates the reciprocal of the average mass excess for the 20 commonly occurring amino acid residues.
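The mass-to-integer conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace-separated "mass abundance" line layout, the use of a dictionary in place of an array, and the choice to keep the taller peak when two masses map to the same index are all assumptions.

```python
def mass_to_index(mass, factor=0.9995):
    """Scale a measured mass by the factor from the text and round to an integer."""
    return int(mass * factor + 0.5)


def read_spectrum(lines):
    """Build {integer mass: relative abundance} from 'mass abundance' text lines.

    The line format is an assumption; the paper only says the ASCII file
    pairs peak masses with relative abundances.
    """
    spectrum = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mass, abundance = float(fields[0]), float(fields[1])
        index = mass_to_index(mass)
        # Assumed tie-break: keep the taller peak if two masses share an index.
        spectrum[index] = max(abundance, spectrum.get(index, 0.0))
    return spectrum


peaks = read_spectrum(["350.18 12.5", "", "843.42 100.0"])
```

Indexing by integer mass makes later comparisons between peaks a matter of integer subtraction rather than tolerance-based matching.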


Figure 3. The overall structure of the sequencing program that shows the three program modules. The first module computes classification measures for the individual peaks of the original spectrum. The second module uses the results to construct an idealized spectrum that consists of only y-type ions. Finally, the third module derives the sequences in a stepwise fashion by using the existence of peaks in the idealized spectrum to direct the extension of incomplete sequences.

Once stored as data structures, the peaks undergo a two-phase transformation before they are classified by the neural networks. The first step simply multiplies the peak heights by the constant appropriate to give the height of the protonated molecular ion MH+ a relative abundance of 100.0. In all spectra on which this program was developed, the MH+ ion is the ion of greatest height; therefore, the scaling constant used is simply 100.0 divided by the maximum peak height. The second step is a loop in which the scaled peak heights $h_i$ are transformed into their final values $h_i'$ according to the formula

$$h_i' = \frac{\ln(h_i + 1.6)}{\ln(100 + 1.6)}$$

The denominator value was selected to force the final peak height of the MH+ ion to be exactly 1.0. Because the MH+ ion has the largest height, the rest of the final ion heights fall into the interval [0.1, 1.0]. This logarithmic transformation helps to compensate for the large height disparities that otherwise would exist between the majority of fragment ions and the prominent MH+ peak and other large ions that often occur at the high mass end of the spectrum. The peak height values are confined to a subinterval of [0.0, 1.0] because peak heights later compose a significant fraction of the neural network inputs. (In general, it is a good practice to constrain the neural network inputs to a small interval to prevent numerical overflow in network computations.) The constant 1.6 in the transformation was selected by trial and error to cause the final peak heights of typical spectra to be well distributed throughout the interval of values. The effect of this second stage of preprocessing on the peak height distributions can be seen clearly in the center graph of Figure 4.
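A short sketch of the two-phase height transformation may make the interval bounds concrete. The function name is hypothetical, and the sketch assumes, as the text states for the development spectra, that the tallest peak is the MH+ ion.

```python
import math


def transform_heights(heights, offset=1.6):
    """Scale so the tallest peak is 100.0, then apply the logarithmic transform.

    Assumes the tallest peak is the MH+ ion, as in the spectra described
    in the text; `offset` is the constant 1.6 from the formula.
    """
    scale = 100.0 / max(heights)
    denominator = math.log(100.0 + offset)
    return [math.log(h * scale + offset) / denominator for h in heights]


transformed = transform_heights([50.0, 200.0, 0.0])
```

A peak of zero height maps to ln(1.6)/ln(101.6), roughly 0.10, which is where the lower bound of the interval [0.1, 1.0] comes from; the tallest peak maps to exactly 1.0.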

Figure 4. Three versions of the spectrum for the peptide YPVEPF. On top is a plot of the original spectrum (magnified by 10). The center shows the same spectrum after logarithmic transformation. The idealized spectrum is shown at bottom with a dotted trace for the true sequence that connects prominent y ions.

Finally, the two neural networks of the first module are utilized. The purpose of these networks is to compute ion classification measures from which the second module may construct the idealized spectrum. For a given peak $P_i$, the first neural network has a single output element $c_{i0}$ that is indicative of whether or not $P_i$ belongs to one of 11 ion categories. The second network assigns to $P_i$ 11 corresponding class membership scores $c_{ij}$, $j = 1, \ldots, 11$. Both of these classification networks operate on the same input vector format. The elements of the input vector are values that are assumed to be pertinent to determine whether the ion under consideration is one of the classification categories and, if so, to which of these categories the ion belongs. (See further discussion that follows.)
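The arrangement of two networks that share one input vector can be illustrated with a small multilayer perceptron forward pass built from sigmoid units. This is only a sketch: the input length, the hidden layer size, and the random (untrained) weights are placeholders, and none of the names come from the original program.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Each unit is [bias, w_1, ..., w_n] with small random initial values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, x):
    """Propagate input vector x through successive layers of sigmoid units."""
    for layer in layers:
        x = [sigmoid(unit[0] + sum(w * xi for w, xi in zip(unit[1:], x)))
             for unit in layer]
    return x


rng = random.Random(0)
N_INPUTS = 40   # hypothetical input vector length
N_HIDDEN = 8    # hypothetical hidden layer size

# First network: one output (is the peak in any of the 11 categories?).
detector = [make_layer(N_INPUTS, N_HIDDEN, rng), make_layer(N_HIDDEN, 1, rng)]
# Second network: 11 outputs (one membership score per category).
classifier = [make_layer(N_INPUTS, N_HIDDEN, rng), make_layer(N_HIDDEN, 11, rng)]

x = [rng.random() for _ in range(N_INPUTS)]   # stand-in input vector
c_i0 = forward(detector, x)[0]
c_ij = forward(classifier, x)
```

Because every unit ends in a sigmoid, all outputs lie strictly between 0 and 1, which is why the text can later treat them loosely as likelihoods.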

The classification categories referred to are the six major ion types ($a_n$, $b_n$, $c_n$, $x_m$, $y_m$, and $z_{m+1}$) that result from cleavages along the peptide backbone and the ions $d_{n+1}$, $v_{m+1}$, and $w_{m+1}$ that result from side chain losses. (There are 11 categories rather than 9, because $d_{n+1}$ and $w_{m+1}$ ions are computed by using beta substituents equal to both H and CH3.) A family that consists of all of these ion types can be associated with a specific cleavage location along the peptide backbone (Figure 5). All the ions in the family are centered about the cleavage location that gives rise to the $y_m$ and $b_n$ ions. From the ion structures illustrated in Figure 5, one may easily determine the mass relationship between ions in a family. For example, the $x_m$ ion differs from the $y_m$ ion in that it possesses an additional carbon and oxygen atom, but does not retain the two protons. Its mass is, therefore, the mass of the $y_m$ ion plus 26. The second column of Table 1 gives the simple formulas that relate ion family masses to the $y_m$ ion mass. The subscripts n and m are used in the same ion family because N-terminal ions $a_n$, $b_n$, $c_n$, and $d_{n+1}$ will, in general, have a different subscript from the C-terminal ions of the same family. The two subscripts will always add to the total number of amino acid residues that compose the peptide.
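The Table 1 mass relations can be written as a pair of small functions, one that expands a y ion mass into its family and one that inverts the relation for a hypothesized ion type. The offsets below are the Table 1 formulas as reconstructed from the garbled source and checked against the masses shown in Figure 6; the function names and the dictionary layout are illustrative assumptions.

```python
def family_from_y(mh, my, rb=1):
    """Ion family masses for a y ion of mass `my` in a peptide with MH+ = `mh`.

    `rb` is the beta substituent mass (1 for H, 15 for CH3).
    """
    return {
        "a": mh - my - 27, "b": mh - my + 1, "c": mh - my + 18,
        "d": mh - my + 43 + rb,
        "v": my + 55, "w": my + 53 + rb, "x": my + 26, "y": my, "z": my - 16,
    }


def implied_y_mass(ion_type, mass, mh, rb=1):
    """y ion mass implied by assuming the peak at `mass` is of `ion_type`."""
    c_terminal = {"v": 55, "w": 53 + rb, "x": 26, "y": 0, "z": -16}
    n_terminal = {"a": -27, "b": 1, "c": 18, "d": 43 + rb}
    if ion_type in c_terminal:
        return mass - c_terminal[ion_type]
    return mh + n_terminal[ion_type] - mass


# The 350 u ion in a peptide with MH+ = 843 u, under two hypotheses.
as_a = family_from_y(843, implied_y_mass("a", 350, 843))
as_x = family_from_y(843, implied_y_mass("x", 350, 843))
```

The two hypotheses place the y ion at different masses (466 u and 324 u), so the rest of each family lands at entirely different positions in the spectrum.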

If an ion at a given mass location were postulated to belong to a given classification, the other ions in the same family could be computed by using the formulas of Table 1. However, if the ion were assigned to a different classification, an entirely different set of ion family masses would result. For example, for a peptide with an MH+ mass of 843 u, if an ion with a nominal mass of 350 u is assumed to be an ion of type $a_n$, the ion family shown in the top of Figure 6 results, whereas the ion family in the bottom of Figure 6 results if the ion is assumed instead to be of type $x_m$. Thus, by hypothetical assignment of an ion to each of the 11 ion types, 11 distinct hypothetical ion families can be derived. (Note that 11 types result by considering $R_b$ = H and CH3 for d and w ions.)

If one were then to attempt to manually classify the ion, a likely method would be to correlate each of these hypothetical ion families to ions actually present in the spectrum. One might conclude that the ion belongs either to the hypothetical class that best correlates with the actual spectrum or to none of the classes if all correlate poorly. This reasoning was used to structure the input vector format for the two neural networks. The majority of the input vector elements are the actual peak heights at the mass locations that correspond to the hypothetical ion family masses. The networks may, therefore, "observe" the hypothetical class correlations themselves from the actual data.

Figure 5. The nine types of fragmentation ions that compose an ion family.

Table 1. Ions that constitute a cleavage family

Ion type      Ion mass (a)            Ion structure
Parent        MH                      [H-(NH-CHR-CO)_{n-1}-NH-CHR_n-CO-NH-CHR_m-CO-(NH-CHR-CO)_{m-1}-OH]H+
a_n           MH - m_y - 27           H-(NH-CHR-CO)_{n-1}-NH=CHR_n
b_n           MH - m_y + 1            H-(NH-CHR-CO)_{n-1}-NH-CHR_n-CO
c_n           MH - m_y + 18           [H-(NH-CHR-CO)_n-NH2]H+
d_{n+1}       MH - m_y + 43 + R_b     [H-(NH-CHR-CO)_n-NH-CH=CHR_b]H+
v_{m+1}       m_y + 55                [NH=CH-CO-(NH-CHR-CO)_m-OH]H+
w_{m+1}       m_y + 53 + R_b          [CHR_b=CH-CO-(NH-CHR-CO)_m-OH]H+
x_m           m_y + 26                CO-(NH-CHR-CO)_m-OH
y_m           m_y                     [H-(NH-CHR-CO)_m-OH]H+
z_{m+1}       m_y - 16                [CHR_m-CO-(NH-CHR-CO)_{m-1}-OH]H+

(a) MH represents the total peptide mass; R_b is the beta substituent mass. The subscripts n and m add to the sequence length of the peptide.


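Figure 6. Two ion family interpretations for an ion with mass 350 u from a peptide with MH+ that has mass 843 u. The top shows the ion family that results if the ion is considered to be an $a_n$ type. The entirely different family of the bottom diagram results when the ion is assumed to be in the $x_m$ category. (For simplicity, only the $d_{n+1}$ and $w_{m+1}$ ions that use $R_b$ = H are shown in both diagrams.) [Top panel masses: $a_n$ 350, $b_n$ 378, $c_n$ 395, $d_{n+1}$ 421, $z_{m+1}$ 450, $y_m$ 466, $x_m$ 492, $w_{m+1}$ 520, $v_{m+1}$ 521. Bottom panel masses: $z_{m+1}$ 308, $y_m$ 324, $x_m$ 350, $w_{m+1}$ 378, $v_{m+1}$ 379, $a_n$ 492, $b_n$ 520, $c_n$ 537, $d_{n+1}$ 563.]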

Also deemed pertinent to a peak's classification is its proximity to either end of the spectrum. Non-sequence-specific fragments tend to be located more prominently in these regions. Therefore, the input vector also contains two measures indicative of the closeness of the peak to the low and high mass ends of the spectrum. Each of these measures is scaled linearly to correspond to a 0-200-u range. For example, if the ion possesses a mass of 100 u, the distance measure that indicates the proximity to the low mass end will be 0.5; if the ion mass is 200 u or larger, the distance measure will be 1.0. Again, because these measures are neural network inputs, the proximity measures are constrained to the interval [0.0, 1.0]. The upper bound of 200 u was introduced because it was reasoned that for ions located more than 200 u from the endpoint, the exact proximity bears little relevance to classification.
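The two proximity measures can be sketched in a few lines. The low-mass measure follows the worked example in the text; measuring the high-mass distance from the MH+ mass is an assumption, since the text does not say which point marks the high end of the spectrum. The function name is hypothetical.

```python
def proximity_measures(mass, mh, span=200.0):
    """Return (low-end, high-end) proximity measures clipped to [0.0, 1.0].

    The low-end measure is the distance from 0 u; the high-end measure is
    assumed here to be the distance from the MH+ mass `mh`.
    """
    low = min(mass / span, 1.0)
    high = min((mh - mass) / span, 1.0)
    return low, high


example = proximity_measures(100, 843)
```

Clipping at 200 u means that every peak in the interior of the spectrum presents the same value (1.0) to the networks, so only peaks near either end are distinguished by these inputs.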

The rest of the input vector elements are statistical measures that characterize the overall spectrum because it was thought that the nature of the spectrum might have some bearing on fragment ion classification. Two such measures are the mean peak height and the peak height standard deviation. Also included are the peak mean mass, the total number of peaks divided by the length of the spectrum (peak density), and a "center of height" $m_h$ given by

$$m_h = \frac{\sum_i m_i h_i}{\sum_i h_i}$$

where $m_i$ and $h_i$ represent the mass and peak height of the ith fragment ion. This term is similar to the center of mass of a set of discrete particles arranged in a straight line. However, in this case, the locations are specified by the peak masses whereas the heights are the weighting factors analogous to the particle masses. For their values to be in the interval [0.0, 1.0], the mean peak mass and $m_h$ are both divided by the length of the spectrum, MH+.

Finally, the input vectors contain 20 elements representative of 2 histograms each with 10 divisions. The first is a histogram of the peak height distribution; the other is a histogram of peak masses multiplied by their associated heights. The latter histogram provides the networks with an overall shape profile of the spectrum. The transformed spectrum of the tripeptide MRF and its shape histogram are shown in Figure 7. Both of the histograms are normalized so that the sum of their elements is 1.0.

Figure 7. Transformed spectrum (top) and shape histogram (bottom) for the tripeptide MRF. The elements of the shape histogram are used as neural network inputs to convey the general height distribution of the spectrum.

The classification itself is performed in a single pass through the array of spectral peaks. For each peak, other than the MH+ ion, the input vector is constructed and presented to each of the neural networks. The single output of the first network along with the 11 output elements of the second network are stored as part of the associated peak data structures in the array.

Construction of Idealized Spectrum

The neural network classification results are the raw material used to construct the idealized spectrum. The idealized spectrum is composed only of y-type ions, but any of the other ion types just as easily could have been used because sequences are derived by observation of mass differences between adjacent ions of the same type. The idealized spectrum is represented in a separate array of data structures with a length equal to that of the original spectrum. The new spectrum is built by once again performing a loop through the peaks of the actual spectrum to compute contributions for the initially empty idealized array. This process is described in the following lines of pseudocode:

For each peak, P_i, do:
    For each classification category, j, do:
        Calculate mass, m_ij, of associated y ion
        If m_ij is valid, then augment height in idealized spectrum at index m_ij by (c_i0 * c_ij)
    end loop;
end loop;

In the inner loop, $c_{i0}$ and $c_{ij}$ represent the results of the neural networks. $c_{i0}$ is the output of the first network for peak $P_i$, and $c_{i1}$ through $c_{i,11}$ are the outputs given by the second network.
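The pseudocode loop can be turned into a runnable sketch as follows. It is an illustration under stated assumptions rather than the authors' implementation: the order of the 11 categories, the tuple layout of a classified peak, the `is_valid` callback (standing in for the lookup table of allowed y masses), and the final division by the tallest idealized peak are all choices made here.

```python
# Assumed category order: (ion type, beta substituent mass); d and w appear
# twice, once for R_b = H (1) and once for R_b = CH3 (15), giving 11 categories.
CATEGORIES = [("a", 0), ("b", 0), ("c", 0), ("x", 0), ("y", 0), ("z", 0),
              ("v", 0), ("d", 1), ("d", 15), ("w", 1), ("w", 15)]


def implied_y_mass(ion, rb, mass, mh):
    """y ion mass implied by assuming the peak at `mass` is of type `ion`."""
    c_terminal = {"x": 26, "y": 0, "z": -16, "v": 55, "w": 53 + rb}
    n_terminal = {"a": -27, "b": 1, "c": 18, "d": 43 + rb}
    if ion in c_terminal:
        return mass - c_terminal[ion]
    return mh + n_terminal[ion] - mass


def build_idealized(peaks, mh, is_valid):
    """peaks: list of (integer mass, c_i0, [c_i1..c_i11]) for classified peaks."""
    ideal = [0.0] * (mh + 1)
    for mass, c0, scores in peaks:
        for (ion, rb), cj in zip(CATEGORIES, scores):
            my = implied_y_mass(ion, rb, mass, mh)
            if 0 <= my <= mh and is_valid(my):
                ideal[my] += c0 * cj
    top = max(ideal)
    # Scale to [0.0, 1.0]; leave an all-zero spectrum untouched.
    return [h / top for h in ideal] if top > 0 else ideal


# Two peaks whose best hypotheses (a at 350 u, x at 492 u) agree on y = 466 u.
peaks = [(350, 0.9, [1.0] + [0.0] * 10),
         (492, 0.8, [0.0, 0.0, 0.0, 1.0] + [0.0] * 7)]
ideal = build_idealized(peaks, 843, lambda m: True)
```

Contributions from different original peaks accumulate at the same idealized index when their hypotheses agree, which is how a consistent ion family reinforces a single y peak.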

Within the inner loop, a validation check is used to reduce the overall number of idealized peaks. A peak at mass $m_{ij}$ is not incremented if it is impossible for a y peak to exist with that mass. A mass of 60 u would not be incremented, for example, because there is no combination of amino acid residues that can yield such a mass. This check is performed quickly by using a static lookup table constructed by another program. (It should be noted that this initial implementation of the program is designed to deal only with unmodified peptides; future extensions of the program will require expansion of the lookup table.)
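One way such a lookup table could be built is with a simple reachability pass over nominal residue masses. The paper does not describe how its table was generated, so the dynamic programming approach, the function name, and the constant names below are assumptions; the offset of 19 u is the combined terminal-group mass quoted later in the text.

```python
# Distinct nominal residue masses of the 20 common amino acids
# (Leu/Ile share 113; Gln/Lys share 128).
RESIDUE_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186]
Y_OFFSET = 19  # terminal groups plus proton, per the text's "mass plus 19"


def build_valid_y_table(max_mass):
    """table[m] is True if some combination of residues gives a y ion of mass m."""
    reachable = [False] * (max_mass + 1)
    reachable[0] = True  # the empty combination
    for total in range(1, max_mass + 1):
        reachable[total] = any(
            total >= r and reachable[total - r] for r in RESIDUE_MASSES
        )
    smallest = Y_OFFSET + min(RESIDUE_MASSES)
    return [m >= smallest and reachable[m - Y_OFFSET]
            for m in range(max_mass + 1)]


table = build_valid_y_table(200)
```

With this table, 60 u is rejected, matching the example in the text, while 76 u (a single glycine) is accepted.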

Each original peak thus contributes to as many as 11 peaks in the idealized spectrum, that is, the y peaks that would correspond to the original peak if it were any of the 11 peak types. Ideally, however, for a peak that belongs to one of the fragmentation categories, one contribution will be significantly larger than all the rest. If the neural networks have performed the peak's classification well, the peak's contribution to a given idealized peak is proportional to two factors: (1) the likelihood that the peak $P_i$ is not an extraneous peak as expressed by $c_{i0}$ and (2) the likelihood that the peak is from the class used to derive the mass $m_{ij}$ of the idealized peak as expressed by $c_{ij}$. (Strictly speaking, the neural network outputs are not probability measures; because they are confined to the interval [0.0, 1.0], however, it is convenient to view them as such.) Even though the idealized spectrum will contain many peaks, the cumulative effect from networks with reasonably accurate classification results is the predominance of a relative few. As a final step, the idealized spectrum is scaled to the interval [0.0, 1.0]. The idealized spectrum of the peptide YPVEPF is shown in the lower graph of Figure 4 with a dotted line tracing over the true sequence.

Sequence Derivation

The sequencing module, which is diagramed in Figure 8, is the most programmatically complex of the three major sections. Both complete and partial sequences are stored in a variable length list, which is initialized with a single empty or null sequence. The algorithm then performs a loop in which sequences are incrementally built up and exits when, at the end of an iteration, all sequences in the sequence list account for the mass of the peptide. Among the 20 commonly occurring amino acids, there are two pairs of residues with the same nominal masses: the isomers isoleucine and leucine, and glutamine and lysine, which are isobaric. Within the sequencing loop, leucine and isoleucine are not distinguished separately; they are represented by a single extension placeholder Lxx. Glutamine and lysine are, however, treated separately because of differences in the manner in which they influence fragmentation. More details that concern the way this pair of amino acids is handled are given subsequently. When all sequences in the list are complete, the program exits from the sequencing loop and attempts to differentiate between leucine and isoleucine before the final results are displayed.

Figure 8. Flow diagram of the sequence derivation module. The seed sequence that contains no amino acid residues is provided to start the first iteration of the sequencing loop. The incomplete sequences are extended by each amino acid for which there is a supporting y-type ion in the idealized spectrum. The extended sequences are assigned scores that allow the list to be sorted in descending order. At the conclusion of each iteration all but the top 50 sequences are purged. The loop is exited when all sequences in the list have masses that match the mass of the peptide. The program then differentiates between isoleucine and leucine, resorts the list, and displays the sequence list.

Sequence extension is done a single amino acid residue at a time from the N-terminus. Each sequencing iteration first extends the incomplete sequences in the sequence list. An incomplete sequence is removed from the sequence list and transformed into 19 child sequences by the addition of each of the 19 amino acid residue masses on its C-terminus. A given child sequence is discarded unless it satisfies one of two criteria: (1) it is a complete sequence or (2) the latest amino acid extension is supported by the existence of an idealized y peak. The child sequence is complete (that is, it accounts for the overall mass of the peptide) if its mass plus 19 (the combined mass of the C- and N-terminal groups) is equal to $\mathrm{MH}^+$. If it is not complete, the mass for the idealized y peak that results from the latest extension is computed easily by subtracting the child sequence's mass from $\mathrm{MH}^+$. If, in the idealized spectrum, a y peak exists at this mass, the sequence is added to the sequence list.
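The extension step can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the program's actual code: the residue table lists the standard nominal (integer) residue masses with Lxx standing for leucine/isoleucine, the function name `extend` is hypothetical, and an idealized y peak is taken to "exist" when its height is greater than zero.

```python
# Standard nominal residue masses (u); Lxx is the leucine/isoleucine placeholder.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "Lxx": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
TERMINAL_MASS = 19  # combined mass of the C- and N-terminal groups


def extend(sequence, mh_plus, idealized):
    """Return the surviving child sequences of one incomplete sequence.

    sequence: tuple of residue names built so far (N-terminus first).
    mh_plus: integer mass MH+ of the peptide.
    idealized: dict mapping integer mass -> idealized y peak height.
    """
    parent_mass = sum(RESIDUE_MASS[r] for r in sequence)
    children = []
    for residue, mass in RESIDUE_MASS.items():
        child_mass = parent_mass + mass
        if child_mass + TERMINAL_MASS == mh_plus:
            children.append(sequence + (residue,))  # criterion 1: complete
        elif child_mass + TERMINAL_MASS < mh_plus:
            y_mass = mh_plus - child_mass
            # Assumption: any nonzero idealized height counts as support.
            if idealized.get(y_mass, 0.0) > 0.0:
                children.append(sequence + (residue,))  # criterion 2
    return children
```

For a hypothetical dipeptide AS ($\mathrm{MH}^+$ = 71 + 87 + 19 = 177), the first extension survives only for alanine when the idealized spectrum holds a y peak at 177 - 71 = 106, and the second extension completes the sequence with serine.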

Because the sequence list is periodically purged of all but the most promising sequences, newly extended sequences must be given some measure of their worth. This is achieved by assignment of a score to each newly extended sequence as it is added to the sequence list. A sequence's score $S$ is a linear combination of three variables given by

$$S = c_1 r_y + c_2 r_i + c_3 r_p$$

where $c_1$, $c_2$, and $c_3$ are constants. The simplicity of this equation is deceptive, because a great deal of complexity is hidden in the computations of the three variables. These variables are (1) the idealized peak height ratio $r_y$, (2) the original peak height ratio $r_i$, and (3) the peak presence ratio $r_p$.

The idealized peak height ratio is a measure of how well the sequence is corroborated by information in the idealized spectrum. Because the idealized spectrum is built solely from the outputs of the neural networks, this variable, in essence, expresses the degree to which the neural networks agree with the sequence interpretation. The formula for $r_y$ is

$$r_y = \frac{y_{\mathrm{in}}}{y_{\mathrm{out}} + 0.1}$$

The computation of $r_y$ takes into account the idealized y peaks that match the expected y peaks, given the proposed sequence, and it also takes into account those that do not match. $y_{\mathrm{in}}$ is the average height of peaks that correlate with the sequence, and $y_{\mathrm{out}}$ is the average height of the peaks that the sequence passes over. For example, $y_{\mathrm{in}}$ for the sequence shown in the lower graph of Figure 4 is the average height of the five idealized peaks connected by the sequence trace; $y_{\mathrm{out}}$ is the average height of the remaining peaks. (The reason that 0.1 is added to the denominator is to prevent division by zero errors on the occasions when $y_{\mathrm{out}}$ equals zero.)
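As a small worked illustration of the idealized peak height ratio and of the linear score it feeds into, the following Python sketch may help. The function names are hypothetical, empty lists are treated as an average of zero (an assumption), and the default constants are the fixed values 2.0, 0.55, and 2.0 reported later in the paper.

```python
def idealized_peak_height_ratio(matched_heights, passed_over_heights):
    """r_y = y_in / (y_out + 0.1).

    matched_heights: idealized peak heights that correlate with the sequence.
    passed_over_heights: idealized peak heights the sequence passes over.
    Assumption: an empty list contributes an average height of 0.0.
    """
    y_in = sum(matched_heights) / len(matched_heights) if matched_heights else 0.0
    y_out = (sum(passed_over_heights) / len(passed_over_heights)
             if passed_over_heights else 0.0)
    return y_in / (y_out + 0.1)  # 0.1 guards against division by zero


def sequence_score(r_y, r_i, r_p, constants=(2.0, 0.55, 2.0)):
    """Linear combination S = c1*r_y + c2*r_i + c3*r_p."""
    c1, c2, c3 = constants
    return c1 * r_y + c2 * r_i + c3 * r_p
```

For example, matched heights of 1.0 and 0.5 against passed-over heights of 0.1 and 0.3 give $r_y = 0.75 / (0.2 + 0.1) = 2.5$.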

The far more intricate computations of $r_i$ and $r_p$ are performed in tandem. Both of these are measures of how well the original spectrum correlates with the proposed sequence. $r_i$ is the average height of peaks actually present in the original spectrum that are expected to exist given the sequence. The term $r_p$ is the fraction of these expected peaks that are present. In these computations, an expected peak is not considered to be present unless its height exceeds a set threshold.

The difficulty in computing $r_i$ and $r_p$ stems from determination of which peaks to expect. To accomplish this task, the program employs a rule base that embodies much of the fragmentation knowledge described in the literature. When a sequence is scored, an array of flag variables (the expected ion array) is used to record whether a peak of a given category is to be expected at each mass location. For example, if in accordance with the rule base a $b_n$ ion is to be expected at mass location 124 u in the original spectrum, then a b ion flag will be set at index 124 in this array. (Because of the programming techniques used, more than one category of peak can be flagged for a given mass location.) In each sequence scoring, all set flags are initially flushed from the array. The scoring procedure iterates through each of the proposed sequence's cleavage sites and sets the ion flags as appropriate. After all of the flags are set, it is a simple matter for the program to loop through each mass location in the expected ion array to check the original spectrum for a peak whenever it encounters a flag.
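One way to picture the expected ion array is as a list of bit flags indexed by integer mass. The Python sketch below is illustrative only: the `Ion` flag class (with just three of the categories), the function names, and the choice to count each flagged mass location once are assumptions, not details taken from the program.

```python
from enum import IntFlag


class Ion(IntFlag):
    """Illustrative subset of ion categories; each category is one bit."""
    A = 1
    B = 2
    Y = 4


def new_expected_array(size):
    """Return an array with all flags flushed (no ion expected anywhere)."""
    return [Ion(0)] * size


def original_peak_ratios(expected, spectrum, threshold):
    """Compute (r_i, r_p) from the flags and the original spectrum.

    expected: list of Ion flags indexed by integer mass.
    spectrum: dict mapping integer mass -> original peak height.
    threshold: minimum height for an expected peak to count as present.
    Assumption: a mass location is counted once, however many flags it has.
    """
    n_expected = 0
    present = []
    for mass, flags in enumerate(expected):
        if not flags:
            continue
        n_expected += 1
        height = spectrum.get(mass, 0.0)
        if height > threshold:
            present.append(height)
    r_i = sum(present) / len(present) if present else 0.0
    r_p = len(present) / n_expected if n_expected else 0.0
    return r_i, r_p
```

Setting `expected[124] |= Ion.B` mirrors the b ion example in the text, and a second `|=` on the same index shows how more than one category can be flagged at one mass location.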

The sequence-specific ions for which the program checks to score a sequence are the categories $a_n$, $b_n$, $c_n$, $d_n$, $v_m$, $w_m$, $x_m$, $y_m$, $y_m - 2$, $z_m$, and $z_m + 1$. For a given cleavage along the peptide backbone, the ions to be flagged in the expected ion array are determined by using the following rules:

1. Ions of type $a_n$ are not flagged when the cleavage site is N-terminal to a basic amino acid (arginine, histidine, lysine) unless there is another basic amino acid located further toward the N-terminus of the sequence [4]. Also, $a_n$ ions are not flagged when the cleavage site is C-terminal to glycine [9].

2. An ion of type $c_n$ is flagged only if the associated $a_n$ and $b_n$ ions are both flagged and actually present in the spectrum [4] and if the amino acid C-terminal to the cleavage is threonine, tryptophan, lysine, or serine [10].

3. An ion of the $d_n$ category is flagged only if the associated $a_n$ ion is flagged and observed in the spectrum. In addition, the amino acid on the side N-terminal to the cleavage cannot have a side chain with less than two carbon atoms (glycine and alanine), cannot be aromatic (phenylalanine, tyrosine, histidine, tryptophan), and cannot be cyclic (proline) [4, 7].

4. The only cleavage sites for which $y_m$ ions are not flagged are those C-terminal to proline [9].

5. The $y_m - 2$ ion is only flagged for cleavages N-terminal to proline [9].

6. The $z_m$ and $z_m + 1$ ions are flagged unless the cleavage site is N-terminal to proline [4].

7. For ions of type $v_m$ to be flagged, a basic amino acid must be located in the sequence C-terminal to the point of cleavage. Also, the C-terminal amino acid extension must be an aromatic amino acid (phenylalanine, tyrosine, histidine, tryptophan), aspartic acid, or an amino acid with a β substituent (serine, valine, isoleucine, threonine) [4, 7].

8. $w_m$ ions are flagged if a basic amino acid is located C-terminal to the point of cleavage and if the C-terminal amino acid is not aromatic (phenylalanine, tyrosine, histidine, tryptophan) and is not an amino acid with less than two carbon atoms in the side chain (glycine, alanine) [4, 8].

9. Ions in the categories $b_n$ and $x_m$ are flagged for every cleavage site.
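The simplest of these rules (4, 5, 6, and 9) depend only on whether proline sits next to the cleavage, so they can be sketched compactly in Python. This is a hedged illustration: the function name and string labels are hypothetical, one-letter residue codes are assumed, and "C-terminal to proline" is read here as meaning that the residue immediately N-terminal to the cleavage is proline (and conversely for "N-terminal to proline"). The remaining rules, which need the spectrum and the positions of basic residues, are omitted.

```python
def flag_proline_dependent_ions(n_side, c_side):
    """Return the ion labels flagged by rules 4, 5, 6, and 9 for one cleavage.

    n_side: one-letter code of the residue immediately N-terminal to the cleavage.
    c_side: one-letter code of the residue immediately C-terminal to the cleavage.
    Assumed reading: a site "C-terminal to proline" has n_side == "P";
    a site "N-terminal to proline" has c_side == "P".
    """
    flagged = {"b", "x"}              # rule 9: flagged for every cleavage site
    if n_side != "P":                 # rule 4: y not flagged C-terminal to proline
        flagged.add("y")
    if c_side == "P":                 # rule 5: y-2 only N-terminal to proline
        flagged.add("y-2")
    else:                             # rule 6: z and z+1 unless N-terminal to proline
        flagged.update({"z", "z+1"})
    return flagged
```

In a fuller implementation each returned label would be converted to a mass and written into the expected ion array rather than returned as a set.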


For an incomplete sequence, the rules that require knowledge of the amino acid on the C-terminal side of the cleavage cannot be completely implemented for the cleavage site associated with the final amino acid extension. In general, the ions whose occurrences depend heavily on the presence of particular amino acids C-terminal to the cleavage are not flagged for this site ($c_n$, $y_m - 2$, $v_m$, and $w_m$). Ions of the types $a_n$, $z_m$, and $z_m + 1$, however, are flagged if they satisfy the other rule stipulations.

Unlike the $d_n$ and $w_m$ computations in construction of the neural network input vectors, which use two beta substituent (R) values of H and CH3, the scoring procedure uses the R substituents appropriate for the amino acids adjacent to the cleavage site. These are OH for threonine, C2H5 for isoleucine, CH3 for valine, threonine, and isoleucine, and H for the amino acids not branched at the beta carbon. The reason for this difference is that sequences are scored with a priori knowledge of the amino acids involved in the proposed sequence. The input vectors are constructed with no knowledge of the amino acids so, not to enlarge the input vector inordinately, only the two most common values for R are used.

In addition to the sequence-specific ion types already described, the scoring process also flags the expected ion array for the presence of some other ion categories. These are the amino acid immonium ions, ions that result from the loss of single amino acid side chains, and internal acyl fragment ions. The first two of these are especially helpful as indications of the presence of a particular amino acid in the peptide [1].

An additional explanation must be given with regard to the manner in which sequences are extended by glutamine and lysine. The fact that lysine is a basic amino acid results in its presence having a dramatic effect upon which part of the cleaved peptide will retain the positive charge. Its presence C-terminal to the cleavage results in an increased likelihood that the C-terminal cleavage ions ($y_m$, $x_m$, $z_m$) will result, or vice versa if it is located N-terminal to the cleavage. Both glutamine and lysine are, therefore, included in sequence extension, but, of the two new sequences, the one with the lower score is immediately discarded to reduce the population of the sequence list. In earlier versions of the program in which both sequences were retained, it was observed that the two sequences usually, with occasional noteworthy exceptions, had almost identical scores.

When all the incomplete sequences have been replaced by newly extended and scored sequences, all but those with the 50 highest scores are deleted. If incomplete sequences are still present among the remainder, the sequence extension loop repeats; otherwise, the algorithm breaks out of the loop and proceeds with the differentiation of leucine and isoleucine.
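The extend, score, and purge cycle is in effect a beam search with a beam width of 50. A minimal Python sketch follows, assuming hypothetical callables `extend_and_score` (returning scored child sequences for one incomplete sequence) and `is_complete`; the special handling of glutamine and lysine is left out for brevity.

```python
def purge(scored, keep=50):
    """Keep only the `keep` highest-scoring (score, sequence) entries."""
    return sorted(scored, key=lambda item: item[0], reverse=True)[:keep]


def sequencing_loop(extend_and_score, is_complete, keep=50):
    """Iteratively extend sequences until every retained one is complete.

    extend_and_score: hypothetical callable, sequence -> list of
        (score, child_sequence) pairs for the surviving children.
    is_complete: hypothetical callable, sequence -> bool.
    """
    sequences = [(0.0, ())]  # seed: a single empty sequence
    while not all(is_complete(seq) for _, seq in sequences):
        next_list = []
        for entry in sequences:
            if is_complete(entry[1]):
                next_list.append(entry)  # complete sequences are carried over
            else:
                next_list.extend(extend_and_score(entry[1]))
        sequences = purge(next_list, keep)
    return sequences
```

Because a purged partial sequence can never be recovered, the beam width trades running time against the risk of discarding the correct interpretation early.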

Because leucine and isoleucine have different beta substituents, the placement of their observed $w_m$ and $d_n$ ions may be used to distinguish between them. Leucine's single beta substituent is a hydrogen atom, whereas isoleucine has both CH3 and C2H5. Because both the $w_m$- and $d_n$-type ions involve the loss of a single substituent, the masses at which these ions are found can be computed by using the equations of Table 1. For each sequence amino acid labeled as Lxx in the sequencing loop, the differentiation step computes two measures. The first, $h_{\mathrm{Leu}}$, is the average height of the two peaks in the original spectrum found at the masses for the $d_n$ and $w_m$ ions computed by using H for R. Similarly, the second, $h_{\mathrm{Ile}}$, is the average height for the four peaks computed by using both CH3 and C2H5 for R. Then the ratio of these two measures is compared against a constant threshold $T_{\mathrm{Leu/Ile}}$ as follows to arrive at a classification decision:

If $h_{\mathrm{Leu}} / h_{\mathrm{Ile}} > T_{\mathrm{Leu/Ile}}$, classify as leucine.

If $h_{\mathrm{Leu}} / h_{\mathrm{Ile}} < (T_{\mathrm{Leu/Ile}})^{-1}$, classify as isoleucine.

If $(T_{\mathrm{Leu/Ile}})^{-1} \le h_{\mathrm{Leu}} / h_{\mathrm{Ile}} \le T_{\mathrm{Leu/Ile}}$, classify as indeterminate.
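The three-way decision can be written as a short Python function. This is a sketch under assumptions: the function name is hypothetical, the threshold value is not given in this section and is therefore a required argument (assumed to exceed 1), and the handling of zero heights is an added guard rather than documented behavior.

```python
def classify_lxx(h_leu, h_ile, threshold):
    """Classify an Lxx residue from the two average peak heights.

    h_leu: average height of the peaks computed with H as the substituent.
    h_ile: average height of the peaks computed with CH3 and C2H5.
    threshold: the constant T (assumed > 1; its value is not stated here).
    """
    if threshold <= 1.0:
        raise ValueError("threshold is assumed to be greater than 1")
    if h_leu == 0.0 and h_ile == 0.0:
        return "indeterminate"   # assumption: no evidence either way
    if h_ile == 0.0:
        return "leucine"         # assumption: ratio treated as unbounded
    ratio = h_leu / h_ile
    if ratio > threshold:
        return "leucine"
    if ratio < 1.0 / threshold:
        return "isoleucine"
    return "indeterminate"
```

The symmetric band between $1/T$ and $T$ is what produces the residues reported as indeterminate in the final sequence list.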

Results and Discussion

A collection of 43 preclassified reference spectra was used in the development and testing of the prototype program. All but one of the spectra were generated on the JEOL HX110/HX110 tandem mass spectrometer (JEOL, Peabody, MA) in our laboratory. [The spectrum for DVVLVDAGLK, generated on a Kratos Concept II instrument (Kratos Analytical, Ramsey, NJ), was obtained from the University of California, San Francisco Mass Spectrometry Resource.]

The object of experimentation was to classify each of the reference spectra multiple times by using unbiased neural networks, that is, networks that actually had not been trained on data from the spectra under examination. These aims were accomplished by performing three independent testing runs, during each of which all of the spectra were classified once. To prepare for a given testing run the list of spectra was ordered randomly and then divided into four approximately equal partitions (11, 11, 11, and 10 spectra). Networks to test the spectra of a given partition were trained on training vector pairs constructed from data in the other three partitions of the given run, that is, the spectra in partition A were tested by using networks trained on the data from partitions B, C, and D pooled as a single training set; those in partition B were tested by using networks trained on the data from partitions A, C, and D, and so forth. These four tests constituted a "run." Because three independent runs of four partitions were performed, it was necessary to train 12 independent neural network pairs on 12 separate sets of training data.
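The partitioning scheme is a four-fold cross-validation repeated three times. A minimal Python sketch, with hypothetical function names and a round-robin split chosen so that 43 items fall into partitions of 11, 11, 11, and 10:

```python
import random


def make_run(spectra, n_partitions=4, rng=None):
    """Randomly order the spectra and split them into near-equal partitions."""
    rng = rng if rng is not None else random.Random()
    shuffled = list(spectra)
    rng.shuffle(shuffled)
    # Round-robin slicing: 43 items give partition sizes 11, 11, 11, and 10.
    return [shuffled[i::n_partitions] for i in range(n_partitions)]


def train_test_splits(partitions):
    """Yield (training_set, test_partition) with the other partitions pooled."""
    for i, test in enumerate(partitions):
        train = [s for j, part in enumerate(partitions) if j != i for s in part]
        yield train, test
```

Calling `make_run` three times and iterating `train_test_splits` over each result yields the 12 training sets mentioned above.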

The data sets for training these neural network pairs were derived by using a separate C language program on a Zeos International computer with an Intel 80486 CPU. Given the known peptide sequence, the program created two training data files for each spectrum, one for each of the two network architectures. These data files, written in ASCII format, were then ported to a UNIX workstation (DEC, Nashua, NH) where they were combined as appropriate to construct the training data set for each set of neural networks.

As a result of testing the training performance of a number of network architectures, both networks in each test set were given a single hidden layer of 60 processing units and trained for 500 epochs. Networks of the first type achieved approximately 85% accuracy on the training sets; networks of the second type achieved approximately 90% accuracy. Each of the trained network pairs was separately incorporated into the program to sequence the spectra of the associated partition.

The results of this testing on the 43 spectra are shown in Table 2. The first column displays the known sequence. The second column lists the partitions, designated A, B, C, or D, to which the peptide spectra belonged during the three respective testing runs. (For example, peptide AA was in partition D in the first run, partition C in the second run, and partition A in the third run.) The third and fourth columns give the orders in which the correct interpretations were found in the program's output list for each of the three runs. The entire testing process was performed once with fixed scoring constants to obtain the results of column three. Column four lists the results obtained when the tests were repeated on the same neural networks but with separately derived scoring constants for each partition. The final column lists the average time in seconds taken to generate the sequences.

During the evolution of the program, scoring constants $c_1$, $c_2$, and $c_3$ equal to 2.0, 0.55, and 2.0 were manually selected through a trial-and-error process to produce optimal results. These are the constants used to obtain the ranks in column three. However, this column cannot be regarded as truly indicative of how the program would perform on unknown spectra in general, because the constants were not derived independently of the test spectra. These results were included, nevertheless, to demonstrate the stability of interpretation between runs when the neural networks were the only items that were varied.

To obtain the more objective results of column four, scoring constants were derived separately for each partition via a technique that optimized the sequencing results when the program was run on the spectra in the other three partitions of the associated run. In this way, neither the neural networks nor the scoring constants used to test a partition had "seen" the spectra in the partition.

The scoring function $S$ can be thought of as the inner product of two vectors, $C = [c_1, c_2, c_3]^T$ and $r = [r_y, r_i, r_p]^T$. If we have the vector $r$ of the true peptide sequence and the vector of an incorrect sequence, designated $r^*$, then if the scoring function operates properly, $S > S^*$ ($S^* = C^T r^*$). Thus, if we define $X = (r - r^*)$, we should have $C^T X > 0$.

The trained neural networks to be used to test a given partition were incorporated into the program, the scoring ratios were weighted equally ($C = [1.0, 1.0, 1.0]^T$), and the program was used to sequence each of the spectra in the other three partitions of the run. For each spectrum, if the true sequence was found in the final sequence list, its vector $r$ and the vectors $r^*$ of the top 10 incorrect sequences were used to construct 10 $X$ vectors. These vectors were accumulated into a data file of vectors $X_i$, $i = 1, \ldots, K$, on which the linear learning machine [20] optimization technique was employed to derive final values of $C$ for testing the partition as follows:

1. Set $C = [3^{-0.5}, 3^{-0.5}, 3^{-0.5}]^T$.

2. Set $\Delta C_n = 0$.

3. For each vector $X_i$, $i = 1, \ldots, K$: if $C_n^T X_i < 0$, augment $\Delta C_n$ by $\gamma X_i$, where $\gamma = -0.001\, C_n^T X_i / X_i^T X_i$.

4. Set $C_{n+1} = C_n + \Delta C_n$.

5. Normalize $C_{n+1}$.

6. Loop back to step 2 until $C$ stabilizes.

$C$ was normalized throughout the algorithm to prevent its elements from taking on extremely large values. Approximately 2500 iterations of the loop were required during each derivation for the constants to arrive at stable values. The values of the constants for each partition are displayed in Table 3.
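The update loop can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the iteration cap, and the stabilization tolerance are assumptions, while the starting vector, the 0.001 factor, and the per-pass normalization follow the steps listed above.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def derive_constants(xs, rate=0.001, max_iter=2500, tol=1e-9):
    """Derive scoring constants C from difference vectors X_i = r - r*.

    xs: list of 3-element difference vectors.
    max_iter and tol are assumed stopping controls, not values from the text.
    """
    c = [3 ** -0.5] * 3                       # step 1: equal unit-length weights
    for _ in range(max_iter):
        delta = [0.0, 0.0, 0.0]               # step 2
        for x in xs:                          # step 3
            dot = sum(ci * xi for ci, xi in zip(c, x))
            if dot < 0.0:                     # true sequence scored lower
                gamma = -rate * dot / sum(xi * xi for xi in x)
                delta = [d + gamma * xi for d, xi in zip(delta, x)]
        new_c = normalize([ci + di for ci, di in zip(c, delta)])  # steps 4 and 5
        if max(abs(a - b) for a, b in zip(new_c, c)) < tol:       # step 6
            return new_c
        c = new_c
    return c
```

Each correction nudges $C$ toward a violated $X_i$ by a small fraction of the violation, which is consistent with the large number of passes reported before the constants settle.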

It is apparent from examination of column three ofTable 2 that the program gave reasonably consistentperformance over the three runs when only the neuralnetworks were varied, because 34 (79%) of the sequences received the same rank over the three tests.Twenty-seven (63%) of the correct sequence interpretations received the top rank in the sequence list in allthree test runs, whereas 40 (93%) placed in the topfive.

In only two cases, NDIAAK and HGTVVLTALGGILK, were the correct sequences not found in the sequence list. The reason that NDIAAK was not listed is that the LEU/ILE differentiation step misclassifies the isoleucine residue as leucine. In all three testing runs, however, NDLAAK is listed as the top sequence. In the case of HGTVVLTALGGILK, the sequence HGTVVLTAXZVLZ is the top scoring sequence in every run. The spectrum of HGTVVLTALGGILK was closely examined and it was found that no ions in the family that resulted from the cleavage between the two glycines were present. Thus, there was no


Table 2. Results of sequencing program on preclassified reference spectra in the three independent testing runs^a

| Peptide | Test partitions | Rank (constants fixed) | Rank (constants variable) | Avg CPU time (s) |
|---|---|---|---|---|
| AA | D, C, A | 1, 1, 1 | 1, 1, 1 | 0 |
| AAA | D, A, C | 1, 1, 1 | 1, 1, 1 | 0 |
| AAAAA | B, C, A | 1, 1, 1 | 1, 1, 1 | 0 |
| AKTE | D, D, A | 2, 2, 2 | 2, 2, 6 | 0 |
| ALELFR | C, B, C | 2, 2, 2 | 1, 1, 1 | 2 |
| ASnVSKTE | D, C, B | 1, 1, 1 | 1, 1, 2 | 4 |
| DDE | C, D, D | 1, 1, 1 | 1, 1, 1 | 0 |
| DVVLVDAGLK | C, B, B | 1, 2, 5 | 7, 7, 7 | 4 |
| EDLIAY | B, B, D | 2, 2, 2 | 4, 3, 2 | 1 |
| EDLIAYLK | A, B, B | 4, 3, 3 | 2, 3, 1 | 4 |
| EMPFPK | B, A, D | 1, 1, 1 | 1, 1, 1 | 2 |
| ETTIDK | A, B, C | 1, 1, 1 | 1, 1, 1 | 2 |
| FVQWLMNT | B, A, B | 1, 1, 1 | 1, 1, 1 | 3 |
| GGGG | A, D, C | 1, 1, 1 | 1, 1, 1 | 1 |
| GITWK | A, C, A | 1, 1, 1 | 1, 1, 1 | 1 |
| GLLG | B, A, B | 2, 2, 2 | 3, 3, 2 | 1 |
| HGTVVLTALGGILK | A, D, B | | | 6 |
| HKIPIK | D, C, D | 1, 2, 1 | 1, 4, 2 | 2 |
| IFVQK | A, B, B | 1, 1, 1 | 1, 1, 1 | 1 |
| IHPF | B, B, B | 1, 1, 1 | 1, 1, 1 | 1 |
| III | B, A, D | 1, 1, 1 | 1, 1, 1 | 1 |
| IVV | D, D, B | 1, 1, 1 | 1, 1, 1 | 0 |
| LFTGHPETLEK | C, D, C | 3, 2, 4 | 7, 3, 6 | 7 |
| LGG | D, D, A | 1, 1, 1 | 1, 1, 1 | 0 |
| LLVY | A, B, D | 1, 1, 1 | 1, 1, 1 | 0 |
| LRRASLG | D, A, C | 3, 3, 3 | 4, -, 17 | 2 |
| MAS | C, D, A | 1, 1, 1 | 1, 1, 1 | 1 |
| MGMM | A, A, D | 1, 1, 1 | 1, 1, 1 | 1 |
| MIFAGIK | B, B, D | 1, 1, 1 | 1, 1, 1 | 3 |
| MRF | A, A, C | 1, 1, 1 | 1, 1, 1 | 1 |
| NDIAAK | C, B, A | | | 2 |
| QEPVLGPVR | A, C, D | 2, 3, 3 | 2, 2, 2 | 3 |
| SGAGAG | D, D, A | 1, 1, 1 | 1, 1, 1 | 1 |
| TGPNLHGLFGR | C, C, C | -, 15, 17 | 6, 4, 5 | 8 |
| TKY | B, C, C | 2, 3, 4 | 2, 2, 3 | 0 |
| TSQVAPA | C, A, B | 1, 1, 3 | | 2 |
| TYSK | C, C, A | 1, 1, 1 | 1, 1, 1 | 1 |
| VLPVPQK | B, B, B | 1, 1, 1 | 1, 1, 1 | 2 |
| VLS | A, A, D | 1, 1, 1 | 1, 1, 1 | 0 |
| VLSEG | B, C, C | 1, 1, 1 | 2, 2, 2 | 1 |
| YIPGTK | C, A, A | 1, 1, 1 | 2, 2, 2 | 2 |
| YKT | C, D, C | 1, 1, 2 | 1, 1, 1 | 1 |
| YPVEPF | D, C, A | 1, 1, 1 | 1, 1, 1 | 1 |

^a Column two indicates the partition of the peptide for the three testing runs. Column three displays the test results when fixed scoring constants (2.0, 0.55, 2.0) were used. Column four shows the results when the constants shown in Table 3 were used. Column five is the average time taken to sequence the peptide. (Absence of rank figures indicates the correct sequence was not among the 50 highest scoring sequences. An indicated CPU time of 0 s indicates less than 0.5 s.)


Table 3. Scoring constants derived for each partition of the three testing runs (the sequencing results when these constants were used are displayed in column four of Table 2)

| Test partition | c1 | c2 | c3 |
|---|---|---|---|
| 1A | 0.396 | 0.413 | 0.82 |
| 1B | 0.409 | 0.478 | 0.777 |
| 1C | 0.364 | 0.46 | 0.81 |
| 1D | 0.55 | 0.276 | 0.788 |
| 2A | 0.394 | 0.474 | 0.787 |
| 2B | 0.343 | 0.477 | 0.809 |
| 2C | 0.437 | 0.447 | 0.781 |
| 2D | 0.673 | 0.33 | 0.662 |
| 3A | 0.542 | 0.383 | 0.748 |
| 3B | 0.342 | 0.407 | 0.847 |
| 3C | 0.398 | 0.444 | 0.803 |
| 3D | 0.405 | 0.394 | 0.825 |
| Average values | 0.438 | 0.415 | 0.788 |

resulting y peak for this cleavage in the idealized spectrum and, consequently, the incomplete sequence HGTVVLTAL was never extended by glycine.

Column four displays somewhat surprising results, as performance actually improved on six of the spectra, most notably on that of TGPNLHGLFGR. Twenty-six (60%) of the correct sequences received the top rank, whereas 35 (81%) placed in the top five. Performance declined on 11 of the spectra, particularly on those of LRRASLG and TSQVAPA. To determine the cause of the decreased performance on TSQVAPA, the sequencing list was inspected at the end of each extension step. It was learned that, with the new constants, the incomplete sequence TSQ was purged from the list after the third extension. When the correct sequence is eliminated at any stage of sequencing, the true sequence will not appear in the final list. The program was modified to allow manual insertion of sequences into the final sequence list to see how TSQVAPA would fare. When it was inserted, it placed first in all three runs.

Because TSQVAPA is eliminated at an early stage by the sequence purging process, the obvious solution in its case is to retain a larger number of sequences during purging. The program was modified to test this hypothesis and it was found that when the number of sequences retained was increased to 200, TSQVAPA was correctly sequenced each run by using the constants of Table 3. It was found that TSQ placed at position 130, 128, and 117 at the end of the third extension step in the three respective runs. The sequencing time doubled from 2 to 4 s.

The problem presented by HGTVVLTALGGILK is much more complex. Any approach must involve allowing sequence extension by amino acids that are not corroborated by idealized y peaks. Therefore, the program was modified so that incomplete sequences were extended by all the amino acid types instead of only those supported by a peak in the idealized spectrum before repeating the tests of column three on this peptide. In all three of the tests, however, the correct sequence was still omitted from the final list. Even though the incomplete sequence, HGTVVLTAL, was extended by glycine, the correct sequence was eliminated in one of the purging steps. When the number of sequences retained during purging was increased to 500, the program still omitted the correct sequence from the final lists and processing time increased to 3 min and 45 s. When the sequencing algorithm was left unchanged and the correct sequence was manually inserted into the final lists, it placed at ranks 13, 20, and 8 when the fixed constants were used. When the scoring constants of Table 3 were used, it placed at ranks 46, 18, and 43. Thus, even if the correct partial sequence had not been eliminated in purging, the complete sequence would not have fared well in the final lists.

From examination of the mass spectrum of HGTVVLTALGGILK, it seems that the fault for the poor performance lies more with the data itself than with the sequencing algorithm. Three consecutive cleavages are very poorly represented by fragment ions in the original spectrum. As previously stated, the glycine-glycine cleavage is backed by no ions of the 11 types in the original spectrum. However, this is not the only cleavage with little supporting information in the original spectrum. The cleavage immediately preceding it (leucine-glycine) is supported only by an $a_n$ peak and the one immediately after is supported by only two $d_n$-type peaks. The presence of these ions results in two very minor peaks in the idealized spectrum. Because of the poor quality of the data for these cleavage locations, the program performance naturally diminishes as the sequences are extended into this region.

We have concluded from observation of the neural network training that performance would probably be improved significantly by training the neural networks on much larger amounts of data, derived perhaps from hundreds of spectra. For ANNs such as are used by this program to operate effectively, the data sets on which they are trained need to contain sufficiently large and varied distributions of patterns so that they represent well the problem space in general. If we assume that the patterns are well distributed over the input space, the larger the data set, the more likely that the trained network will have a stable classification power in mapping from input space to classification space. This ability is often called generalization. If the data are somewhat sparse, however, the neural networks will eventually overtrain on the training patterns and essentially memorize them without acquisition of additional ability to classify unknowns.

One way to determine if overtraining is taking place is to set aside a certain percentage of the patterns in the data set to be used only for monitoring training progress, while the rest of the patterns are used to compute the actual weight corrections. This method was incorporated into the software used to construct and train the ANNs used in the sequencing program. In each training session, 25% of the patterns were randomly selected and set aside to be used only for monitoring training progress; the rest of the patterns were used to compute the actual weight corrections. At the end of each training epoch, the cumulative mean-squared error for both sets of patterns was written to a file. In the early epochs in training, both error measures declined, but when overtraining began to occur, the mean-squared error on the training patterns continued to decline, while the mean-squared error on the monitoring patterns leveled off and actually began to increase. The left graph of Figure 9 shows the error profiles for a network like that used as network one when trained for 1000 epochs on a data set generated from all 43 reference spectra. The mean-squared error on the training patterns (lower curve) declined steadily throughout the training, but the mean-squared error of the set aside patterns (upper curve) began to increase at approximately 500 epochs. The minimum mean-squared error on the monitoring patterns was actually attained on a downward spike at epoch 140. The right side of the figure displays a similar graph for a network of type two. It, too, began to overtrain within the first 500 training epochs, with minimum mean-squared error attained on the monitoring patterns at epoch 310.

We designed the neural network software used in this project to save neural networks during training when they are at their maximum generalizing potential, as estimated from monitoring the mean-squared error of the patterns set aside from computing the weight corrections. A network is saved to a disk file during training whenever its mean-squared error on the set aside patterns reaches a new minimum. For example, the software saved the two networks illustrated in Figure 9 at epochs 140 and 310, respectively. Thus, all the networks used in the testing process were saved at a fairly early stage in training. With a much larger data set, overtraining would be delayed until a much later training epoch, and network performance would increase correspondingly.
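The monitoring scheme amounts to what is now usually called early stopping with checkpointing. The Python sketch below illustrates the two ingredients, the random 25% hold-out and the save-on-new-minimum rule; the function names are hypothetical and the actual training code is not shown in the paper.

```python
import random


def split_patterns(patterns, monitor_fraction=0.25, rng=None):
    """Randomly set aside a fraction of the patterns for monitoring only.

    Returns (training_patterns, monitoring_patterns).
    """
    rng = rng if rng is not None else random.Random()
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    n_monitor = int(round(len(shuffled) * monitor_fraction))
    return shuffled[n_monitor:], shuffled[:n_monitor]


def checkpoint_epochs(monitor_errors):
    """Return the 1-based epochs at which the network would be saved.

    monitor_errors: mean-squared error on the monitoring patterns per epoch.
    A save happens whenever the monitoring error reaches a new minimum.
    """
    best = float("inf")
    saves = []
    for epoch, error in enumerate(monitor_errors, start=1):
        if error < best:
            best = error
            saves.append(epoch)
    return saves
```

The last entry returned by `checkpoint_epochs` corresponds to the network that is finally kept, which for the two networks discussed above would be epochs 140 and 310.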

Forty-two amino acid residues in the set of reference spectra were either leucine or isoleucine. Twenty-seven of these (64.3%) were identified correctly by the differentiation routine. Only the isoleucine residue in NDIAAK was misidentified as leucine, and all the rest (33.3%) were identified as indeterminate. These ambiguous residues are denoted by Xs in the final sequence list.

In the development of this program, it was observed that glutamine and lysine could not be distinguished with a high degree of confidence. To remind the user of this fact, the letter Z is used to denote both glutamine and lysine in the final list. This ambiguity may pose little difficulty if the history of the peptide is known (e.g., if it was derived from a tryptic cleavage) or if the acetylated derivative has been analyzed.

The final sequencing program was implemented in C language and X/Motif on a Silicon Graphics Indigo workstation (SGI Inc., Mountain View, CA). Figure 10 shows the program's graphical output when tested on the spectrum for VLSEG. The main window is divided into two scrollable viewports. The top displays the original preprocessed spectrum, whereas the bottom viewport depicts the idealized spectrum. When the user selects "Build sequences..." from the control menu, the program pops up a sequence dialog to display the results. If one of the displayed sequences is selected, a dotted sequence trace line will connect the tops of the peaks in the idealized spectrum.

The program also allows the user to select a peak from the original spectrum by entering a mouse click in the top viewport. Choosing "Peak info..." from the control menu displays a dialog box that contains the

Figure 9. Graphs of the training progress of an ANN of type one (left) and of an ANN of type two (right). Both graphs show how the logarithm of the cumulative mean-squared error terms behaved as the networks were trained. In each graph, the bottom curve shows the mean-squared error for the patterns used to compute weight corrections, whereas the top is of the mean-squared error of the patterns set aside for monitoring training progress. (Axes: log MSE versus training iteration.)


Figure 10. The sequencing program display as it appears in the X/Motif environment. In the actual screen display, the peaks in the upper spectrum are color coded to indicate the ion type as determined by the neural network module.

classification results of both neural networks for the chosen peak. If the user selects an idealized peak from the lower viewport, small tick marks will appear in the top viewport under the peaks from which the selected idealized peak was derived.

The sequence trace greatly assists the user in estimating the validity of the sequence interpretation. Generally for correct sequences, the trace connects the tops of prominent peaks. In some cases the user can estimate the correctness of parts of the overall sequence. In sequencing the spectrum of HGTVVLTALGGILK, for example, the program gave HGTVVLTAXZVLZ as the top rated interpretation in each of the test runs with fixed constants and in two of the test runs with separate scoring constants. When the sequence HGTVVLTAXZVLZ is selected in the sequence dialog, the segments of the trace that correspond to HGTVVLTAX connect prominent peaks, which suggests that this portion of the sequence is probably correct. The sections of the idealized spectra through which the rest of the traces passed, however, contained only small peaks.

Conclusions

The significant time required to manually sequence peptides from high-energy CID tandem mass spectra necessitates the development of computer programs to perform this task if the speed of mass spectrometry is to be fully exploited. The results of the work described here demonstrate that artificial neural networks may be used effectively to meet this need. The program described in this paper sequences peptides in a very timely manner, with the longest time taken so far being only 6 s for a peptide of nominal mass 1168 u. The success rate is not such that the user is freed entirely from scrutinizing the results. Nevertheless, the visual output provided allows the user to qualitatively judge the results so that overall interpretation time may be drastically reduced relative to manual interpretation.

The results obtained with this prototype algorithm are sufficiently indicative of the viability of this approach to warrant proceeding with development of neural networks trained with much larger data sets (i.e., hundreds of spectra). In addition, after the collection of a much larger set of preclassified spectra from which to generate data, we intend to investigate more thoroughly various input vector formats and the relative importance of individual elements. Such training and analysis may entail use of supercomputer capability to be accomplished in a reasonable time because the training of a single network pair on the current data sets requires several hours. We are, therefore, proceeding with plans to extend this approach to larger training sets of unmodified peptide spectra as well as to spectra of modified peptides.


Intuitively, it seems that this algorithm also could be applied to the interpretation of low-energy CID mass spectra of peptides; however, because of the different character of low-energy CID spectra, such an adaptation of the program would require training of the neural networks on such spectra, and the rules governing the presence of fragmentation ions also would have to be modified. The current algorithm assumes the mass resolution of the peak data to be sufficiently high that peak masses may be converted to integers and that subsequent mass differences between the peaks may be safely compared to the nominal amino acid residue masses with no margin of error. Because low-energy CID spectra are typically recorded at a much lower mass resolution than high-energy CID spectra, a margin of error might have to be introduced to compare mass differences between ions. This would undoubtedly introduce additional programming complexity.
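The difference between exact integer comparison and comparison with a margin of error can be illustrated with a short Python sketch. It is a hypothetical illustration rather than the program's code: the function name `matching_residues`, the `tolerance` parameter, and the example masses are assumptions made for this example. The nominal residue masses are the standard integer values for the 20 common amino acids.

```python
# Nominal (integer) residue masses of the 20 common amino acids.
# Leu/Ile (113) and Gln/Lys (128) cannot be told apart by nominal mass.
NOMINAL_RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def matching_residues(mass_low, mass_high, tolerance=None):
    """Return the residues whose nominal mass matches the gap between
    two peaks, as a sorted list of one-letter codes.

    tolerance=None: both masses are assumed to be integers already, and
    the difference must equal a residue mass exactly (no margin of error).
    tolerance=t: the masses may be fractional, and any residue within
    +/- t of the difference is accepted (hypothetical low-resolution mode).
    """
    diff = mass_high - mass_low
    if tolerance is None:
        return sorted(r for r, m in NOMINAL_RESIDUE_MASS.items() if m == diff)
    return sorted(
        r for r, m in NOMINAL_RESIDUE_MASS.items() if abs(m - diff) <= tolerance
    )


# Exact integer comparison: a gap of 113 matches only Leu/Ile.
exact = matching_residues(400, 513)

# With a margin of error the candidate set can grow (invented masses).
narrow = matching_residues(400.2, 513.6, tolerance=0.5)
wide = matching_residues(400.2, 513.6, tolerance=1.0)
```

With the wider tolerance, the same pair of peaks also admits Asn (114) as a candidate, which is one concrete way in which a margin of error would add ambiguity and bookkeeping to the sequencing step.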

Acknowledgments

We would like to thank Wade M. Hines for the spectrum of the peptide DVVLVDAGLK. Work was supported in part by NIH grant EY01l23Y. The JEOL HX110/HX110 instrument was funded in part by NSF grant DIR H~-tJ45()2.

References

1. Biemann, K. Annu. Rev. Biochem. 1992, 61, 977-1010.
2. Ishikawa, K.; Niwa, Y. Biomed. Environ. Mass Spectrom. 1986, 13, 373-380.
3. Hamm, C. W.; Wilson, W. E.; Harvan, D. J. Comput. Appl. Biosci. 1986, 2, 115-118.
4. Johnson, R. S.; Biemann, K. Biomed. Environ. Mass Spectrom. 1989, 18, 945-957.
5. Hines, W. M.; Falick, A. M.; Burlingame, A. L.; Gibson, B. W. J. Am. Soc. Mass Spectrom. 1992, 3, 326-336.
6. Biemann, K. Methods Enzymol. 1990, 193, 886-887.
7. Johnson, R. S.; Martin, S. A.; Biemann, K. Int. J. Mass Spectrom. Ion Processes 1988, 86, 137-154.
8. Johnson, R. S.; Martin, S. A.; Biemann, K.; Stults, J. T.; Watson, J. T. Anal. Chem. 1987, 59, 2621-2625.
9. Johnson, R. S. Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1988.
10. Downard, K. M.; Biemann, K. J. Am. Soc. Mass Spectrom. 1993, 4, 874-881.
11. Mueller, P.; Lazzaro, J. In Neural Networks for Computing; Denker, J., Ed.; American Institute of Physics: New York, 1986.
12. Scarberry, R. E.; Zhang, Z. Proceedings of the Artificial Neural Networks in Engineering Conference; St. Louis, MO, 1991; p 351.
13. Martin, G. L.; Pittman, J. A. Proceedings of the IEEE Conference on Neural Information Processing Systems; San Mateo, CA, November 1990; pp 405-414.
14. Goodacre, R.; Karim, A.; Kaderbhai, M. A.; Kell, D. B. J. Biotechnol. 1994, 34, 185-193.
15. Lippmann, R. P. IEEE ASSP Mag. 1987, 4, 4-22.
16. Kolmogorov, A. N. Dokl. Akad. Nauk USSR 1957, 114, 953-956.
17. Hecht-Nielsen, R. Proceedings of the International Conference on Neural Networks II; IEEE Press: New York, 1987; pp 131-139.
18. Werbos, P. J. Doctoral Dissertation, Appl. Math., Harvard University, 1974.
19. Rumelhart, D. E.; McClelland, J. L., Eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, I & II; MIT Press: Cambridge, MA, 1986.
20. Jurs, P. C. In Computer Software Applications in Chemistry; Wiley: New York, 1986; p 186.
