Computational Methods For Identification Of Cyclic Peptides Using Mass Spectrometry. Julio Ng, Bioinformatics Program, UCSD. March 26, 2010


  • Computational Methods For Identification Of Cyclic Peptides Using Mass Spectrometry

    Julio Ng, Bioinformatics Program, UCSD

    March 26, 2010

  • Outline

    • Importance of natural products
    • Mass spectrometry on cyclic peptides
    • Computational methods to analyze MS data
    • Demo

  • Natural Products

    • In 1928, A. Fleming discovered antibiotic activity of penicillin

    • The beginning of the modern era of drug discovery

    Alexander Fleming

  • Natural Products

    • Chemical compounds with biological activity
    • Antibiotics (colistin)
    • Immunosuppressors (cyclosporin)
    • Antiviral agents (luzopeptin A)
    • Antitumor agents (phakellistatin)
    • Toxins (amanitin)

  • Natural Products

    Natural Products as Sources of New Drugs over the Last 25 Years!

    David J. Newman* and Gordon M. Cragg

    Natural Products Branch, Developmental Therapeutics Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute-Frederick, P.O. Box B, Frederick, Maryland 21702

    Received October 10, 2006

    This review is an updated and expanded version of two prior reviews that were published in this journal in 1997 and 2003. In the case of all approved agents the time frame has been extended to include the 25 1/2 years from 01/1981 to 06/2006 for all diseases worldwide and from 1950 (earliest so far identified) to 06/2006 for all approved antitumor drugs worldwide. We have continued to utilize our secondary subdivision of a “natural product mimic” or “NM” to join the original primary divisions. From the data presented, the utility of natural products as sources of novel structures, but not necessarily the final drug entity, is still alive and well. Thus, in the area of cancer, over the time frame from around the 1940s to date, of the 155 small molecules, 73% are other than “S” (synthetic), with 47% actually being either natural products or directly derived therefrom. In other areas, the influence of natural product structures is quite marked, with, as expected from prior information, the antiinfective area being dependent on natural products and their structures. Although combinatorial chemistry techniques have succeeded as methods of optimizing structures and have, in fact, been used in the optimization of many recently approved agents, we are able to identify only one de novo combinatorial compound approved as a drug in this 25 plus year time frame. We wish to draw the attention of readers to the rapidly evolving recognition that a significant number of natural product drugs/leads are actually produced by microbes and/or microbial interactions with the “host from whence it was isolated”, and therefore we consider that this area of natural product research should be expanded significantly.

    It is over nine years since the publication of our first,1 and three years since the second,2 analysis of the sources of new and approved drugs for the treatment of human diseases, both of which indicated that natural products continued to play a highly significant role in the drug discovery and development process.

    That this influence of Nature in one guise or another has continued is shown by inspection of the information given below, where with the advantage of now over 25 years of data, we have been able to refine the system, eliminating a few duplicative entries that crept into the original data sets. In particular, as behooves authors from the National Cancer Institute (NCI), in the specific case of cancer treatments, we have gone back to consult the records of the FDA and added to these, comments from investigators who have informed us over the past two years of compounds that may have been approved in other countries and that were not captured in our earlier searches. These cancer data will be presented as a stand-alone section as well as including the last 25 years of data in the overall discussion.

    As we mentioned in our 2003 review,2 the development of high-throughput screens based on molecular targets had led to a demand for the generation of large libraries of compounds to satisfy the enormous capacities of these screens. As we mentioned at that time, the shift away from large combinatorial libraries has continued, with the emphasis now being on small, focused (100 to ∼3000) collections that contain much of the “structural aspects” of natural products. Various names have been given to this process, including “Diversity Oriented Syntheses”,3-6 but we prefer to simply say “more natural product-like”, in terms of their combinations of heteroatoms and significant numbers of chiral centers within a single molecule,7 or even “natural product mimics” if they happen to be direct competitive inhibitors of the natural substrate. It should also be pointed out that Lipinski’s fifth rule effectively states that the first four rules do not apply to natural products or to any molecule that is recognized by an active transport system when considering “druggable chemical entities”.8-10

    Although combinatorial chemistry in one or more of its manifestations has now been used as a discovery source for approximately 70% of the time covered by this review, to date, we can find only one de novo new chemical entity (NCE) reported in the public domain as resulting from this method of chemical discovery and approved for drug use anywhere. This is the antitumor compound known as sorafenib (Nexavar, 1) from Bayer, approved by the FDA in 2005. It was known during development as BAY-43-9006 and is a multikinase inhibitor, targeting several serine/threonine and receptor tyrosine kinases (RAF kinase, VEGFR-2, VEGFR-3, PDGFR-beta, KIT, and FLT-3) and is in multiple clinical trials as both combination and single-agent therapies at the present time, a common practice once approved for one class of cancer treatment.

    As mentioned by the authors in prior reviews on this topic and others, the developmental capability of combinatorial chemistry as a means for structural optimization once an active skeleton has been identified is without par. The expected surge in productivity, however, has not materialized; thus, the number of new active substances (NASs), also known as New Chemical Entities (NCEs), which we consider to encompass all molecules, including biologics and vaccines, from our data set hit a 24-year low of 25 in 2004 (though 28% of these were assigned to the ND category), with a rebound to 54 in 2005, with 24% being N or ND and 37% being biologics (B) or vaccines (V). Fortunately, however, research being conducted by groups such as Danishefsky’s, Ganesan’s, Nicolaou’s, Porco’s, Quinn’s, Schreiber’s, Shair’s, Waldmann’s, and Wipf’s is continuing the modification of active natural product skeletons as leads to novel agents, so in due course, the numbers of materials developed by linking Mother Nature to combinatorial synthetic techniques should increase. This aspect, plus the potential contributions from the utilization of genetic analyses of microbes, will be discussed at the end of this review.

    Against this backdrop, we now present an updated analysis of the role of natural products in the drug discovery and development process, dating from 01/1981 through 06/2006. As in our earlier

    ! Dedicated to the late Dr. Kenneth L. Rinehart of the University of Illinois at Urbana-Champaign for his pioneering work on bioactive natural products.

    * To whom correspondence should be addressed. Tel: (301) 846-5387. Fax: (301) 846-6178. E-mail: [email protected].

    J. Nat. Prod. 2007, 70, 461-477

    10.1021/np068054v This article not subject to U.S. Copyright. Published 2007 by the Am. Chem. Soc. and the Am. Soc. of Pharmacogn. Published on Web 02/20/2007

  • Natural Products

    • Searching for natural products
    • Plants
    • Micro-organisms
    • Marine organisms
    • Animals

    • A large subclass of natural products are nonribosomal peptides

  • Central Dogma of Biology

    NRP

  • Non-ribosomal Peptide Synthetase (NRPS)

    Sieber and Marahiel 2005

    2. NRPS Factory

    Although structurally diverse, most biologically produced peptides share a common mode of synthesis, the multienzyme thiotemplate mechanism.2,6,40 According to this model peptide bond formation takes place on large multienzyme complexes, which simultaneously represent template and biosynthetic machinery. Sequencing of genes encoding NRPSs of bacterial and fungal origin provided insights into molecular architecture and revealed a modular organization.6 A module is a distinct section of the multienzyme that is responsible for the incorporation of one specific amino acid into the final product.3,6,41 It is further subdivided into a catalytically independent set of domains responsible for substrate recognition, activation, binding, modification, elongation, and release. Domains can be identified at the protein level by characteristic highly conserved sequence motifs. Thus far, 10 different domains are known within NRPS templates which catalyze independent chemical reactions and will be introduced in more detail in the following sections. As an example to illustrate basic principles, Figure 2 shows a prototype NRPS assembly line for the cyclic lipoheptapeptide surfactin.42

    The carboxy group of amino acid building blocks is first activated by ATP hydrolysis to afford the corresponding aminoacyl-adenylate. This reactive intermediate is transferred onto the free thiol group of an enzyme-bound 4′-phosphopantetheinyl cofactor (ppan), establishing a covalent linkage between enzyme and substrate. At this stage the substrate can undergo modifications such as epimerization or N-methylation. Assembly of the final product then occurs by a series of peptide bond formation steps (elongation) between the downstream building block with its free amine and the carboxy-thioester of the upstream substrate. The ppan cofactor facilitates the ordered transfer of thioester substrates between catalytically active units with all intermediates covalently tethered to the multienzyme until the product is released by the action of the C-terminal thioesterase (TE) domain (termination). This strategy minimizes side reactions as well as diffusion times. Type I polyketide synthases (PKS) and fatty acid synthases (FAS) similarly display a multienzymatic

    Figure 2. Surfactin assembly line. The multienzyme complex consists of seven modules (grey and red) which are specific for the incorporation of seven amino acids. Twenty-four domains of five different types (C, A, PCP, E, and TE) are responsible for the catalysis of 24 chemical reactions. Twenty-three reactions are required for peptide elongation, while the last domain is unique and required for peptide release by cyclization.

    718 Chemical Reviews, 2005, Vol. 105, No. 2 Sieber and Marahiel

  • Non-ribosomal Peptide Synthetase (NRPS)

    Sieber and Marahiel 2005

  • Thioesterase Domain

    Sieber and Marahiel 2005

  • Cyclosporin is highly lipophilic, and 7 of its 11 amino acids are N-methylated. This high degree of methylation protects the peptide from proteolytic digestion but complicates chemical synthesis due to low coupling yields and side reactions.35 In an iron-deficient environment some bacteria such as E. coli, B. subtilis, and Vibrio cholerae synthesize and secrete iron-chelating molecules known as siderophores that scavenge Fe3+ with picomolar affinity, important for host survival.36,37 Three catechol ligands derived from 2,3-dihydroxybenzoyl (DHB) building blocks in bacillibactin, enterobactin, and vibriobactin complex iron by forming intramolecular octahedra.

    Many nonribosomal peptide products presented here show distinct chemical modifications, important to specifically interact and inhibit certain cellular functions, which are essential for survival. The high toxicity of the peptide products could therefore also become a problem for the producer organism unless strategies for its own protection and immunity have been coevolved with antibiotic biosynthesis. This immunity is achieved by several strategies including efflux pumps, temporary product inactivation, and modifications of the target in the producer strain.3 The latter strategy is used by vancomycin-producing Streptomycetes by changing the D-Ala-D-Ala terminus of the peptidoglycan pentapeptide precursor to a D-Ala-D-lactate terminus, which reduces binding affinity to vancomycin 1000-fold.12

    Due to their exceptional pharmacological activities, many compounds such as cyclosporin and vancomycin have been synthesized nonenzymatically.38,39 Regio- and stereoselective reactions require the use of protecting groups as well as chiral catalysts. Moreover, macrocyclization and coupling of N-methylated peptide bonds are difficult to achieve in satisfying yields, indicating an advantage of natural vs synthetic strategies. Structural peculiarities of these complex peptide products suggested early on a nucleic-acid-independent biosynthesis facilitated by multiple catalytic domains expressed as a single multidomain protein. The diverse chemical reactions mediated by distinct enzymatic units will be the focus of the following sections.

    Figure 1. Natural peptidic products. A selection of nonribosomally synthesized peptides. Characteristic structural features are highlighted.

    Approaches to New Antibiotics Chemical Reviews, 2005, Vol. 105, No. 2 717

    Special Characteristics

    • Heterocyclic elements
    • D-amino acids
    • Glycosylated residues
    • N-methylated residues
    • Non-standard amino acids
    • Cyclic backbone

  • Cyclic Peptides

    http://bioinfo.lifl.fr/norine/ *

    out of 1122 entries in the database

    *Caboche et al, 2008

  • Mass Spectrometer

    Measures m/z

  • Sample Preparation (Protein Analysis)

    Enzymatic Digestion and Fractionation

  • Multi-Stage Mass Spectrometry

    Secondary Fragmentation

    Ionized parent peptide

    Mass Spectrometer

  • Fragmentation

    H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH

    [Diagram labels: AA residue i-1, AA residue i, AA residue i+1; N-terminus on the left, C-terminus on the right; H+]

  • Identification of Linear Mass Spectra

    [Figure: MS/MS spectrum with b-ion and y-ion peaks and the parent mass (PM)]

    Database search

    Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, LARGE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

    De novo sequencing

    [Figure: graph with single-letter and multi-letter amino acid labels; layout lost in extraction]

    LARGE

  • Challenges in Identification of Cyclic Peptide MS

    • Extensively modified amino acids
    • Non-standard amino acids
    • Cyclic backbone
    • Databases cannot be readily derived from genomic data

  • Cyclic Peptide Mass Spectrum

    MS1 – Mass of the intact cyclic peptide

    MS2 – Mass of the intact linear peptides

    MS3 – Masses of the peptide fragments
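The three levels above can be illustrated with a minimal sketch. It assumes integer residue masses (as used for the tyrocidines later in the deck) and ignores water, protons, and neutral losses; the function names are hypothetical and not taken from the talk.

```python
def cyclic_subpeptide_masses(residues):
    """Sorted masses of every contiguous arc of a cyclic peptide (lengths 1..n-1)."""
    n = len(residues)
    doubled = residues + residues  # unrolling the ring twice turns wrap-around arcs into slices
    return sorted(sum(doubled[s:s + k]) for s in range(n) for k in range(1, n))


def linearizations(residues):
    """The n linear peptides obtained by opening the ring at each bond."""
    return [residues[i:] + residues[:i] for i in range(len(residues))]


tyrocidine_a = [99, 114, 113, 147, 97, 147, 147, 114, 128, 163]
intact_mass = sum(tyrocidine_a)                      # MS1 level: one mass for the ring
ring_openings = linearizations(tyrocidine_a)         # MS2 level: same mass, n linear forms
fragments = cyclic_subpeptide_masses(tyrocidine_a)   # MS3 level: prefix masses of all forms
```

Because every linear form contributes its own prefix fragments, an idealized MS3 spectrum of a ring with n residues holds up to n(n-1) fragment masses, which is why linear-peptide tools do not apply directly.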

  • [Figure: annotated diagram; symbol labels lost in extraction]

    Seglitide: somatostatin receptor antagonist, used experimentally to treat Alzheimer’s disease

  • Cyc(A+14YWKV)

    [Figure: annotated diagram; symbol labels lost in extraction]

    Cyclic Mass Spectrum

    NRP-Dereplication

    NRP-Tagging

    NRP-Sequencing

    Identification of Cyclic Mass Spectra

  • NRP-Dereplication

    • Case 1:
      • There is a peptide in the database that matches the precursor mass of the spectrum.
      • Is this peptide a good match for the spectrum?

    • Case 2:
      • No peptide in the database matches the precursor mass of the spectrum
      • Can we change a peptide in the database so it becomes a good match for the given spectrum?

  • Simplified Dereplication Problem Formulation

    • Input: MS3 spectrum, a Peptide Sequence and parameter k
    • Output: A new Peptide Sequence k mutations away from the original Peptide Sequence such that the new peptide best explains the experimental spectrum.

    • In reality there are many peptides in the database, so the dereplication needs to be done for each peptide

    PEPTIDE
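For k = 1 the simplified problem can be sketched by brute force: place the precursor mass offset on each residue in turn and keep the variant that explains the most peaks. This is only an illustration under stated assumptions (integer masses, an idealized noise-free spectrum, a peak-counting score); it is not the correlated-subpeptide method shown on the following slides, and all names are hypothetical.

```python
def cyclic_masses(residues):
    """Set of masses of all contiguous arcs (lengths 1..n-1) of a cyclic peptide."""
    n = len(residues)
    doubled = residues + residues
    return {sum(doubled[s:s + k]) for s in range(n) for k in range(1, n)}


def dereplicate_k1(spectrum, peptide, precursor_mass):
    """Try the mass offset on every position; return (score, position, mutated peptide)."""
    delta = precursor_mass - sum(peptide)
    peaks = set(spectrum)
    best = None
    for i in range(len(peptide)):
        mutated = peptide[:i] + [peptide[i] + delta] + peptide[i + 1:]
        score = len(peaks & cyclic_masses(mutated))  # number of explained peaks
        if best is None or score > best[0]:
            best = (score, i, mutated)
    return best


tyrocidine_c = [99, 114, 113, 147, 97, 186, 186, 114, 128, 163]
tyrocidine_c1 = [99, 128, 113, 147, 97, 186, 186, 114, 128, 163]
simulated_spectrum = sorted(cyclic_masses(tyrocidine_c1))  # idealized stand-in for real data
score, position, mutated = dereplicate_k1(simulated_spectrum, tyrocidine_c, sum(tyrocidine_c1))
```

With real spectra the score would need a mass tolerance and intensity weighting, and the loop would run over every database peptide as the last bullet notes.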

  • Tyrocidine A 99, 114, 113, 147, 97, 147, 147, 114, 128, 163

    Tyrocidine A1 99, 128, 113, 147, 97, 147, 147, 114, 128, 163

    Tyrocidine B 99, 114, 113, 147, 97, 186, 147, 114, 128, 163

    Tyrocidine B1 99, 128, 113, 147, 97, 186, 147, 114, 128, 163

    Tyrocidine C 99, 114, 113, 147, 97, 186, 186, 114, 128, 163

    Tyrocidine C1 99, 128, 113, 147, 97, 186, 186, 114, 128, 163

    Tyrocidine Family (Bacillus brevis)

  • Dereplication (k = 1)

    A B C D E F

  • A B C D E F

    A AB

    NRP-Dereplication (k = 1)

  • A B C D E F

    A AB

    Δ

    ABC-Δ ABCD-Δ ABCDE-Δ

    NRP-Dereplication (k = 1)

  • A B C D E F

    A AB

    Δ

    ~DEF ~EF ~F

    NRP-Dereplication (k = 1)

  • A B C D E F

    A AB

    Δ

    ~DEF ~EF ~F

    FA E FA B D E F

    A B C D E F

    NRP-Dereplication (k = 1)

  • Dereplicating tyrocidine C and C1

    • Experimental spectrum:
      • Tyrocidine C1

    • Sequence:
      • Tyrocidine C VOLFPWWNQY

    • Offset:
      • 14 Daltons (O -> K)

    [Figure, panels a) and b): per-residue coverage for VOLFPWWNQY, drawn as concentric arcs and as a histogram. Number in the center of panel a): 32]

    Residue:   V-1   O-2   L-3   F-4   P-5   W-6   W-7   N-8   Q-9   Y-10
    Coverage:  9.0   2.5   7.0   9.0   13.0  16.0  19.0  19.5  18.0  15.0

    [Figure, panel c): coverage histograms for three further dereplications]

    Peptide        Tyrocidine A    Tyrocidine B    Tyrocidine B
    Sequence       VOLFPFFNQY      VOLFPWFNQY      VOLFPWFNQY
    Dereplicated   Tyrocidine A1   Tyrocidine B1   Tyrocidine C
    Sequence       VKLFPFFNQY      VKLFPWFNQY      VOLFPWWNQY

    Figure A-4: Dereplication results. a) NRP-Dereplication output for experimental spectrum of tyrocidine C1 (VKLFPWWNQY) given peptide sequence of tyrocidine C (VOLFPWWNQY). Concentric red-gray circles represent 0-correlated subpeptides (with peptide shown red and its complement shown gray) and δ-correlated subpeptides (with peptide shown gray and its complement shown red). Given this coloring convention, the amino acid coverage (number of red arcs covering an amino acid) represents supporting evidence that an amino acid did not change from the known to the unknown compound. The thick black circle separates 0-correlated subpeptides (shown inside) from δ-correlated subpeptides (shown outside). The outer counts represent the coverage for a given amino acid by red arcs and reveal the differing amino acid (O) as the amino acid with minimum coverage (2.5 vs. 7 for the next runner-up). The counts are normalized by the number of subpeptides per peak. For example, if a peak has two alternative subpeptide annotations, it will contribute 1/2 to the coverage. The width of the arcs is proportional to this weighting factor. The number in the center of the graphs is the total number of correlated subpeptides. b) Alternative representation of a) as a histogram that reveals the changed amino acid O. c) Additional dereplication results for the tyrocidine family.
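The weighting described in the caption can be sketched as follows. The data layout (one list of alternative annotations per peak, each annotation being the set of residue indices covered by a red arc) is an assumption made for illustration, as are the function names; the coverage values at the end are the ones read off panels a) and b).

```python
def coverage(n_residues, peak_annotations):
    """Accumulate per-residue coverage; each peak's alternatives share a total weight of 1."""
    totals = [0.0] * n_residues
    for alternatives in peak_annotations:
        weight = 1.0 / len(alternatives)  # two alternative annotations contribute 1/2 each
        for covered in alternatives:
            for index in covered:
                totals[index] += weight
    return totals


def least_covered(totals):
    """Index of the residue with the least supporting evidence of being unchanged."""
    return min(range(len(totals)), key=lambda i: totals[i])


# Toy input: peak 1 has one annotation covering residues 0 and 1,
# peak 2 has two alternatives covering residue 0 or residue 2.
toy = coverage(3, [[{0, 1}], [{0}, {2}]])

# Values read from the figure for tyrocidine C against the tyrocidine C1 spectrum.
figure_coverage = [9.0, 2.5, 7.0, 9.0, 13.0, 16.0, 19.0, 19.5, 18.0, 15.0]
changed = "VOLFPWWNQY"[least_covered(figure_coverage)]
```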


  • Dereplication Results

  • NRP-Dereplication Results on NORINE

    Compound | Top Match(es) | Dereplicated Compound | Score

    Destruxin A | Destruxin A[+14] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:1(3)-OH(2)[+14] | 0.45
    Destruxin A | HydroxyDestruxin B[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.3)[-18] | 0.45
    Destruxin A | Destruxin D[-32] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2)-CA(4)[-32] | 0.45
    Destruxin A | Destruxin E diol[-20] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3.4)[-20] | 0.45
    Destruxin A | Destruxin C[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.4)[-18] | 0.45
    Destruxin A | Destruxin F[-4] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)[-4] | 0.45
    Destruxin A | Destruxin B[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, Hiv[-2] | 0.45
    Destruxin A | Destruxin E[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2)-Ep(3)[-2] | 0.45
    Destruxin A | Destruxin E chlorohydrin[-38] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)-Cl(4)[-38] | 0.45
    Tyrocidine C | Tyrocidine C | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn, Leu | 0.45
    Tyrocidine C | Tyrocidine B[+39] | D-Phe, Pro, Trp, D-Phe[+39], Asn, Gln, Tyr, Val, Orn, Leu | 0.45
    Tyrocidine C | Tyrocidine D[-23] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Trp[-23], Val, Orn, Leu | 0.45
    Tyrocidine B1 | Tyrocidine B[+14] | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.44
    Tyrocidine C1 | Tyrocidine C[+14] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.40
    Tyrocidine A1 | Tyrocidine A[+14] | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.37
    Tyrocidine B | Tyrocidine B | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
    Tyrocidine B | Tyrocidine A[+39] | D-Phe, Pro, Phe[+39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
    Tyrocidine B | Tyrocidine C[-39] | D-Phe, Pro, Trp, D-Trp[-39], Asn, Gln, Tyr, Val, Orn, Leu | 0.37
    Tyrocidine A | Tyrocidine A | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
    Tyrocidine A | Tyrocidine B[-39] | D-Phe, Pro, Trp[-39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
    Compound 879 | Neoviridogrisein | (Thr+Hpa), NMe-Ph-Gly, Ala, NMe-bMe-Leu, NMe-Gly, D-4OH-Pro, D-Leu | 0.28
    H8405 | Beauverolide Ka[-18] | C10:0-Me(4)-OH(3), Trp, Phe[-18], D-aIle | 0.27
    BQ123 | Halipeptin B[-20] | C10:0-Me(2.2.4)-OH(3.7), Ala, aMe-Cys[-20], NMe-OH-Ile, Ala | 0.26
    H3526 | hymenistatin I | Pro, Tyr, Val, Pro, Leu, Ile, Ile, Pro | 0.25
    H3526 | hymenamide G | Pro, Tyr, Val, Pro, Leu, Ile, Leu, Pro | 0.25
    Cyanopeptide X | Majusculamide C[-30] | Map, Ala, Ibu, NMe-OMe-Tyr[-30], NMe-Val, Gly, NMe-Ile, Gly, Hmp | 0.23
    Cyanopeptide X | Dolastatin 11[-30] | Gly, NMe-Val, NMe-OMe-Tyr[-30], Ibu, Ala, Map, Hmp, Gly, NMe-Leu | 0.23
    Microcystin LR | Microcystin LR | D-Ala, Leu, D-bMe-Asp, Arg, Adda, D-Glu, NMe-Dha | 0.20
    Microcystin LR | [Dha7]microcystin-LR[+14] | D-Ala[+14], Leu, D-bMe-Asp, Arg, Adda, D-Glu, dh-Ala | 0.20
    Microcystin LR | Microcystin LAib[+71] | D-Ala, Leu, D-bMe-Asp[+71], Aib, Adda, D-Glu, NMe-Dha | 0.19
    ==========
    Seglitide | Microsclerodermin F[-3] | C12:3(7.9.11)-Me(6)-OH(2.4.5)-NH2(3)-Ph(12), Pyr[-3], NMe-Gly, D-Trp, Gly, OH-4Abu | 0.13
    Cyclomarin C | Aureobasin C[-60] | D-Hmp, NMe-Val, Phe, NMe-Phe, Pro, Val, NMe-Val, Leu, bOH-NMe-Val[-60] | 0.13
    Cyclomarin A | Aureobasidin F[-44] | D-Hmp, NMe-Val[-44], Phe, NMe-Phe, Pro, aIle, Val, Leu, bOH-NMe-Val | 0.12
    Dehydrocyclomarin A | Hymenamide J[-74] | Pro, Tyr, Asp, Phe, Trp[-74], Lys, Val, Tyr | 0.12
    Dehydrocyclomarin C | PF1022E[+44] | D-Lac, NMe-Leu, 4OH-D-Ph-Lac, NMe-Leu, D-Lac, NMe-Leu[+44], D-Ph-Lac, NMe-Leu | 0.11

    Table 1: NRP-Dereplication results. The Score is defined as the product of the fraction of explained intensity and the fraction of explained fragment masses of a dereplicated peptide. Dereplicated matches have monomers (shown in red) where the candidate mutation is placed with the integer mass of the offset enclosed in square brackets (Dereplicated Compounds column). See Table A-3 for the complete list of monomers. Compounds that are in the database (tyrocidine A, B, C, H3526, microcystin LR and compound 879) or have a closely related compound (tyrocidines A1, B1, C1, cyanopeptide X, destruxin A) have higher scores than compounds that are not in the database (seglitide, cyclomarin A, C and dehydrocyclomarin A, C). Dereplicated compounds have the mass difference of the experimental spectrum and the mass of the peptide enclosed in square brackets next to their name (Top Matches column). The compounds are sorted by score and the double horizontal line separates compounds in the database (or have a close match) from the compounds that are not in the database (lower part of the table). Compounds H8405 and BQ123 (representing the shortest peptides in the sample) returned incorrect matches (false positives). However, a close examination of the results revealed that these false positives are nevertheless correlated with the correct peptide sequences. For H8405, the correct sequence is [113, 71, 129, 186, 113], while the database match is [184, 186, 129, 113]. For BQ123, the correct masses are [113, 186, 115, 97, 99], while the database match is [71, 228, 71, 97, 143].
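The Score definition in the caption can be written out as a small function. The caption does not say how a peak is matched to a fragment mass, so the absolute tolerance used here is an assumption, as are the function and parameter names.

```python
def dereplication_score(peaks, theoretical_masses, tolerance=0.5):
    """Product of explained-intensity fraction and explained-fragment-mass fraction.

    peaks: list of (mass, intensity) pairs from the experimental spectrum.
    theoretical_masses: fragment masses predicted for the candidate peptide.
    tolerance: assumed absolute matching window in Da (not given in the source).
    """
    def close(a, b):
        return abs(a - b) <= tolerance

    total_intensity = sum(intensity for _, intensity in peaks)
    if total_intensity == 0 or not theoretical_masses:
        return 0.0
    explained_intensity = sum(
        intensity for mass, intensity in peaks
        if any(close(mass, t) for t in theoretical_masses)
    )
    explained_masses = sum(
        1 for t in theoretical_masses if any(close(mass, t) for mass, _ in peaks)
    )
    return (explained_intensity / total_intensity) * (explained_masses / len(theoretical_masses))


example_peaks = [(100.0, 10.0), (200.2, 30.0), (250.0, 60.0)]
example_score = dereplication_score(example_peaks, [100, 200, 300, 400])
```

In the example, 40 of 100 intensity units and 2 of 4 predicted masses are explained, so the score is 0.4 × 0.5.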


  • NRP-Dereplication

    • Compound 879 was thought to be novel, but the compound neoviridogrisein was in NORINE*

    • Cyanopeptide X was unknown in 2007, but majusculamide C was in NORINE*. The compound was desmethoxymajusculamide C

    *Caboche et al, 2008

  • Cyclic Peptide Identification Problem (De novo reconstruction)

    • Input: MS3 spectrum of a cyclic peptide
    • Output: A ranked list of peptide reconstructions sorted by a scoring function

    Similar to the Partial Digest Problem described by Skiena et al 1990. Shown to be NP-Hard for noisy inputs (Cielebak et al 2005)

    Similar to the problem of sequencing linear peptides with internal fragments. Shown to be NP-Hard (Xu and Ma 2006)

  • Tag Generation Problem (NRP-Tagging)

    • Input: MS3 spectrum of a cyclic peptide
    • Output: A ranked list of gapped sequences that explains the MS3 spectrum, sorted by a scoring function

    99, 114, 113, 147, 97, 147, 147, 114, 128, 163

    99, 114, [113+147], [97+147], 147, 114, 128, 163

    99, 114, 260, 244, 147, 114, 128, 163
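The three lines above show a full sequence, the same sequence with two pairs of neighbours merged, and the resulting gapped sequence. A minimal check of that relationship, assuming integer masses and a fixed starting residue (the helper name is hypothetical):

```python
def is_gapped_version(gapped, full):
    """True if `gapped` is obtained from `full` by merging runs of adjacent masses."""
    position = 0
    for gap_mass in gapped:
        running = 0
        # Consume residues until the running sum reaches (or overshoots) the gapped mass.
        while position < len(full) and running < gap_mass:
            running += full[position]
            position += 1
        if running != gap_mass:
            return False
    return position == len(full)


full_sequence = [99, 114, 113, 147, 97, 147, 147, 114, 128, 163]
gapped_sequence = [99, 114, 260, 244, 147, 114, 128, 163]
consistent = is_gapped_version(gapped_sequence, full_sequence)
```

A gapped sequence has the same total mass as the full one but fewer entries, so it is an easier target that can be refined later by splitting the composite masses.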

  • NRP-Tagging

  • A B C D E

    A B C DF

    A B C DF

    F

    E

    E

    NRP-Tagging

  • A B C D E

    A B C DF

    A B C DF

    F

    E

    E

    NRP-Tagging

  • Tag Generation

    A B C D E

    A B C DF

    A B C DF

    F

    E

    F

    E

    E

    NRP-Tagging

  • A B C D E

    A B C DF

    A B C DF

    F

    E

    F

    E

    E

    A B C D E F

    NRP-Tagging

  • bins = []
    For each peak Pi
      For each peak Pj (i < j)
        peak_diff = Pj - Pi
        bins[peak_diff]++

    Input: A mass spectrum

    Output: A histogram of mass difference counts for a range of masses

    Pevzner et al 2001

    Single Self-Convolution
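A runnable sketch of the pseudocode on this slide, assuming integer peak masses so that differences can be used directly as histogram keys (real spectra would need binning to a mass tolerance); the function name is illustrative.

```python
from collections import Counter


def self_convolution(peaks):
    """Histogram of pairwise mass differences Pj - Pi over all peak pairs with i < j."""
    ordered = sorted(peaks)
    bins = Counter()
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            bins[ordered[j] - ordered[i]] += 1
    return bins


# Four peaks spaced by one glycine residue (57 Da) each.
ladder = self_convolution([0, 57, 114, 171])
```

Differences that occur often are candidate amino acid masses, since the same residue separates many pairs of fragment peaks.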

  • • Input: A mass spectrum
    • Output: A histogram of counts of 2 consecutive mass differences for a range of masses

    bins = []
    For each peak Pi
      For each peak Pj (i < j)
        For each peak Pk (j < k)
          peak_diff_1 = Pj - Pi
          peak_diff_2 = Pk - Pj
          bins[peak_diff_1, peak_diff_2]++

    Double Self-Convolution
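The triple loop can be sketched in Python as below. Storing the starting peak of every triplet, rather than only a count, anticipates the tag generation step; integer masses and the function name are assumptions of the sketch.

```python
from collections import defaultdict


def double_convolution(peaks):
    """Map each pair of consecutive differences (d1, d2) to the starting peaks of its triplets."""
    ordered = sorted(peaks)
    starts = defaultdict(list)
    n = len(ordered)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                key = (ordered[j] - ordered[i], ordered[k] - ordered[j])
                starts[key].append(ordered[i])
    return starts


# The two-residue tag (99, 114) occurs twice: starting at mass 0 and at mass 100.
tag_starts = double_convolution([0, 99, 213, 100, 199, 313])
```

The histogram value of the slide is simply `len(tag_starts[(d1, d2)])`. The loop is cubic in the number of peaks, which is acceptable for spectra with a few hundred peaks.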

  • • Double self-convolution, keeping track of the starting peak of each peak triplet

    A B C D E

    A B C DF

    A B C DF

    F

    E

    F

    E

    E

    bins[B, C] = 3

    NRP-Tagging

  • bins = double_convolution(S)
    for m_a, m_b in bins
      starts = starting positions of bin[m_a, m_b]
      for all combinations such that it is a subset of starts
        m_1 = c_1
        m_i = c_i - c_j (j = i - 1)
        r = parent - c_n - m_a - m_b
        tag = [m_1, ... m_n, m_a, m_b, r]
        score(tag), store(tag)

    A B C DE F

    NRP-Tagging

  • bins = double_convolution(S)
    for m_a, m_b in bins
      starts = starting positions of bin[m_a, m_b]
      for all combinations such that it is a subset of starts
        m_1 = c_1
        m_i = c_i - c_j (j = i - 1)
        r = parent - c_n - m_a - m_b
        tag = [m_1, ... m_n, m_a, m_b, r]
        score(tag), store(tag)

    [Diagram labels: masses m_1, m_2, m_3, m_a, m_b, r; positions c_1, c_2, c_3, parent]

    NRP-Tagging
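The inner loop of the pseudocode can be sketched for a single tag (m_a, m_b). The 57 Da minimum spacing between chosen positions is borrowed from the algorithm figure later in the deck; scoring is left out, and the function and constant names are illustrative.

```python
from itertools import combinations

MIN_RESIDUE = 57  # Da; smallest amino acid (Gly), used as the minimum allowed gap


def gapped_peptides(starts, m_a, m_b, parent_mass):
    """Build [m_1, ..., m_n, m_a, m_b, r] for every admissible subset of tag start positions."""
    limit = parent_mass - m_a - m_b
    usable = sorted(s for s in set(starts) if s >= MIN_RESIDUE and limit - s >= MIN_RESIDUE)
    results = []
    for size in range(1, len(usable) + 1):
        for combo in combinations(usable, size):
            # m_1 = c_1 and m_i = c_i - c_(i-1)
            gaps = [combo[0]] + [b - a for a, b in zip(combo, combo[1:])]
            if all(gap >= MIN_RESIDUE for gap in gaps):
                results.append(gaps + [m_a, m_b, limit - combo[-1]])
    return results


tags = gapped_peptides(starts=[80, 150], m_a=100, m_b=120, parent_mass=500)
```

Every generated gapped peptide sums to the parent mass by construction. The number of subsets grows exponentially with the number of start positions, so a practical version would cap the subset size or prune by score.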

  • A B C D E

    A B C D

    A B CD

    E

    E

    A B1 CD E B2

    Gap Closing

  • Input: MS3 spectrum S of an (unknown) cyclic peptide, a minimum tag frequency, a recursion depth, and a scoring function score(S, peptide).
    Output: Ranked list of candidate gapped peptides

    1. Find all tags in S:

    tags(x, y) = {} for all 0 ≺ x, y ≺ 200
    for all s, s′, s′′ ∈ S such that s ≺ s′ ≺ s′′ do
      mass1 = s′ − s
      mass2 = s′′ − s′
      add s to tags(mass1, mass2)
    end for

    2. Generate gapped peptides from frequent tags:

    gappedPeptides = {}
    for all mass1, mass2 with |tags(mass1, mass2)| > frequency do
      for all {0 ≺ s_1 ≺ . . . ≺ s_n ≺ mass(S) − mass1 − mass2} ⊆ tags(mass1, mass2) do
        gappedPeptide = [m_1, ..., m_n, mass1, mass2, m_{n+1}] where m_i = s_i − s_{i−1} for 2 ≤ i ≤ n, m_1 = s_1 and m_{n+1} = mass(S) − mass1 − mass2 − s_n
        Add gappedPeptide to gappedPeptides
      end for
    end for

    3. Iteratively attempt to split masses larger than 200 Da:

    results = depth top-scoring peptides from gappedPeptides
    candidates = results
    repeat
      sequences = {}
      for all gappedPeptide in candidates do
        intermediates = {}
        for all mass > 200 Da in gappedPeptide do
          for all mass1 such that 0 ≺ mass1 ≺ 200 Da do
            split mass in gappedPeptide into (mass1, mass − mass1) and add the resulting peptide to intermediates
          end for
        end for
        add depth top-scoring peptides from intermediates to sequences
      end for
      candidates = sequences
      Add sequences to results
    until sequences is empty
    return results

    Figure A-3: NRP-Tagging algorithm. tags(mass1, mass2) contains the starting positions of all tags formed by amino acids with masses mass1 and mass2. The notation |tags(mass1, mass2)| refers to the number of locations of a 2-amino acid tag with masses (mass1, mass2). The notation x ≺ y denotes that y − x ≥ 57 (57 Da represents the mass of the smallest amino acid Gly). For a given set of starting positions in tags(mass1, mass2), all possible combinations ({s_1 ≺ . . . ≺ s_n} ⊆ tags(mass1, mass2)) of starting positions of tags are considered during the gapped peptide reconstruction. The precursor mass of S is denoted as mass(S). While the pseudocode above attempts to split each mass > 200 Da into all possible pairs (mass1, mass − mass1) with 0 ≺ mass1 ≺ 200, the real implementation only considers mass1 as a splitting mass if it is supported by some peaks in S. There are 2 threshold parameters, frequency (minimum number of occurrences of a tag in S), and depth (limits the number of high scoring gapped peptides per iteration of the mass splitting). The scoring function score(S, peptide) is used to rank the intermediate peptides and select those for the next iteration.


    NRP-Tagging

  • Compound | Best reconstruction | Rank
    Tyrocidine A | 99, 114, 113, 147, 97, 147, 147, 114, 128, 163 | 3
    Tyrocidine A1 | 99, 128, 113, 147, 97, 147, 147, 114, 128, 163 | 16
    Tyrocidine B | 99, 114, 113, 147, 97, 186, 147, 114, 128, 163 | 4
    Tyrocidine B1 | 99, 128, 113, 147, 97, 186, 147, 114, 128, 163 | 1
    Tyrocidine C | 99, 114, 113, 147, 97, 186, 186, 114, 128, 163 | 4
    Tyrocidine C1 | 99, 128, 113, 147, 97, 186, 186, 114, 128, 163 | 1
    Seglitide | 85, 163, 186, 128, 99, 147 | 1
    Cyanopeptide X | 57, 113, 161, 141, 71, 113, [114+57], 127 | 1
    BQ123 | 113, 186, 115, 97, 99 | 2
    Destruxin A | 113, 113, 85, 71, [98+97] | 2
    H3526 | 97, 97, 163, 99, {97+1}, 113, {113-1}, 113 | 10
    H8405 | 129, 71, 113, 113, 186 | 2
    Microcystin LR | {[83+71]+1}, {113-1}, {129-1}, {156+1}, 313, 129 | 27
    Compound 879 | 113, 113, , {147+18}, 71, 141, 71 | 7
    Cyclomarin A | 127, 139, , 143, 71, [177+99] | 10
    Dehydrocyclomarin A | 127, 139, 268, 143, 71, 177, 99 | 27
    Cyclomarin C | 127, 139, 270, {143+32}, {[71+177]-32}, 99 | >40
    Dehydrocyclomarin C | Not generated | -

    Table 2: NRP-Tagging results. The reconstructed NRPs are represented as sequences of masses. For the sake of brevity, masses are rounded to integers; e.g., the NRP-Tagging reconstruction for Tyrocidine A is 99.06, 114.07, 113.07, 147.06, 97.05, 147.05, 147.05, 114.06, 128.03, 163.06, which is more accurate than the integer representation given in the first row of the table. Composite masses (2 or more amino acids) are enclosed in square brackets. For example, [114+57] in cyanopeptide X means that NRP-Tagging returned 171 as the mass of an amino acid instead of the correct masses 114 and 57 (Hmp and Gly). Incorrect masses are enclosed in curly brackets and expressed in terms of their offsets from the correct masses. For example, {97+1} in H3526 means that NRP-Tagging returned 98 while the correct mass is 97 (Pro). In this case the isotopic peak (rather than a b-ion) was chosen as the best spectral interpretation. Lastly, cases in which the algorithm splits a mass are enclosed in angle brackets, with the correct mass followed by the masses returned by the algorithm. A single mass 286 in cyclomarin A is split as 129, 157. A single mass 222-18 (water loss) in compound 879 is split into 100 and 104. The reconstructions given in the table represent a complete reconstruction of the compound, or a reconstruction with composite masses and/or masses with a known offset. The “Best reconstruction” column presents the high-scoring peptide, with its rank given in the “Rank” column, that is selected from the list of all top-scoring peptides as the most similar to the correct peptide.
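    The square- and curly-bracket notation of the table can be evaluated mechanically. The sketch below is an illustration only: `reported_mass` and `row_total` are hypothetical helper names, masses are the rounded integers of the table, and the angle-bracket notation for split masses is not handled.

```python
import re


def reported_mass(entry: str) -> int:
    """Mass actually returned by the algorithm for one table entry.

    "[114+57]" is a composite mass (171); "{97+1}" is a correct mass plus
    an offset (98); the two can be nested, as in "{[83+71]+1}".
    """
    expr = re.sub(r"[\[\]{}\s]", "", entry)  # drop brackets and spaces
    if not re.fullmatch(r"\d+([+-]\d+)*", expr):
        raise ValueError(f"unsupported entry: {entry!r}")
    return sum(int(term) for term in re.findall(r"[+-]?\d+", expr))


def row_total(row: str) -> int:
    """Sum of the reported masses in a comma-separated reconstruction."""
    return sum(reported_mass(entry) for entry in row.split(","))


print(reported_mass("[114+57]"), reported_mass("{97+1}"))
print(row_total("99, 114, 113, 147, 97, 147, 147, 114, 128, 163"))
```

    Comparing `row_total` with the rounded precursor mass is one quick way to spot a transcription error in a row.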


    NRP-Tagging Results

  • • De novo sequencing of cyclic peptide spectra using self-alignment

    NRP-Sequencing

  • [Figure: seglitide, residues A+14, Y, W, K, V, F]

    6 linear theoretical spectra of seglitide

  • [Figure: seglitide, residues A+14, Y, W, K, V, F]

    Prefixes are horizontal lines. Suffixes are vertical lines.

  • [Figure: seglitide, residues A+14, Y, W, K, V, F]

    Theoretical spectrum without annotations

  • [Figure: seglitide, residues A+14, Y, W, K, V, F]

    YWKVF

    Offset: 85

  • De novo sequence (anti-symmetric path: Chen et al., 2001)

  • • Self-alignment of spectrum using the highest scoring self-convolution value

    • Use standard de novo reconstruction algorithms for linear peptide sequencing

    • Rescore candidate reconstructions using MSn data

    NRP-Sequencing
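    The self-alignment step can be illustrated on the theoretical spectrum of seglitide. The sketch below is a simplified illustration, not the NRP-Sequencing implementation. It makes several assumptions: integer masses, exact matching, a theoretical spectrum made of all contiguous fragments of the ring plus the full ring mass, Conv(S, x) taken as the number of peaks s for which s + x is also a peak, and Sx taken as the set of those peaks. A real spectrum would need a mass tolerance and intensity handling. The function names are hypothetical.

```python
from typing import List, Sequence, Set


def cyclic_spectrum(masses: Sequence[int]) -> Set[int]:
    """Masses of all contiguous fragments of a cyclic peptide, plus the full ring."""
    n = len(masses)
    doubled = list(masses) * 2  # lets fragments wrap around the ring
    spectrum = {sum(masses)}
    for start in range(n):
        for length in range(1, n):
            spectrum.add(sum(doubled[start:start + length]))
    return spectrum


def auto_convolution(spectrum: Set[int], x: int) -> int:
    """Assumed definition: number of peaks s such that s + x is also a peak."""
    return sum(1 for s in spectrum if s + x in spectrum)


def auto_alignment(spectrum: Set[int], x: int) -> List[int]:
    """Assumed definition of Sx: the peaks s such that s + x is also a peak."""
    return sorted(s for s in spectrum if s + x in spectrum)


# Seglitide residue masses in the order A+14, Y, W, K, V, F.
seglitide = [85, 163, 186, 128, 99, 147]
S = cyclic_spectrum(seglitide)
S85 = auto_alignment(S, 85)

# Prefix and suffix masses of the linear peptide YWKVF.
prefixes = [163, 349, 477, 576, 723]
suffixes = [147, 246, 374, 560, 723]
print(all(m in S85 for m in prefixes + suffixes))
print(auto_convolution(S, 85))
```

    Under these assumptions, every prefix of YWKVF stays a fragment when A+14 is prepended, and every suffix stays a fragment when A+14 is appended around the ring, which is why both series survive in the spectrum aligned at offset 85.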

  • [Histogram: Count (y-axis) versus Scores (x-axis)]

    Figure 4: NRP-Dereplication score distribution (search of compound 879 against NORINE) features excellent separation between correct (score 0.28) and false (scores below 0.05) hits.

    Compound | Best reconstruction | Rank
    Tyrocidine A | [163+99], 114, [113+147], [147+147], 147, [114+128] | 1
    Tyrocidine A1 | [163+99], 128, [113+147], [147+147], 147, [114+128] | 1
    Tyrocidine B | [163+99], 114, [113+147], 97, [186+147], 114, 128 | 14
    Tyrocidine B1 | 99, 128, [113+147], [97+186], 147, [114+128] | 1
    Tyrocidine C | 113, 147, 97, 186, 186, 114, [128+163], [99+114] | 125
    Tyrocidine C1 | [163+99], [128+113], 147, [97+186], 186, [114+128] | 1
    Seglitide | 85, [163+186], 128, 99, 147 | 1
    Cyanopeptide X | 57, 113, 161, 141, 71, [113+114+57], 127 | 1
    BQ123 | 113, 186, 115, [97+99] | 1
    H3526 | 97, [97+163], 99, [97+113], 113, 113 | 2
    H8405 | 129, 71, 113, 113, 186 | 1

    Table 2: NRP-Sequencing results. The reconstructed NRPs are represented as sequences of masses. For the sake of brevity, masses are rounded to integers. Composite masses (2 or more amino acids) are enclosed in square brackets. For example, [163+99] in tyrocidine A means that NRP-Sequencing returned 262, the composite mass of 163 and 99 (Tyr and Val). The best reconstruction is the highest-scoring completely correct (i.e., with no incorrect b-ions) de novo sequence returned by NRP-Sequencing.

    masses. For the experimental spectrum of seglitide, the auto-alignment spectrum S85 contains all prefix and suffix (b/y) ions for the peptide YWKVF (x = 85 corresponds to the most prominent peak in the auto-convolution Conv(S, x)).

    • De novo peptide sequencing. We solve the de novo peptide sequencing problem for the auto-alignment spectrum using the anti-symmetric path algorithm [4]. NRP-Sequencing generates all de novo peptide reconstructions of Sx (for each of the top t auto-convolution masses x, where t is a parameter) with scores above p · Score(P), where p is a parameter and P is the highest-scoring de novo reconstruction of Sx. We observed that t = 2 works well in most cases.

    • Re-ranking candidate peptides using MSn spectra. NRP-Sequencing further scores each candidate peptide by matching all MSn spectra against it and re-ranking candidate peptides according to their matches to the MSn spectra. Peaks in de novo reconstructions were scored against MSn spectra using a likelihood scoring scheme as described in [5]. De novo sequences derived from TOF MS3 spectra were also cyclized and scored against the MS3 spectrum; MS3/MSn match scores and matched peak intensities were combined using linear discriminant analysis.

    The pseudocode for NRP-Sequencing is presented in Figure 6. Results of NRP-Sequencing are in Table 2.
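    The candidate-selection rule from the de novo step (top t offsets, then every reconstruction scoring at least p · Score(P)) can be sketched as below. The data layout is an assumption made for illustration: `conv` maps each offset x to its auto-convolution value and `by_offset` maps each offset to a list of (peptide, score) pairs produced by some de novo routine. The function name and the default p = 0.9 are hypothetical; only t = 2 comes from the text. The comparison uses >= so that the best reconstruction is always kept.

```python
from typing import Dict, Hashable, List, Tuple

Scored = Tuple[Hashable, float]  # (peptide, de novo score)


def select_reconstructions(conv: Dict[int, int],
                           by_offset: Dict[int, List[Scored]],
                           t: int = 2,
                           p: float = 0.9) -> List[Tuple[int, Hashable, float]]:
    """Keep reconstructions scoring >= p * best, for the top t offsets."""
    # Offsets ranked by their auto-convolution value, highest first.
    top_offsets = sorted(conv, key=conv.get, reverse=True)[:t]
    selected = []
    for x in top_offsets:
        reconstructions = by_offset.get(x, [])
        if not reconstructions:
            continue
        best = max(score for _, score in reconstructions)  # Score(P)
        selected.extend((x, peptide, score)
                        for peptide, score in reconstructions
                        if score >= p * best)
    return selected
```

    The selected peptides would then be passed to the MSn re-ranking step described above.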


    NRP-Sequencing Results

  • Conclusions: De novo Reconstructions

  • Conclusions: A de novo Reconstruction

  • Conclusions: Combining Reconstructions

  • Acknowledgments

    • Computer Science Department, UCSD: Nuno Bandeira and Pavel Pevzner

    • Department of Chemistry and Biochemistry, UCSD: Wei-Ting Liu, Dario Meluzzi, Majid Ghassemian and Pieter Dorrestein

    • Scripps Institution of Oceanography, UCSD: Marcelino Gutierrez, Thomas Simmons, Andrew Schultz, Bradley Moore, William Gerwick, William Fenical and Katherine Maloney.

    • Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD: Bradley Moore, William Gerwick and Pieter Dorrestein.

    • Department of Chemistry, UCSC: Roger Linington

    • Computer Science Laboratory of Lille, USTL: Gregory Kucherov and the NORINE team

  • Demo

    • http://lol.ucsd.edu/ms-cpa_v1/Input.py (annotation only)

    • http://rofl.ucsd.edu/nrp (annotation and identification)

    • http://lmao.ucsd.edu/nrp (alpha site)