TRILOGY: Discovery of sequence–structure patterns across diverse proteins

Upload: amit-goyal

Post on 04-Jun-2018


  • 8/13/2019 TRILOGY: Discovery of sequence–structure patterns across diverse proteins

    AMIT

    Main Focus

    TRILOGY is a new computer program for the automated discovery of sequence–structure patterns in proteins.

    TRILOGY implements a pattern discovery algorithm that begins with an exhaustive analysis of flexible three-residue patterns; a subset of these patterns is selected as seeds for an extension process in which longer patterns are identified.

    A key feature of the method is the explicit treatment of both the sequence and structure components of these motifs.

    TRILOGY identifies several thousand high-scoring patterns that occur across protein families.

    The sequence–structure patterns will be useful in predicting protein structure from sequence, annotating newly determined protein structures, and identifying novel motifs of functional or structural significance.

    Introduction

    In the last two decades, considerable effort has been invested in the search for patterns in sequence that correlate with patterns in structure.

    This search has traditionally proceeded by the identification of a structural motif, followed by alignment of sequences corresponding to the motif and calculation of amino acid frequencies at positions in this structural alignment.

    Amino acid sequence preferences associated with protein secondary structure, helix caps, and supersecondary motifs such as the coiled coil have been identified.

    One limitation of this approach is that the motifs in question must be specified in advance.

    Introduction

    The present research introduces an approach to pattern discovery that succeeds in identifying previously unknown sequence–structure patterns across diverse protein families.

    The algorithm is unsupervised in that the motifs are not specified in advance.

    A key feature of our approach is the explicit treatment of both the sequence and structure components of these motifs as independent patterns that are identified simultaneously during the search process.

    A statistical score is assigned by measuring the degree of correlation between the sequence and structure patterns.

    Each pattern is required to span at least three SCOP superfamilies; thus they cannot be easily explained by sequence similarity between the matched proteins.

    The high-scoring patterns represent known motifs of structural and functional significance (helix capping patterns; an NAD/FAD binding pattern), as well as potentially novel motifs.
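    The requirement that a pattern span at least three SCOP superfamilies can be illustrated with a small filter. This is only a sketch: the Match record, its field names, and the sample labels are hypothetical and are not taken from TRILOGY itself.

```python
from typing import Iterable, NamedTuple

MIN_SUPERFAMILIES = 3  # threshold stated in the text


class Match(NamedTuple):
    """One occurrence of a pattern (hypothetical record layout)."""
    domain_id: str     # assumed identifier of the matched domain
    superfamily: str   # assumed SCOP superfamily label of that domain


def spans_enough_superfamilies(matches: Iterable[Match],
                               minimum: int = MIN_SUPERFAMILIES) -> bool:
    """Return True if the matches cover at least `minimum` distinct superfamilies."""
    return len({m.superfamily for m in matches}) >= minimum


# Toy data: four matches, but only two distinct superfamilies.
toy_matches = [
    Match("dom1", "sf_A"),
    Match("dom2", "sf_A"),
    Match("dom3", "sf_B"),
    Match("dom4", "sf_B"),
]
keep_pattern = spans_enough_superfamilies(toy_matches)
```

    In this toy case the pattern would be discarded, because many matches within two superfamilies could still be explained by sequence similarity.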

    Keywords: Triple Patterns

    The basic pattern objects are termed triple patterns, which are sequence–structure patterns with three specified residues.

    A triple pattern P has two components: a structure pattern Pstr, specifying the relative three-dimensional arrangement and orientation of the three residues, and a sequence pattern Pseq that defines the sequence spacing and residue type of the three pattern residues.

    The matches to a triple pattern P in the structure set are the matches common to both Pstr and Pseq.
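    The definition of a match as something common to both Pstr and Pseq amounts to a set intersection. The sketch below assumes each component pattern has already been matched independently against the structure set; the Site tuple and the class layout are hypothetical illustrations, not TRILOGY's internal representation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# A site is (domain id, position 1, position 2, position 3); hypothetical encoding.
Site = Tuple[str, int, int, int]


@dataclass(frozen=True)
class TriplePattern:
    """A triple pattern P described only by the match sets of its two components."""
    seq_matches: FrozenSet[Site]  # sites matching the sequence pattern Pseq
    str_matches: FrozenSet[Site]  # sites matching the structure pattern Pstr

    def matches(self) -> FrozenSet[Site]:
        """Matches to P are the sites common to both Pseq and Pstr."""
        return self.seq_matches & self.str_matches


# Toy example with made-up sites.
pattern = TriplePattern(
    seq_matches=frozenset({("d1", 5, 8, 12), ("d2", 40, 43, 47), ("d3", 9, 12, 16)}),
    str_matches=frozenset({("d2", 40, 43, 47), ("d3", 9, 12, 16), ("d4", 2, 3, 30)}),
)
common = pattern.matches()
```

    Keeping the two match sets separate is what later makes it possible to ask how strongly the sequence and structure components are correlated.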

    Keywords: Longer Pattern

    Sequence–structure patterns with more than three residues are formed by gluing together several triple patterns.

    The total length of triple sequence patterns is limited by the range of allowed gap lengths.

    Keywords: Significance Score

    The significance score, or P-score, attached to a sequence–structure pattern P measures the degree of correlation between the individual sequence and structure patterns Pseq and Pstr.

    It equals the likelihood of seeing an equal or greater number of matches common to Pseq and Pstr.

    The P-score allows us to attach a significance to each pattern, even though a significance threshold for the P-scores must be evaluated in relation to the size of the pattern search space.
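    One way to read "the likelihood of seeing an equal or greater number of common matches" is as an upper-tail probability under a null model in which Pseq and Pstr match independently. The sketch below assumes a hypergeometric null over a fixed pool of candidate sites; that choice, the function name, and the parameter names are assumptions made for illustration, since the slides do not state the exact formula.

```python
from math import comb


def p_score(total: int, n_seq: int, n_str: int, n_common: int) -> float:
    """Upper-tail hypergeometric probability (assumed null model).

    total    -- number of candidate sites in the structure set
    n_seq    -- sites matching the sequence pattern Pseq
    n_str    -- sites matching the structure pattern Pstr
    n_common -- sites matching both
    Returns P(X >= n_common) when n_str sites are drawn at random from
    `total`, of which n_seq are "sequence matches".
    """
    upper = min(n_seq, n_str)
    if not (0 <= n_common <= upper and max(n_seq, n_str) <= total):
        raise ValueError("inconsistent match counts")
    favourable = sum(
        comb(n_seq, i) * comb(total - n_seq, n_str - i)
        for i in range(n_common, upper + 1)
    )
    return favourable / comb(total, n_str)


# Toy numbers: 10 candidate sites, 4 sequence matches, 3 structure matches, 2 shared.
example = p_score(total=10, n_seq=4, n_str=3, n_common=2)  # (36 + 4) / 120
```

    A small value means the overlap is larger than independence would suggest. As the text notes, any cutoff on such a score still has to account for how many patterns were examined.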

    Search Algorithm

    The search for significant sequence–structure patterns proceeds in three stages:

    1. In the first stage, all triplets of conserved residues that are nearby in three dimensions are extracted from the structure set. Three residues are considered to be nearby if the distance of closest approach between each pair of residues is less than a preset cutoff.

    2. In the second stage, significance scores are assigned to a large number of triple patterns, and a subset of these is chosen as seeds for the pattern extension process.

    3. In the final stage, significant patterns with more than three residues are sought by extending the triple patterns identified in the previous stage. The search proceeds by repeatedly adding a new residue to an existing pattern and terminates when no extension can be made without reducing the number of matched superfamilies to fewer than three.
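    The first stage can be sketched as a pairwise-distance filter followed by enumeration of triplets. This is a minimal illustration, not the TRILOGY implementation: the residue representation (a list of atom coordinates per residue number), the function names, and the 5.0 cutoff used in the toy call are assumptions, and the conservation filter mentioned in the text is omitted.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Sequence, Tuple

Atom = Tuple[float, float, float]


def closest_approach(a: Sequence[Atom], b: Sequence[Atom]) -> float:
    """Smallest atom-to-atom distance between two residues."""
    return min(dist(p, q) for p in a for q in b)


def nearby_triplets(residues: Dict[int, Sequence[Atom]],
                    cutoff: float) -> List[Tuple[int, int, int]]:
    """Return residue triplets in which every pair is closer than `cutoff`."""
    ids = sorted(residues)
    # Precompute which residue pairs are in contact.
    close = {
        frozenset((i, j))
        for i, j in combinations(ids, 2)
        if closest_approach(residues[i], residues[j]) < cutoff
    }
    # Keep a triplet only if all three of its pairs are in contact.
    return [
        t for t in combinations(ids, 3)
        if all(frozenset(pair) in close for pair in combinations(t, 2))
    ]


# Toy structure: residues 1-3 are mutually close, residue 4 is far away.
toy = {
    1: [(0.0, 0.0, 0.0)],
    2: [(3.0, 0.0, 0.0)],
    3: [(0.0, 3.0, 0.0)],
    4: [(20.0, 0.0, 0.0)],
}
triplets = nearby_triplets(toy, cutoff=5.0)
```

    Enumerating all triplets is cubic in the number of residues; the precomputed contact set keeps the inner check cheap, which is adequate for a sketch on single domains.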

    Structure Set

    The structure set analyzed in this work consisted of a single protein domain from each family of proteins, i.e. around 1557 families.

    Atomic coordinates for all domains were extracted from the corresponding Protein Data Bank files, while the sequence homologs were taken from the HSSP database.

    An alignment was constructed by trimming these homologs in the neighborhood of regions of low similarity to the representative protein.

    Results

    In general, short patterns identified by TRILOGY correspond to supersecondary structures (helix caps, β-turns, --units), whereas longer patterns represent motifs or possible evidence of distant evolutionary relatedness.

    Table 1 summarizes the results of TRILOGY searches for three values of the inter-residue contact distance threshold.

    Results

    In the following subsections, eight of the patterns are described in more detail.

    Supersecondary Structural Motifs: between unpaired β-strands)

    Functional Patterns: Pattern 2, 3 &

    Cysteine Patterns: Pattern 5

    Repeat Patterns: Pattern 6 (leucine

    Structural Similarities: Pattern 7 &

    Conclusion

    The TRILOGY algorithm successfully identifies known and novel motifs when applied to a representative set of protein domains.

    TRILOGY also identifies sequence–structure patterns for which a clear biological interpretation is not apparent; this allows the generation of new hypotheses regarding functional or structural significance that can then be tested experimentally.

    Central to the algorithm's success are the triple-pattern representation and a statistically well-founded significance score that highlights potentially interesting motifs.

    This view is supported on a small scale by the identification of several interesting matches in proteins from the recently solved crystal structure of the 30S ribosomal subunit.