
DISCOVERY OF FUNCTIONAL PROTEIN LINEAR MOTIFS USING A GREEDY ALGORITHM AND INFORMATION THEORY

LEANDRO G. RADUSKY§, JULIANA GLAVINA§, MARIA FATIMA LADELFA¶, MARTIN MONTE¶ AND IGNACIO E. SANCHEZ§

§PROTEIN PHYSIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

¶MOLECULAR AND CELL BIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

CONCLUDING REMARKS

• We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences.

• The algorithm shows good performance in the recovery of known motifs.

• We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.

INTRODUCTION

The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is likely that recognition of as yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear motifs from protein-protein interaction datasets.

VALIDATION: SEARCH FOR KNOWN MOTIFS

We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6].

ALGORITHM  

CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS

The MAGE (melanoma-associated antigen) proteins are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.

Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.

REFERENCES

[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3:e405.

[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.

[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.

[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 1989, 86:1183-1187.

[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18:6097-6100.

[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res 2010, 38:D167-D180.

[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.

[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep 2009, 10:231-238.

[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res 2010, 38:7388-7399.

1. DATASET

The algorithm takes as input the sequences of all the protein targets bound by the protein under study.

The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins.

The user also specifies the length of the putative linear motif to search for, e.g., ten residues.

[Diagram: the protein under study physically interacts with several protein targets.]

2. INPUT FILTERS

1. The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.

2. Most functional linear motifs are located within disordered protein regions [1]. Disordered regions are identified using the VSL software [3], and only these regions are kept for analysis.
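The second filter can be illustrated with a minimal sketch. It assumes that a disorder predictor has already produced one score per residue; the function name, the 0.5 threshold and the `min_length` parameter are hypothetical choices for illustration and are not taken from the poster or from the VSL software.

```python
from typing import List, Tuple


def disordered_segments(sequence: str, scores: List[float],
                        threshold: float = 0.5,
                        min_length: int = 10) -> List[Tuple[int, str]]:
    """Return (start, segment) for each run of residues with score >= threshold.

    Runs shorter than min_length are dropped, since they cannot contain a
    motif of that length. Threshold and min_length are assumptions.
    """
    if len(sequence) != len(scores):
        raise ValueError("one disorder score per residue is required")
    segments = []
    start = None  # start index of the current disordered run, if any
    for i, value in enumerate(scores):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, sequence[start:i]))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_length:
        segments.append((start, sequence[start:]))
    return segments
```

Setting `min_length` to the motif length chosen by the user would discard segments too short to contribute any k-word to the search.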

3. MOTIF SEARCH

Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python.

It first calculates all possible alignments of two k-words (subsequences of the chosen motif length) in the dataset.

Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score.

We repeat this procedure until incorporation of new k-words does not increase the score of any alignment.

Last, we sort the alignments by their scores. The sorted list is the output of the search.

input

Matrix M: sequences to be analyzed
Integer L: motif length

output

Matrix Res: all k-word alignments, sorted by score

Algorithm

{
    M' = ObtainAllKWords(M, L)
    Res = CreateAlignmentsOfTwoKWords(M')
    While Res has changed {
        CurrentKWords = ObtainAllKWords(M, L)
        For all alignments A in Res {
            AddBestKWord(A, CurrentKWords)
        }
    }
    SortByScore(Res)
    Print Res
}
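The pseudocode above can be turned into a small runnable sketch. This is an illustration under stated assumptions, not the authors' implementation: each alignment holds at most one k-word per sequence, the score is the information content with the approximate small-sample correction e(n) = 19 / (2 ln(2) n), and all function names are hypothetical. The exhaustive pair enumeration is only practical for small datasets.

```python
import math
from collections import Counter
from itertools import combinations


def k_words(sequences, k):
    """All (sequence index, k-word) pairs in the dataset."""
    return [(i, s[j:j + k]) for i, s in enumerate(sequences)
            for j in range(len(s) - k + 1)]


def score(words):
    """Information content in bits, summed over positions.

    Uses the approximate correction e(n) = 19 / (2 ln(2) n); this choice
    is an assumption of the sketch.
    """
    n = len(words)
    e_n = 19 / (2 * math.log(2) * n)
    total = 0.0
    for column in zip(*words):
        h = -sum(c / n * math.log2(c / n) for c in Counter(column).values())
        total += math.log2(20) - h - e_n
    return total


def greedy_search(sequences, k):
    """Grow every pairwise alignment greedily; return (score, words) sorted."""
    words = k_words(sequences, k)
    # Start from every pair of k-words taken from two different sequences.
    alignments = [[a, b] for a, b in combinations(words, 2) if a[0] != b[0]]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            used = {i for i, _ in alignment}
            members = [w for _, w in alignment]
            best, best_score = None, score(members)
            for candidate in words:
                if candidate[0] in used:
                    continue  # assumption: one k-word per sequence
                s = score(members + [candidate[1]])
                if s > best_score:
                    best, best_score = candidate, s
            if best is not None:
                alignment.append(best)
                changed = True
    scored = [(score([w for _, w in a]), [w for _, w in a]) for a in alignments]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Without the correction term, adding a k-word could never raise the score of an alignment of identical words, so the greedy step relies on e(n) shrinking as the alignment grows. Calling `greedy_search(["AAKRKRAA", "GGKRKRGG", "TTKRKRTT"], 4)` ranks the alignment of the three shared `KRKR` words first.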

RESULTS  

5. OUTPUT

We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We then group alignments whose similarity is above the desired value of R. Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.
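The similarity measure and the grouping step can be sketched as follows. The poster does not specify how the frequencies are arranged or how groups are formed, so this sketch assumes that each alignment is flattened into one vector of position-by-amino-acid frequencies and that an alignment joins the first group whose first member it matches with R at or above the cutoff. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(words):
    """Flattened position-by-amino-acid frequency matrix of an alignment."""
    n = len(words)
    vector = []
    for column in zip(*words):
        counts = Counter(column)
        vector.extend(counts[aa] / n for aa in AMINO_ACIDS)
    return vector


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (sd_x * sd_y)


def group_alignments(alignments, r_min):
    """Assign each alignment to the first group whose representative
    (its first member) correlates with it at R >= r_min."""
    groups = []
    for alignment in alignments:
        vector = frequency_vector(alignment)
        for group in groups:
            if pearson(vector, frequency_vector(group[0])) >= r_min:
                group.append(alignment)
                break
        else:
            groups.append([alignment])
    return groups
```

If the input list is already sorted by score, the representative of each group is its highest scoring alignment.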

4. MOTIF SCORING

We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in each sequence alignment.

The uncertainty at a position l of the alignment is:

$$H(l) = -\sum_{aa} f(aa,l)\,\log_2 f(aa,l) \quad \text{(bits)}$$

where f(aa,l) is the frequency of amino acid aa at position l.

The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number of sequences:

$$R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l)\,\log_2 f(aa,l) - e(n) \quad \text{(bits)}$$

The information content of an alignment is the sum over all positions:

$$R_{sequence} = \sum_{l} R_{sequence}(l) \quad \text{(bits)}$$
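As a worked illustration of these definitions, consider a hypothetical alignment of n = 8 sequences in which one position contains four Lys and four Arg residues. The poster does not state which form of e(n) it uses; the example assumes the common approximation e(n) ≈ (s - 1) / (2 ln 2 · n) with s = 20 amino acids.

```latex
\begin{align*}
H(l) &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit} \\
e(8) &\approx \frac{20 - 1}{2 \ln 2 \cdot 8} \approx 1.71 \text{ bits} \\
R_{sequence}(l) &= \log_2 20 - H(l) - e(8) \approx 4.32 - 1 - 1.71 = 1.61 \text{ bits}
\end{align*}
```

Under the same assumption, a fully conserved position in this alignment would contribute about 4.32 - 0 - 1.71 = 2.61 bits, and the score of the alignment is the sum of such contributions over all positions.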

[Figure: transfected U2OS cells. Panels: truncated MAGE-B2-GFP, GFP-MAGE-B2, GFP-MAGE-A2. Green: GFP tag, blue: DAPI. Magnification 100x.]

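Known motif captured by our method (motifs found by our method were shown as images, not reproduced here):

| ELM | Motif | Dilimot | Our method |
|---|---|---|---|
| 14-3-3 type 1 | R(SFYW)xSxP | RSxSxP | (image) |
| Gamma-adaptin | (DE)(DES)xFx(DE)(LVIMFD) | DDxFxxF | (image) |
| Clathrin box | L(ILM)x(ILMF)(DE) | LIxLD | (image) |
| Mannosylation | WxxW | DGxW | (image) |
| CtBP | Px(DEN)L(VAST) | DxPxDL | (image) |
| Dynein | (QR)xTQT | KxTQT | (image) |

Partial match with the known consensus:

| ELM | Motif | Dilimot | Our method |
|---|---|---|---|
| Integrin | RGD | RxDV | (image) |
| TRAF6 | PxE | PQE | (image) |

No match:

| ELM | Motif | Dilimot | Our method |
|---|---|---|---|
| NR box | LxLL | Not found | Not found |
| EH1 | Fx(IV)xx(IL)(ILM) | FxIxNI | Not found |
| HP1 | PxVx(LM) | KVPxVxL | Not found |

Our algorithm captures the known motif in six cases (first table), suggesting significant sequence specificity in positions marked as “x” in the consensus. There is a partial match with the known consensus in two cases (second table) and no match in three cases (third table). The performance is comparable to that of Dilimot [1], a similar program that describes motifs as consensus sequences.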
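The consensus notation used in this comparison can be checked against sequences with a short script. The sketch below assumes the reading of the notation suggested by the tables: "x" stands for any residue and a parenthesized group such as (SFYW) stands for any one of the listed residues. The function names are hypothetical and the example sequence is made up.

```python
import re


def consensus_to_regex(consensus):
    """Translate poster-style consensus notation into a compiled regex.

    'x' matches any residue; '(ABC)' matches any one of A, B or C.
    """
    pattern = []
    i = 0
    while i < len(consensus):
        ch = consensus[i]
        if ch == "(":
            end = consensus.index(")", i)  # raises ValueError if unbalanced
            pattern.append("[" + consensus[i + 1:end] + "]")
            i = end + 1
        elif ch == "x":
            pattern.append(".")
            i += 1
        else:
            pattern.append(re.escape(ch))
            i += 1
    return re.compile("".join(pattern))


def find_matches(consensus, sequence):
    """Return (start, matched segment) for every match, including overlaps."""
    regex = consensus_to_regex(consensus)
    # A lookahead with a capturing group reports overlapping matches too.
    lookahead = re.compile("(?=(" + regex.pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]
```

For instance, `find_matches("R(SFYW)xSxP", "AARSASAPGG")` reports the segment `RSASAP` starting at index 2.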