Discovery of functional protein linear motifs using a greedy algorithm and information theory
Upload: asociacion-argentina-de-bioinformatica-y-biologia-computacional
Post on 17-Jul-2015
DISCOVERY OF FUNCTIONAL PROTEIN LINEAR MOTIFS USING A GREEDY ALGORITHM AND INFORMATION THEORY
LEANDRO G. RADUSKY§, JULIANA GLAVINA§, MARIA FATIMA LADELFA¶, MARTIN MONTE¶ AND IGNACIO E. SANCHEZ§
§PROTEIN PHYSIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA
¶MOLECULAR AND CELL BIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA
CONCLUDING REMARKS
• We have implemented an algorithm for the discovery of novel functional protein motifs within sets of unaligned sequences.
• The algorithm shows good performance in the recovery of known motifs.
• We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.
INTRODUCTION
The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is likely that recognition of as yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear motifs from protein-protein interaction datasets.
VALIDATION: SEARCH FOR KNOWN MOTIFS
We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6].
ALGORITHM
CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS
The MAGE (melanoma-associated antigen) family of proteins are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.
Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.
REFERENCES
[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 2005, 3:e405.
[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 1989, 86:1183-1187.
[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18:6097-6100.
[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010, 38:D167-D180.
[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.
[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep. 2009, 10:231-238.
[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010, 38:7388-7399.
1. DATASET
The algorithm takes as input the sequences of all the protein targets bound by the protein under study.
The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins.
The user also determines the length of the putative linear motif to be looked for, e.g., ten residues.
[Diagram: the protein under study physically interacts with several protein targets.]
2. INPUT FILTERS
1. The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.
2. Most functional linear motifs are located within disordered protein domains [1]. Disordered regions are identified using the VSL software [3] and kept for analysis.
3. MOTIF SEARCH
Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python.
It first calculates all possible alignments of two k-words in the dataset.
Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score.
We repeat this procedure until incorporation of new k-words does not increase the score of any alignment.
Last, we sort the alignments by their scores. The sorted list is the output of the search.
input
  Matrix M: sequences to be analyzed
  Integer L: motif length
output
  Matrix Res: all k-word alignments, sorted by score
Algorithm
{
  M' = ObtainAllKWords(M, L)
  Res = CreateAlignmentsOfTwoKWords(M')
  While (Res has changed) {
    CurrentKWords = ObtainAllKWords(M, L)
    For all alignments A in Res {
      AddBestKWord(A, CurrentKWords)
    }
  }
  SortByScore(Res)
  Print Res
}
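The greedy search described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the score is the information content of the alignment with an assumed small-sample correction e(n) = 19 / (2 ln 2 · n), and each alignment is assumed to take at most one k-word per input sequence, a restriction the poster does not state.

```python
import math
from collections import Counter
from itertools import combinations


def obtain_all_kwords(sequences, k):
    """Return (sequence index, k-word) for every window of length k."""
    return [(i, s[j:j + k])
            for i, s in enumerate(sequences)
            for j in range(len(s) - k + 1)]


def score(words, alphabet_size=20):
    """Information content (bits) of an ungapped alignment of k-words."""
    n = len(words)
    # Assumed form of the small-sample correction e(n).
    e_n = (alphabet_size - 1) / (2.0 * math.log(2) * n)
    total = 0.0
    for column in zip(*words):
        h = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        total += math.log2(alphabet_size) - h - e_n
    return total


def greedy_motif_search(sequences, k):
    kwords = obtain_all_kwords(sequences, k)
    # Seed with every pair of k-words taken from two different sequences.
    alignments = [[a, b] for a, b in combinations(kwords, 2) if a[0] != b[0]]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            used = {i for i, _ in alignment}
            words = [w for _, w in alignment]
            best, best_score = None, score(words)
            # Offer every k-word from a sequence not yet in the alignment.
            for cand in kwords:
                if cand[0] in used:
                    continue
                s = score(words + [cand[1]])
                if s > best_score:
                    best, best_score = cand, s
            if best is not None:
                alignment.append(best)
                changed = True
    scored = [(score([w for _, w in a]), [w for _, w in a]) for a in alignments]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored


toy = ["AAKRGDSPA", "TTRGDSLL", "MRGDSQQW", "PLRGDSEE"]
results = greedy_motif_search(toy, 4)
print(results[0])
```

On this toy input the top-ranked alignment collects the shared word RGDS from all four sequences. Several seeds converge to the same alignment, so a practical version would also deduplicate the output.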
RESULTS
5. OUTPUT
We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We then group the alignments whose similarity is above the desired value of R. Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.
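One way to compute this similarity and group the alignments is sketched below. The names are hypothetical, all alignments are assumed to have the same length, and the grouping rule (compare each alignment with the first member of each existing group) is an assumption, since the poster does not specify the clustering procedure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(alignment):
    """Flatten per-position amino acid frequencies into one vector."""
    n = len(alignment)
    vec = []
    for column in zip(*alignment):
        counts = Counter(column)
        vec.extend(counts.get(aa, 0) / n for aa in AMINO_ACIDS)
    return vec


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_alignments(alignments, r_min):
    """Put each alignment in the first group whose first member has R >= r_min."""
    groups = []
    for aln in alignments:
        vec = frequency_vector(aln)
        for group in groups:
            if pearson(vec, group[0][1]) >= r_min:
                group.append((aln, vec))
                break
        else:
            groups.append([(aln, vec)])
    return [[aln for aln, _ in group] for group in groups]


a = ["RGDS", "RGDT", "RGDS"]
b = ["RGDS", "RGDS", "RGDA"]
c = ["PQEW", "PAEW", "PQEF"]
groups = group_alignments([a, b, c], r_min=0.9)
print(len(groups))
```

If the alignments are passed in order of decreasing score, the first member of each group is its highest scoring representative.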
4. MOTIF SCORING
We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in each sequence alignment.
The uncertainty at position l of the alignment, where f(aa,l) is the frequency of amino acid aa at that position, is:

$$H(l) = -\sum_{aa} f(aa,l)\,\log_2 f(aa,l) \quad \text{(bits)}$$

The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number of sequences:

$$R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l)\,\log_2 f(aa,l) - e(n) \quad \text{(bits)}$$

The information content of an alignment is the sum over all positions:

$$R_{sequence} = \sum_{l} R_{sequence}(l) \quad \text{(bits)}$$
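The three quantities above can be computed per position as in the following sketch. The function names are hypothetical, and the correction is assumed to take the commonly used approximate form e(n) = (s - 1) / (2 ln 2 · n) with s = 20 amino acids; the poster does not say which form of e(n) was used.

```python
import math
from collections import Counter


def small_sample_correction(n, alphabet_size=20):
    """Assumed approximation: e(n) = (s - 1) / (2 ln 2 * n)."""
    return (alphabet_size - 1) / (2.0 * math.log(2) * n)


def position_uncertainty(column):
    """H(l) in bits for one alignment column (a string or sequence of residues)."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def information_content(alignment, alphabet_size=20):
    """Return ([R_sequence(l) for each position], R_sequence) in bits."""
    n = len(alignment)
    e_n = small_sample_correction(n, alphabet_size)
    per_position = [
        math.log2(alphabet_size) - position_uncertainty(column) - e_n
        for column in zip(*alignment)
    ]
    return per_position, sum(per_position)


per_pos, total = information_content(["RGDS", "RGDT", "RGDS", "RGDA"])
for l, r in enumerate(per_pos, start=1):
    print(f"position {l}: {r:.3f} bits")
print(f"alignment: {total:.3f} bits")
```

In this example the three fully conserved positions carry equal information and the variable fourth position carries less. With so few sequences the correction is large, which is why the score rewards alignments that keep growing while staying conserved.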
[Figure: transfected U2OS cells. Panels: truncated MAGE-B2-GFP, GFP-MAGE-B2, GFP-MAGE-A2. Green: GFP tag; blue: DAPI. Magnification 100x.]
ELM            | Motif                   | Dilimot   | Our method
14-3-3 type 1  | R(SFYW)xSxP             | RSxSxP    | Match (sequence logo)
Gamma-adaptin  | (DE)(DES)xFx(DE)(LVIMFD) | DDxFxxF   | Match (sequence logo)
Clathrin box   | L(ILM)x(ILMF)(DE)       | LIxLD     | Match (sequence logo)
Mannosylation  | WxxW                    | DGxW      | Match (sequence logo)
CtBP           | Px(DEN)L(VAST)          | DxPxDL    | Match (sequence logo)
Dynein         | (QR)xTQT                | KxTQT     | Match (sequence logo)
Integrin       | RGD                     | RxDV      | Partial match (sequence logo)
TRAF6          | PxE                     | PQE       | Partial match (sequence logo)
NR box         | LxLL                    | Not found | Not found
EH1            | Fx(IV)xx(IL)(ILM)       | FxIxNI    | Not found
HP1            | PxVx(LM)                | KVPxVxL   | Not found

Our algorithm captures the known motif in six cases (14-3-3 type 1, Gamma-adaptin, Clathrin box, Mannosylation, CtBP and Dynein), suggesting significant sequence specificity in positions marked as “x” in the consensus. There is a partial match with the known consensus in two cases (Integrin and TRAF6) and no match in three cases (NR box, EH1 and HP1). The performance is comparable to that of Dilimot [1], a similar program that describes motifs as consensus sequences.