DISCOVERY OF FUNCTIONAL PROTEIN LINEAR MOTIFS USING A GREEDY ALGORITHM AND INFORMATION THEORY

LEANDRO G. RADUSKY§, JULIANA GLAVINA§, MARIA FATIMA LADELFA¶, MARTIN MONTE¶ AND IGNACIO E. SANCHEZ§

§PROTEIN PHYSIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

¶MOLECULAR AND CELL BIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

CONCLUDING REMARKS

•  We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences.

•  The algorithm shows good performance in the recovery of known motifs.

•  We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.

INTRODUCTION

The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is likely that recognition of yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear motifs from protein-protein interaction datasets.

VALIDATION: SEARCH FOR KNOWN MOTIFS

We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6].

ALGORITHM

CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS

Proteins of the MAGE (melanoma-associated antigen) family are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.

Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.

REFERENCES

[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 2005, 3:e405.
[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 1989, 86:1183-1187.
[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18:6097-6100.
[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010, 38:D167-180.
[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.
[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep. 2009, 10:231-238.
[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010, 38:7388-7399.

1. DATASET

The algorithm takes as input the sequences of all the protein targets bound by the protein under study.

The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins.

The user also determines the length of the putative linear motif to be looked for, e.g., ten residues.

[Diagram: the protein under study physically interacts with several protein targets.]

2. INPUT FILTERS

1.  The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.

2.  Most functional linear motifs are located within disordered protein domains [1]. Disordered regions are identified using the VSL software [3] and kept for analysis.
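The disorder filter can be illustrated with a minimal sketch. It assumes that a disorder predictor has already produced one boolean flag per residue; the function name `disordered_segments`, the mask format and the `min_length` cutoff are hypothetical and do not reflect the actual output format of VSL or the interface of our software.

```python
from typing import List, Tuple


def disordered_segments(sequence: str, is_disordered: List[bool],
                        min_length: int) -> List[Tuple[int, str]]:
    """Return (start, subsequence) for each disordered run of at least min_length residues."""
    if len(sequence) != len(is_disordered):
        raise ValueError("sequence and mask must have the same length")
    segments = []
    start = None
    # A trailing False sentinel closes a run that reaches the end of the sequence.
    for i, flag in enumerate(list(is_disordered) + [False]):
        if flag and start is None:
            start = i  # a disordered run begins here
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, sequence[start:i]))
            start = None
    return segments
```

Setting `min_length` to the motif length would discard disordered stretches that are too short to contain a motif; this choice is an assumption of the sketch.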

3. MOTIF SEARCH

Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python.

It first calculates all possible alignments of two k-words in the dataset, where a k-word is a sequence fragment whose length k is the motif length chosen by the user.

Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score.

We repeat this procedure until incorporation of new k-words does not increase the score of any alignment.

Last, we sort the alignments by their scores. The sorted list is the output of the search.

Input

    Matrix M: sequences to be analyzed
    Integer L: motif length (the length of each k-word)

Output

    Matrix Res: all k-word alignments, sorted by score

Algorithm

{
    M' = ObtainAllKWords(M, L)
    Res = CreateAlignmentsOfTwoKWords(M')
    While Res has changed
    {
        CurrentKWords = ObtainAllKWords(M, L)
        For all alignments A in Res
        {
            AddBestKWord(A, CurrentKWords)
        }
    }
    SortByScore(Res)
    Print Res
}
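The greedy procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not our actual implementation: the function names are hypothetical, a k-word may enter a given alignment only once, and the score is the information content with the approximate sampling correction e(n) = (s - 1) / (2 ln(2) n) for s = 20 amino acids, which is an assumed form that is rough for very small alignments.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET_SIZE = 20  # standard amino acids


def k_words(sequences, k):
    """All overlapping words of length k, tagged as (sequence index, offset, word)."""
    return [(s, i, seq[i:i + k])
            for s, seq in enumerate(sequences)
            for i in range(len(seq) - k + 1)]


def information_content(words):
    """Summed per-column information (bits) with an assumed approximate e(n)."""
    n = len(words)
    e_n = (ALPHABET_SIZE - 1) / (2 * math.log(2) * n)
    total = 0.0
    for column in zip(*words):
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        total += math.log2(ALPHABET_SIZE) - entropy - e_n
    return total


def score(alignment):
    return information_content([w[2] for w in alignment])


def greedy_motif_search(sequences, k):
    words = k_words(sequences, k)
    # Seed: every alignment of two distinct k-words.
    alignments = [[a, b] for a, b in combinations(words, 2)]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            best_word, best_score = None, score(alignment)
            # Offer every unused k-word and remember the best strict improvement.
            for word in words:
                if word in alignment:
                    continue
                candidate = score(alignment + [word])
                if candidate > best_score:
                    best_word, best_score = word, candidate
            if best_word is not None:
                alignment.append(best_word)
                changed = True
    return sorted(alignments, key=score, reverse=True)
```

The loop terminates because every accepted k-word strictly increases the score and each alignment can hold at most all k-words in the dataset. Seeding with all pairs is quadratic in the number of k-words, so this sketch is only practical for small inputs.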

RESULTS

5. OUTPUT

We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We then group the alignments whose similarity is above the desired value of R. Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.
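The comparison of two motifs can be sketched as below. The flattening of per-position frequencies into a single vector and the greedy grouping rule (an alignment joins the first group whose first member it resembles) are assumptions made for illustration; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(words):
    """Flatten per-position amino acid frequencies into one vector of length 20 * k."""
    n = len(words)
    return [column.count(aa) / n for column in zip(*words) for aa in AMINO_ACIDS]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def group_alignments(alignments, r_threshold):
    """Greedy grouping: join the first group whose first member has R >= threshold."""
    groups = []
    for words in alignments:
        vec = frequency_vector(words)
        for group in groups:
            if pearson(vec, frequency_vector(group[0])) >= r_threshold:
                group.append(words)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([words])
    return groups
```

If the alignments are passed in order of decreasing score, the first member of each group is its highest scoring alignment, which is a natural candidate for the sequence logo.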

4. MOTIF SCORING

We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in each sequence alignment.

The uncertainty at a position l of the alignment is:

$$H(l) = -\sum_{aa} f(aa,l) \log_2 f(aa,l) \quad \text{(bits)}$$

where f(aa,l) is the frequency of amino acid aa at position l.

The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number n of sequences:

$$R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l) \log_2 f(aa,l) - e(n) \quad \text{(bits)}$$

The information content of an alignment is the sum over all positions:

$$R_{sequence} = \sum_{l} R_{sequence}(l) \quad \text{(bits)}$$
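The three quantities above can be computed with a short Python sketch. The function names are hypothetical, and the form of the correction, e(n) = (s - 1) / (2 ln(2) n) with s = 20, is an assumed approximation; the exact correction used by our software is not specified here.

```python
import math
from collections import Counter

NUM_AMINO_ACIDS = 20


def uncertainty(column):
    """H(l): Shannon entropy (bits) of the residues observed at one position."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def sampling_correction(n):
    """Assumed approximate e(n) = (s - 1) / (2 ln(2) n), with s = 20."""
    return (NUM_AMINO_ACIDS - 1) / (2 * math.log(2) * n)


def position_information(column):
    """R_sequence(l) = log2(20) - H(l) - e(n)."""
    return (math.log2(NUM_AMINO_ACIDS) - uncertainty(column)
            - sampling_correction(len(column)))


def alignment_information(words):
    """R_sequence: sum of R_sequence(l) over all positions of the alignment."""
    return sum(position_information(column) for column in zip(*words))
```

For example, in the alignment `["RGD", "RGD", "RGE", "KGD"]` the fully conserved second position has zero uncertainty and therefore carries more information than the first or third position. The per-position values are the ones a sequence logo would display as stack heights.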

[Figure: fluorescence micrographs of transfected U2OS cells. Panels: truncated MAGE-B2-GFP, GFP-MAGE-B2, GFP-MAGE-A2. Green: GFP tag; blue: DAPI. Magnification 100x.]

Motif           ELM                         Dilimot     Our method
--------------  --------------------------  ----------  ----------------
Top panel
14-3-3 type 1   R(SFYW)xSxP                 RSxSxP      [sequence logo]
Gamma-adaptin   (DE)(DES)xFx(DE)(LVIMFD)    DDxFxxF     [sequence logo]
Clathrin box    L(ILM)x(ILMF)(DE)           LIxLD       [sequence logo]
Mannosylation   WxxW                        DGxW        [sequence logo]
CtBP            Px(DEN)L(VAST)              DxPxDL      [sequence logo]
Dynein          (QR)xTQT                    KxTQT       [sequence logo]
Bottom left panel
Integrin        RGD                         RxDV        [sequence logo]
TRAF6           PxE                         PQE         [sequence logo]
Bottom right panel
NR box          LxLL                        Not found   Not found
EH1             Fx(IV)xx(IL)(ILM)           FxIxNI      Not found
HP1             PxVx(LM)                    KVPxVxL     Not found

Our algorithm captures the known motif in six cases (top), suggesting significant sequence specificity in positions marked as “x” in the consensus. There is a partial match with the known consensus in two cases (bottom left) and no match in three cases (bottom right). The performance is comparable to that of Dilimot [1], a similar program that describes motifs as consensus sequences.
