Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium

Matthew Sylvester

12/1/03

Endocytic Trafficking

Lysosome

Salmonella-containing vacuole

Lamp-1, H+ ATPase, Cathepsins, Transferrin R, Man. 6-PR

bacterial effector proteins (SseJ, SifA, SseXs, and several others)

Selection of S. typhimurium Proteins

• Salmonella effectors are secreted into the host cell via either the Salmonella pathogenicity island 1 (SPI1) or SPI2 type three secretion system (TTSS)

• We chose only those proteins shown experimentally in the literature to go out through one or both of these systems (see PubMed at http://ncbi.nlm.nih.gov)

• The seventeen identified SPI1 and SPI2-associated effectors were considered as one group for subsequent analysis

• As the N-terminal 150 amino acids have been shown to contain conserved sequences for several SPI2 effectors, we compared this region (Miao and Miller, 2000)

Alignment of SPI-2 Effector Proteins

Miao E and Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. PNAS. 2000, 97(13). Pp. 7539-7544.

Published alignment of known and putative SPI2 effectors identified by a BLAST (Basic Local Alignment Search Tool) search and then aligned using ClustalW. Note the presence of the WEK(I/M)XXFF motif from approx. aa 31-38.

BLAST

• Tries to find the most “similar” proteins

• Compares a query to sequences in a database, and each comparison is given a score (higher scores indicate greater similarity)

• Scoring matrices (substitution-based) are used to assign a score based on the probability of each residue substitution

• Gap penalties are negative scores

• The alignment score is the sum of scores at each position

• Significance of the overall alignment is given as a p-value or an e-value

– e-value = expectation value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the e-value, the more significant the score.
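The "sum of scores at each position" idea can be sketched in a few lines. This is only an illustration of scoring an alignment that already exists, not BLAST's search procedure; the scores in `TOY_SCORES`, the flat `GAP_PENALTY`, and the default mismatch score are made-up assumptions rather than real matrix values.

```python
# Toy substitution scores for illustration only; these are NOT real BLOSUM values.
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("A", "S"): 1, ("D", "E"): 2,
}
GAP_PENALTY = -4       # hypothetical flat penalty per gapped position
MISMATCH_DEFAULT = -1  # hypothetical score for pairs missing from the toy table


def pair_score(a, b):
    """Score one aligned position; '-' marks a gap."""
    if a == "-" or b == "-":
        return GAP_PENALTY
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), MISMATCH_DEFAULT))


def alignment_score(seq1, seq2):
    """Sum the per-position scores of two already-aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# A/S = 1, D/D = 6, E/E = 5, gap = -4, A/A = 4
print(alignment_score("ADE-A", "SDEPA"))  # 12
```

Real searches also convert the raw score into an e-value using database-dependent statistics, which this sketch leaves out.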

Building Substitution Matrices: Part I

Henikoff S and Henikoff JG. Amino acid substitution matrices from protein blocks. PNAS. 1992, 89(22). Pp. 10915-10919.

Blocks: Local ungapped alignment with rows = protein segments and columns = amino acid position

1   A D E P Q D A
2   A C E P D D A
…   … … … … … … …
10  S D E P Q D A

New Sequence: A D E P Q R A

- Count the number of matches and mismatches between the new sequence and every other sequence in the block.

- We have 9 AA matches and 1 AS mismatch in pos. 1

Building Substitution Matrices: Part II

Next, sum the results of each column, store the results in a table, and add the new sequence to the group.

By successively adding new sequences, we get a table with all possible pairs.

If we have 9 A’s and 1 S in the first column, we get 1 + 2 + … + 8 = 36 possible AA pairs, 9 AS or SA pairs, and 0 SS pairs.

If w = width of the block (number of amino acid positions) and s = # sequences, we have w*s*(s-1)/2 total possible pairs. Here, we have 36+9=45 or 1*10*9/2=45.
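The pair counts for the example column can be checked with a short script. This is a minimal sketch: the function name `count_column_pairs` is made up for illustration, and it counts one column only, whereas the full procedure sums over every column of every block.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered amino acid pairs among all sequences in one block column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


column = "A" * 9 + "S"  # first column of the example: 9 A's and 1 S
pairs = count_column_pairs(column)

print(pairs[("A", "A")], pairs[("A", "S")], pairs[("S", "S")])  # 36 9 0

# Total agrees with w*s*(s-1)/2 for w = 1 column and s = 10 sequences.
s = len(column)
print(sum(pairs.values()), 1 * s * (s - 1) // 2)  # 45 45
```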

Calculating the Lod (log-odds) Matrix

• Let fij be the total number of amino acid pairs in the frequency table at position i,j (1<=j<=i<=20)

• Then the observed proportion for each amino acid pairing is:

$$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}}$$

• We have fAA=36 and fAS=9, so qAA=36/45 and qAS=9/45

Calculating the Lod Matrix II

• Now we need the expected probabilities of occurrence for each amino acid pair

• If we assume that the observed frequencies of each amino acid are the population frequencies, we have

$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$

• For our example, pA=36/45+(9/45)/2=0.9 and pS=(9/45)/2=0.1

• Then the expected probability (eij) of occurrence is pipj for i=j and pipj+pjpi=2pipj for i!=j

• We have expected probability of AA=0.9*0.9=0.81, AS=2*0.9*0.1=0.18, SS=0.1*0.1=0.01
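The observed and expected proportions for the running example can be reproduced as follows. This is a sketch under the slide's own assumptions; the function names are hypothetical, and pair keys are stored as sorted tuples so that AS and SA count as one pair.

```python
def observed_proportions(pair_counts):
    """q_ij: each pair count divided by the total number of pairs."""
    total = sum(pair_counts.values())
    return {pair: n / total for pair, n in pair_counts.items()}


def residue_frequencies(q):
    """p_i: q_ii plus half of every q_ij that involves residue i."""
    p = {}
    for (a, b), value in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + value
        else:
            p[a] = p.get(a, 0.0) + value / 2
            p[b] = p.get(b, 0.0) + value / 2
    return p


def expected_proportions(p):
    """e_ij: p_i*p_j for identical residues, 2*p_i*p_j otherwise."""
    residues = sorted(p)
    e = {}
    for i, a in enumerate(residues):
        for b in residues[i:]:
            e[(a, b)] = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return e


f = {("A", "A"): 36, ("A", "S"): 9}  # pair counts from the example column
q = observed_proportions(f)
p = residue_frequencies(q)
e = expected_proportions(p)

print({k: round(v, 2) for k, v in p.items()})  # {'A': 0.9, 'S': 0.1}
print({k: round(v, 2) for k, v in e.items()})
```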

Calculating the Lod Matrix III

• Then we calculate the log-odds score in bits as sij=log2(qij/eij), so if we see more than expected, sij>0, if we see as many as expected, sij=0, and if we see less than expected, sij<0

• Multiplying s by 2 and rounding to the nearest integer, we obtain our values for the block substitution matrix (BLOSUM)
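The log-odds step can be sketched as a small function. The name `blosum_entry` and the choice to return `None` for a pair that was never observed are assumptions made for this illustration; real matrices are built from enough data that every pair is observed. Python's `round` also sends exact halves to the nearest even integer, which may differ from the rounding used for published matrices.

```python
import math


def blosum_entry(q_ij, e_ij):
    """Return the log-odds score in half-bit units, rounded to an integer.

    Returns None when the pair was never observed, because log2(0) is undefined.
    """
    if q_ij == 0:
        return None
    s_bits = math.log2(q_ij / e_ij)  # score in bits
    return round(2 * s_bits)         # multiply by 2, round to nearest integer


# Seen 4x more often than expected: 2 bits, entry 4.
print(blosum_entry(0.2, 0.05))   # 4
# Seen 4x less often than expected: -2 bits, entry -4.
print(blosum_entry(0.01, 0.04))  # -4
# Values from the one-column example are so close to expectation that they round to 0.
print(blosum_entry(0.8, 0.81), blosum_entry(0.2, 0.18))  # 0 0
```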

Clustering

• To prevent “double-counting” amino acid contributions from closely related proteins, sequences are clustered and counted as a single sequence in counting amino acids

• Thus, if two sequences are identical at >X% of their aligned positions, then contributions are averaged between the two

• In our example, if we were to cluster 8 of our sequences with A in the first position, we now have 2 As and 1 S

• These matrices will be denoted BLOSUM X, such as BLOSUM 62
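The effect of clustering on the example column can be sketched by giving each cluster a total weight of one, split evenly among its members. The cluster assignments below are supplied by hand and are an assumption of this sketch; computing them from the >X% identity rule is not shown.

```python
from collections import Counter, defaultdict


def clustered_counts(column, cluster_ids):
    """Weighted residue counts where every cluster contributes a total weight of 1."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for residue, cid in zip(column, cluster_ids):
        counts[residue] += 1 / sizes[cid]
    return dict(counts)


column = "AAAAAAAAAS"             # 9 A's and 1 S, as in the example
cluster_ids = [0] * 8 + [1, 2]    # hypothetical: 8 A-sequences form one cluster

print(clustered_counts(column, cluster_ids))  # {'A': 2.0, 'S': 1.0}
```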

Substitution Matrix (log-odds)

Based on observed frequencies of substitutions in related proteins; identical amino acids are given high positive scores, frequently observed substitutions get lower positive scores, and seldom observed substitutions get negative scores.

Related Calculations

• Relative entropy measures the average information, in bits, that distinguishes an alignment from chance

• Expected score in bit units
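As a rough illustration of relative entropy, the sketch below assumes the usual form H = sum over pairs of q_ij * s_ij, with s_ij = log2(q_ij/e_ij) left unrounded. The formula was not recoverable from the slide, so treat this form as an assumption. Pairs with q_ij = 0 contribute nothing to the sum. The expected score is left out because it needs a finite score for every pair, including ones never observed in this tiny example.

```python
import math


def relative_entropy(q, e):
    """H = sum of q_ij * log2(q_ij / e_ij) over pairs with q_ij > 0, in bits."""
    return sum(
        q_ij * math.log2(q_ij / e[pair])
        for pair, q_ij in q.items()
        if q_ij > 0
    )


# Observed and expected proportions from the one-column example.
q = {("A", "A"): 0.8, ("A", "S"): 0.2, ("S", "S"): 0.0}
e = {("A", "A"): 0.81, ("A", "S"): 0.18, ("S", "S"): 0.01}

H = relative_entropy(q, e)
print(f"{H:.4f} bits")  # 0.0161 bits
```

The value is close to zero because a single column with nine A's and one S looks almost exactly like what chance would produce from the same residue frequencies.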

Bioinformatics Approaches: Primary Structure

Flowchart box labels: Protein Sequences; ClustalW alignment; Hmmer motif creation; Hmmer database search; TRVI consensus creation; Consensus sequence developed; Search database with motif allowing for gaps and protein substitutions; MEME domain creation; MAST domain database search; Input sequences and other database proteins returned

Primary Sequence Search Methodology

Hmmer search of aligned sequences:

• Hmmer uses hidden Markov models to build a profile (a probability matrix of amino acids at each position) from aligned sequences

• The matrix is searched against the appropriate genome database

TRVI search allowing for gaps and substitutions:

• A motif is developed by allowing for a flexible number of gaps wherever there are gaps in the alignment

• Substitutions of amino acids with similar properties are allowed

• The motif is searched against the appropriate genome database

MEME/MAST search of unaligned sequences:

• Identifies a specified number of domains (probability matrices) across a subset of the input sequences

• The domains are searched against the appropriate genome database
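A motif search that tolerates substitutions and variable spacing can be approximated with regular expressions. This is not how TRVI works internally; it is a hedged sketch of the general idea. The strict pattern encodes WEK(I/M)XXFF directly, while the relaxed pattern (extra hydrophobic residues at position 4, two or three spacer residues) is a hypothetical loosening chosen for illustration. The sequences are made-up toy strings, not real Salmonella proteins.

```python
import re

# WEK(I/M)XXFF: [IM] allows either residue, "." matches any residue (X).
STRICT = re.compile(r"WEK[IM]..FF")
# Hypothetical relaxed form: similar residues at position 4, flexible spacer length.
RELAXED = re.compile(r"WEK[ILMV].{2,3}FF")


def find_motif(pattern, sequences):
    """Return {name: 1-based start position} for sequences containing the motif."""
    hits = {}
    for name, seq in sequences.items():
        match = pattern.search(seq)
        if match:
            hits[name] = match.start() + 1
    return hits


# Toy sequences invented for this example.
toy = {
    "toy1": "MKTAAWEKIQRFFLLP",
    "toy2": "MSSWEKMAAFFGT",
    "toy3": "MSSWEKLAAFFGT",
}

print(find_motif(STRICT, toy))   # {'toy1': 6, 'toy2': 4}
print(find_motif(RELAXED, toy))  # {'toy1': 6, 'toy2': 4, 'toy3': 4}
```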

How Hmmer Works: Profile Hidden Markov Models for Protein Sequence Analysis

• http://hmmer.wustl.edu/

Hmmer Architecture

• Squares are match states (consensus positions), diamonds are insertions, and circles are deletions and beginning/end. Arrows indicate state transitions.

Hidden Markov Model Background

From PMMB—Sandrine Dudoit See also http://www.ai.mit.edu/~murphyk/Bayes/rabiner.pdf

More Hidden Markov Model Background

Still More Background

Hmmer Intro

• Each M/D/I group is a node; the nodes are determined by the data and the multiple sequence alignment

• Each M state aligns with a single amino acid and carries a vector of 20 probabilities determined by the proportion of times that an amino acid has shown up in a position in a multiple sequence alignment

• Capable of handling gapped alignments

• At each node either the M (amino acid aligned) or D state is used, and I states occur between nodes and self-transition

• Arrows are transition probabilities and are estimated by the residues in each column of the multiple sequence alignment

• S, N, C, T, J are “special states” that are algorithm-dependent and controlled externally
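The idea that each match state carries residue proportions from one alignment column can be sketched as below. This only illustrates the raw-proportion idea from the slide. Hmmer itself also applies sequence weighting and priors, which are omitted here. The toy alignment is invented for the example.

```python
from collections import Counter


def match_emissions(alignment):
    """For each alignment column, return residue proportions, ignoring gaps ('-')."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile


# Toy alignment (rows are sequences, '-' is a gap); not real data.
toy_alignment = ["WEKI", "WEKM", "WEKI", "W-KM"]

for position, emissions in enumerate(match_emissions(toy_alignment), start=1):
    print(position, emissions)
```

For this toy input, the first three columns each assign probability 1.0 to a single residue, and the fourth splits evenly between I and M.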

Intermediate Hmmer

• Want to calculate P(S|M), where the sum over the space of all sequences should be 1

• …The rules of the HMM allow us to do this

• Implied that the insertions follow a geometric distribution

• From a multiple sequence alignment “seed”, Hmmer builds a consensus profile and searches databases against this profile
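The geometric distribution of insertion lengths follows from the self-transition on the I state: staying in the state k-1 times and then leaving gives a run of k inserted residues. A minimal sketch, with a hypothetical self-transition probability chosen only for the example:

```python
def insert_length_probability(k, self_transition):
    """P(insert run has exactly k residues) once the insert state is entered."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (self_transition ** (k - 1)) * (1 - self_transition)


t = 0.5  # hypothetical I -> I transition probability
print([insert_length_probability(k, t) for k in (1, 2, 3)])  # [0.5, 0.25, 0.125]

# The probabilities over all lengths sum to (approximately) 1.
print(sum(insert_length_probability(k, t) for k in range(1, 200)))
```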

Hmmer Results

ClustalW Alignment of SPI1 Effectors

ClustalW Alignment of All Known Effectors

Analysis of TRVI-Putative Cytoplasmic Proteins

• Literature search

– YciE not found

– YciF classified as a putative structural protein by Blattner et al.

• BLAST searches

– STM0274 is almost identical to SciI (S. typhimurium); other homologies to ImpC and ImpD (Rhizobium leguminosarum) and to conserved hypotheticals; no literature on SciI, ImpC, or ImpD

– YciF has homologies to other putative structural proteins in Shigella and E. coli. Also homologous to several conserved hypotheticals

– YciE has homologies to YciE from E. coli and other putative cytoplasmic/structural proteins in other species (YciE and YciF do not hit each other)

– STM3767 homologous to a 4-hydroxy-2-oxoglutarate aldolase and several hypothetical proteins

– STM4192 homologous to a nucleoprotein/polynucleotide-associated enzyme, hypothetical protein YaiL from E. coli, and hypotheticals (YaiL not in literature)

Analysis of TRVI-Microarray Proteins

• SseJ and YciE show up

• fruF is part of the phosphoenolpyruvate: fructose phosphotransferase system

• STM1181 is a putative flagella basal body part

S. typhimurium MEME Motif Summary

MEME MAST Analysis

MEME search results using MAST and searched by domain:

Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein (STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502

• Domains 1, 3, and 5 look to be important for SPI2 secretion

• The other domains are important for small, related subsets of proteins

MEME Including Putative Cytoplasmic Proteins

S. typhimurium Search Results Summary

Hmmer search of aligned sequences: Only the input sequences (+ 2 theoretically secreted proteins) were returned. SPI1 and SPI2 effectors both have significant e-values from a combined matrix.

TRVI search allowing for gaps and substitutions: 56 hits returned. Possibly interesting hits include SseI, 5 LysR family proteins, 5 putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane proteins, and 3 flagellar proteins. 4 proteins (FruF, SseJ, YciE, and a putative flagellar protein) were also identified in a DNA microarray screen under SPI2 inducing conditions with cholesterol.

MEME search results using MAST and searched by domain:

Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein (STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502

Primary Structure Conclusions

• The best lead may be YciE, a putative cytoplasmic protein found with two different search methods

• The methods did not give the same output

• Hypothetical proteins found in the literature such as SipD, SptP (SPI1) and SpiC, SrfJ, SseB,C,D (SPI2) were not found

• Not all proteins that go out via SPI2 have the WEK(I/M)XXFF motif

• There is not a clear SPI1 motif

Secondary Structure Prediction

• Psipred structure prediction server used

• Predictions made by two feed-forward neural networks based on PSI-BLAST output

• N-terminal motif (MEME 3): random coil in all SPI2 proteins

• First SPI2 motif at aa 31-38 (MEME 1): examples are SseJ, SifA, SifB(+F), SlrP(+F), SseI, SspH1(+F)

• Second SPI2 motif at aa 105-120 (no MEME): entirely random coil except for a small segment of SspH2

Secondary Structure Prediction of SifA

Alpha-helical Wheel (SifA,SifB)

WEK(I/M)XXFF is the conserved motif among SPI2 effectors from aa 34-41 (positions 1, 2, 3, 4, 7). All show this profile but SseJ (position 7 is polar, but there is still a hydrophobic face).
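A helical wheel places successive residues 100 degrees apart (3.6 residues per turn in an idealized alpha helix). The sketch below computes wheel angles for the motif written with I at position 4 and X as unspecified residues. The hydrophobic residue set is a simplifying assumption for illustration, not a definitive scale.

```python
HYDROPHOBIC = set("AVILMFW")  # assumption: a simple hydrophobic residue set


def helical_wheel(segment, degrees_per_residue=100):
    """Return (residue, angle) pairs for an idealized alpha-helical wheel."""
    return [(res, (i * degrees_per_residue) % 360) for i, res in enumerate(segment)]


wheel = helical_wheel("WEKIXXFF")
print(wheel)
# [('W', 0), ('E', 100), ('K', 200), ('I', 300), ('X', 40), ('X', 140), ('F', 240), ('F', 340)]

hydrophobic_angles = [angle for res, angle in wheel if res in HYDROPHOBIC]
print(hydrophobic_angles)  # [0, 300, 240, 340]
```

In this idealized projection the W, I, and F residues all fall within the arc from 240 to 360 degrees, which fits the hydrophobic face described on the slide, while E and K sit on the opposite side.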

SspH1 Secondary Structure

SspH2 Alpha-Helical Wheel

SseG Secondary Structure

SseG Alpha-helical Wheel

SopD Alpha-Helical Wheel

Secondary Structure Conclusion

• A hydrophobic face on the alpha helix containing the conserved motif may be at least in part responsible for the translocation signal

• Other seemingly important domains do not have secondary structure (other than random coils)

• I have not looked at the SPI1 effectors nor the putative cytoplasmic proteins in this regard

3D Structure Prediction and Comparison: Ab initio

• Prediction based solely upon the primary amino acid sequence of the protein

• Rosetta has done fairly well at CASP competitions

– David Baker at U. of Washington

• Accuracy of predictions still in question

3D Prediction and Comparison: Homology Modeling

• BLAST the protein of interest against proteins in the Brookhaven Protein Data Bank (PDB)

• If there is significant sequence similarity (approx. 30% identity), then a model for the protein of interest can be built based on the known structure(s) of the other protein(s)

• This model can be compared to other known or predicted models to determine similarity

• The main flaw is that if there is not a sequence with significant homology that has been crystallized, this method cannot be used
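The approximate 30% threshold refers to how similar the query is to a protein of known structure. One simple way to quantify this is percent identity over the aligned, ungapped positions, sketched below. Several definitions of percent identity exist, so the denominator used here (positions where neither sequence has a gap) is an assumption, and the aligned strings are toy examples.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        matches += a == b
    return 100 * matches / compared if compared else 0.0


identity = percent_identity("ADEPQ-A", "ACEPDDA")  # 4 matches over 6 compared positions
print(f"{identity:.1f}%")  # 66.7%
print(identity >= 30)      # True: above the approximate threshold from the slide
```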

Results of Swiss-Model Homology Search of All Putative and Known Effectors

• Only full-length SspH1, SspH2 and SopE had enough homology to get structures

• Only SopE gave me a result when I submitted the first 150 amino acids

• The catalytic domain of SopE has been crystallized, but the first 77 amino acids are missing

• Only the Leucine-rich repeat region of SspH1 and SspH2 could be modeled (amino acids 158 and higher)

Tertiary Structure Examples

SspH1 homology-modeled to YopM. Homology starts at amino acid 158. Geno3D2 used.

Catalytic domain of SopE (starts at aa 77) and Cdc42

Future Directions

• Do a similar primary structure analysis, but expand it to also include hypothetical proteins from the literature (19 such proteins)

• Study the different classes of proteins known to form the needle, form the translocon and act as chaperones

• Do secondary structure analysis on the known SPI1 proteins and on the putative cytoplasmic proteins just identified

• Try the Rosetta program

Acknowledgments

Kasturi Haldar

Team Salmonella: Drew “Big Daddy Salmonella” Catron, Everett Roark

Team Malaria: Paul Cheresh, Carlos Lopez-Estrano, Sean Murphy, Thanos Lykidis, Luisa Hiller, Thomas Akompong, Travis Harrison, Parwez Nawabi, Souvik Bhattacharjee

Team Bioinformatics: Dhugal Bedford, Veronica Ryskin
