Post on 26-Dec-2015
Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium
Matthew Sylvester
12/1/03
Endocytic Trafficking
[Figure: endocytic trafficking diagram. Labels: Lysosome; Salmonella-containing vacuole; Lamp-1, H+ ATPase, Cathepsins, Transferrin R, Man. 6-PR; "?"; SPI-2; SPI-1; bacterial effector proteins (SseJ, SifA, SseXs, and several others)]
Selection of S. typhimurium Proteins
• Salmonella effectors are secreted into the host cell via either the Salmonella pathogenicity island 1 (SPI1) or SPI2 type three secretion system (TTSS)
• We chose only those proteins shown experimentally in the literature to go out through one or both of these systems (see PubMed at http://ncbi.nlm.nih.gov)
• The seventeen identified SPI1 and SPI2-associated effectors were considered as one group for subsequent analysis
• As the N-terminal 150 amino acids have been shown to contain conserved sequences for several SPI2 effectors, we compared this region (Miao and Miller, 2000)
Alignment of SPI-2 Effector Proteins
Miao E and Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. PNAS (2000) 97(13), pp. 7539-7544.
Published alignment of known and putative SPI2 effectors identified by a BLAST (Basic Local Alignment Search Tool) search and then aligned using ClustalW. Note the presence of the WEK(I/M)XXFF motif from approx. aa 31-38.
BLAST
• Tries to find the most “similar” proteins
• Compares a query to sequences in a database and each comparison is given a score (higher scores are more similar)
• Scoring matrices (substitution-based) are used to assign a score based on the probability of each residue substitution
• Gap penalties are negative scores
• The alignment score is the sum of scores at each position
• Significance of overall alignment given a p-value or an e-value
– e-value = expectation value: The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
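The scoring rule above (substitution scores plus negative gap penalties, summed over positions) can be sketched as follows. This is a minimal illustration, not BLAST itself: the score table `TOY_SCORES`, the flat `GAP_PENALTY`, the default mismatch score, and the example sequences are all hypothetical values chosen for readability.

```python
# Toy substitution scores (hypothetical values, NOT a real BLOSUM matrix).
TOY_SCORES = {
    ("A", "A"): 4, ("A", "S"): 1, ("S", "S"): 4,
    ("D", "D"): 6, ("D", "E"): 2, ("E", "E"): 5,
}
GAP_PENALTY = -4        # assumed flat penalty for each gap position
DEFAULT_MISMATCH = -1   # assumed score for pairs missing from the toy table


def pair_score(a, b):
    """Score one aligned position; '-' marks a gap."""
    if a == "-" or b == "-":
        return GAP_PENALTY
    # The table is symmetric, so look the pair up in either order.
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), DEFAULT_MISMATCH))


def alignment_score(x, y):
    """Sum the per-position scores of two already-aligned sequences."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


print(alignment_score("ADE", "SDE"))    # 1 + 6 + 5
print(alignment_score("AD-E", "ADSE"))  # 4 + 6 - 4 + 5
```

Real searches typically use affine gap penalties (a larger cost to open a gap than to extend it); the flat penalty here is only a simplification.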
Building Substitution Matrices: Part I
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. PNAS (1992) 89(22), pp. 10915-10919.
Blocks: Local ungapped alignment with rows = protein segments and columns = amino acid position
1   A D E P Q D A
2   A C E P D D A
…
10  S D E P Q D A
New Sequence: A D E P Q R A
– Count the number of matches and mismatches between the new sequence and every other sequence in the block
– We have 9 AA matches and 1 AS mismatch in pos. 1
Building Substitution Matrices: Part II
Next, sum the results of each column, store results in a table and add the new sequence to the group
By successively adding new sequences, we get a table with all possible pairs
If we have 9 A’s and 1 S in the first column, we get 1 + 2 + … + 8 = 36 possible AA pairs, 9 AS or SA pairs, and 0 SS pairs
If w = block width (number of amino acid positions) and s = # sequences, we have w*s*(s-1)/2 total possible pairs. Here, we have 36 + 9 = 45, or 1*10*9/2 = 45
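The pair counts for the example column (9 A's and 1 S) can be checked with a few lines of Python. The function name `count_column_pairs` is just an illustrative choice; the counting rule is the one described above (every unordered pair of sequences in a column contributes one amino acid pair).

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered amino acid pairs in one block column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        # Sort so that AS and SA land in the same table cell.
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


column = "A" * 9 + "S"          # first column of the example block
pairs = count_column_pairs(column)
print(pairs[("A", "A")], pairs[("A", "S")], pairs[("S", "S")])
print(sum(pairs.values()))      # equals w*s*(s-1)/2 with w=1, s=10
```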
Calculating the Lod (log-odds) Matrix
• Let fij be the total number of amino acid pairs in the frequency table at position i,j (1<=j<=i<=20)
• Then the observed proportion for each amino acid pairing is: $q_{ij} = f_{ij} / \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}$
• We have fAA=36 and fAS=9, so qAA=36/45 and qAS=9/45
Calculating the Lod Matrix II
• Now we need the expected probabilities of occurrence for each amino acid pair
• If we assume that the observed frequencies of each amino acid are the population frequencies, we have $p_i = q_{ii} + \sum_{j \neq i} q_{ij}/2$
• For our example, pA=36/45+(9/45)/2=0.9 and pS=(9/45)/2=0.1
• Then the expected probability (eij) of occurrence is pipj for i=j and pipj+pjpi=2pipj for i!=j
• We have expected probability of AA=0.9*0.9=0.81, AS=2*0.9*0.1=0.18, SS=0.1*0.1=0.01
Calculating the Lod Matrix III
• Then we calculate the log-odds score in bits as sij=log2(qij/eij), so if we see more than expected, sij>0, if we see as many as expected, sij=0, and if we see less than expected, sij<0
• Multiplying s by 2 and rounding to the nearest integer, we obtain our values for the block substitution matrix (BLOSUM)
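The three Lod Matrix steps can be traced for the one-column example with the sketch below. The function name and return layout are illustrative assumptions; pairs that were never observed (SS here) are simply skipped, because log2(0) is undefined.

```python
import math


def log_odds_from_pairs(pair_counts):
    """Return (q, p, bits, rounded) for a table of unordered pair counts."""
    total = sum(pair_counts.values())
    # Observed proportion of each pair: q_ij = f_ij / total
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Amino acid frequencies: p_i = q_ii + sum over j != i of q_ij / 2
    p = {}
    for (a, b), qij in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + qij
        else:
            p[a] = p.get(a, 0.0) + qij / 2
            p[b] = p.get(b, 0.0) + qij / 2

    bits, rounded = {}, {}
    for (a, b), qij in q.items():
        # Expected probability: p_i*p_j if i == j, else 2*p_i*p_j
        e = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        if qij > 0:
            bits[(a, b)] = math.log2(qij / e)
            rounded[(a, b)] = round(2 * bits[(a, b)])  # half-bit units
    return q, p, bits, rounded


q, p, bits, rounded = log_odds_from_pairs({("A", "A"): 36, ("A", "S"): 9})
print(p)
print(bits)
print(rounded)
```

With a single column the scores stay close to zero; informative values only appear once counts are pooled over many blocks.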
Clustering
• To prevent “double-counting” amino acid contributions from closely related proteins, sequences are clustered and counted as a single sequence in counting amino acids
• Thus, if two sequences are identical at >X% of their aligned positions, then contributions are averaged between the two
• In our example, if we were to cluster 8 of our sequences with A in the first position, we now have 2 As and 1 S
• These matrices will be denoted BLOSUM X, such as BLOSUM 62
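One way to picture the clustering step is to weight every sequence by the inverse of its cluster size and to skip pairs inside a cluster, so that each cluster counts as a single sequence. The sketch below assumes the cluster assignment is already known (the `cluster_ids` list is hypothetical input; deciding the clusters from the >X% identity rule is not shown).

```python
from collections import Counter
from itertools import combinations


def clustered_pair_counts(column, cluster_ids):
    """Count amino acid pairs in a column, treating each cluster as one sequence."""
    sizes = Counter(cluster_ids)
    pairs = Counter()
    for i, j in combinations(range(len(column)), 2):
        if cluster_ids[i] == cluster_ids[j]:
            continue  # pairs inside a cluster are not counted
        # Each cluster contributes a total weight of 1 to every pairing.
        weight = 1.0 / (sizes[cluster_ids[i]] * sizes[cluster_ids[j]])
        pairs[tuple(sorted((column[i], column[j])))] += weight
    return pairs


column = "A" * 9 + "S"
cluster_ids = [0] * 8 + [1, 2]   # 8 A-sequences clustered; 1 A and 1 S left alone
weighted = clustered_pair_counts(column, cluster_ids)
print(weighted[("A", "A")], weighted[("A", "S")])
```

The result matches the slide's "2 As and 1 S": three effective sequences give 3*2/2 = 3 pairs in total.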
Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in related proteins; identical amino acids are given high positive scores, frequently observed substitutions get lower positive scores, and seldom observed substitutions get negative scores.
Related Calculations
• Relative entropy, $H = \sum_{i=1}^{20} \sum_{j=1}^{i} q_{ij} s_{ij}$, measures the average information per residue pair, in bits, that distinguishes an alignment from chance
• Expected score in bit units
Bioinformatics Approaches: Primary Structure
[Figure: workflow diagram. Box labels recovered from the figure:]
Protein Sequences
ClustalW alignment
Hmmer motif creation; Hmmer database search
TRVI consensus creation; Consensus sequence developed; Search database with motif allowing for gaps and protein substitutions
MEME domain creation; MAST domain database search
Input sequences and other database proteins returned (shown once for each of the three searches)
Primary Sequence Search Methodology
Hmmer search of aligned sequences:
• Hmmer uses hidden Markov models to build a profile (a position-specific probability matrix of amino acids) from aligned sequences
• The matrix is searched against the appropriate genome database
TRVI search allowing for gaps and substitutions:
• A motif is developed by allowing for a flexible number of gaps wherever there are gaps in the alignment
• Substitutions of amino acids with similar properties are allowed
• The motif is searched against the appropriate genome database
MEME/MAST search of unaligned sequences:
• Identifies a specified number of domains (probability matrices) across a subset of the input sequences
• The domains are searched against the appropriate genome database
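To make the idea behind the TRVI step concrete (a motif that tolerates flexible gaps and substitutions by similar residues), here is a small regular-expression sketch. It is not the TRVI program: the `SIMILAR` residue groups, the `max_extra_gap` parameter, and the test sequences are all hypothetical, and only the WEK(I/M)XXFF motif comes from the slides.

```python
import re

# Residue groups treated as interchangeable (illustrative assumption only).
SIMILAR = {"I": "ILMV", "M": "ILMV", "F": "FWY", "W": "FWY", "K": "KR", "E": "DE"}


def motif_to_regex(motif, max_extra_gap=0):
    """Turn a motif such as 'WEKIXXFF' into a compiled regular expression.

    'X' matches any residue; with max_extra_gap > 0 each X may stretch to
    cover up to that many additional residues.
    """
    parts = []
    for ch in motif:
        if ch == "X":
            parts.append("." if max_extra_gap == 0 else ".{1,%d}" % (1 + max_extra_gap))
        else:
            parts.append("[" + SIMILAR.get(ch, ch) + "]")
    return re.compile("".join(parts))


strict = motif_to_regex("WEKIXXFF")
flexible = motif_to_regex("WEKIXXFF", max_extra_gap=1)

print(strict.pattern)
print(strict.search("MSTWEKMAAFFG").start())          # made-up sequence
print(flexible.search("WEKIAAAFF") is not None)       # one extra gap residue
```

Scanning a genome would then amount to running `search` over every protein sequence in the database file.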
How Hmmer Works: Profile Hidden Markov Models for Protein Sequence Analysis
• http://hmmer.wustl.edu/
Hmmer Architecture
• Squares are match states (consensus positions), diamonds are insertions, circles are deletions and beginning/end. Arrows indicate state transitions.
Hidden Markov Model Background
From PMMB—Sandrine Dudoit See also http://www.ai.mit.edu/~murphyk/Bayes/rabiner.pdf
Hmmer Intro
• Each M/D/I triplet is a node; the nodes are determined by the data and the multiple sequence alignment
• Each M state aligns with a single amino acid and carries a vector of 20 probabilities determined by the proportion of times that an amino acid has shown up in a position in a multiple sequence alignment
• Capable of handling gapped alignments
• At each node either the M (amino acid aligned) or D state is used, and I states occur between nodes and self-transition
• Arrows are transition probabilities and are estimated from the residues in each column of the multiple sequence alignment
• S, N, C, T, J are “special states” that are algorithm-dependent and controlled externally
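The 20-probability vector carried by each match state can be illustrated by counting residues per alignment column. This sketch makes two simplifying assumptions that differ from the real program: every non-empty column is treated as a match state, and a plain additive `pseudocount` stands in for the priors Hmmer applies. The toy alignment is made up.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(aligned_seqs, pseudocount=0.0):
    """Per-column amino acid probabilities from a multiple sequence alignment."""
    emissions = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        if total == 0:
            continue  # all-gap column with no pseudocount: nothing to estimate
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        )
    return emissions


toy_alignment = ["WEK", "WEK", "WDK", "-EK"]   # hypothetical aligned segment
emissions = match_emissions(toy_alignment)
print(emissions[0]["W"], emissions[1]["E"], emissions[1]["D"])
```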
Intermediate Hmmer
• Want to calculate P(S|M), where the sum over the space of all sequences should be 1
• …The rules of the HMM allow us to do this
• This implies that insertion lengths follow a geometric distribution
• From a multiple sequence alignment “seed”, Hmmer makes a consensus sequence and searches databases against this consensus sequence
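The geometric distribution of insertions follows from the self-transition on the I state: once the state is entered, each further residue is emitted with the self-transition probability t, so an insertion of length k has probability t^(k-1) * (1 - t). A short sketch, with an arbitrary t = 0.5 chosen only for illustration:

```python
def insert_length_prob(k, self_transition):
    """P(an insert state emits exactly k residues once entered), for k >= 1."""
    return (self_transition ** (k - 1)) * (1.0 - self_transition)


t = 0.5  # hypothetical self-transition probability
probs = [insert_length_prob(k, t) for k in range(1, 201)]
print(probs[:3])     # probabilities of lengths 1, 2, 3
print(sum(probs))    # approaches 1 as more lengths are included
```

Because the lengths sum to 1, the insert state does not break the requirement that P(S|M) sums to 1 over all sequences.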
Analysis of TRVI-Putative Cytoplasmic Proteins
• Literature search
– YciE not found
– YciF classified as a putative structural protein by Blattner et al.
• BLAST searches
– STM0274 almost exactly SciI (S. typhimurium); other homologies to ImpC and ImpD (Rhizobium leguminosarum), and conserved hypotheticals; no literature on SciI, ImpC, or ImpD
– YciF has homologies to other putative structural proteins in Shigella and E. coli. Also homologous to several conserved hypotheticals
– YciE has homologies to YciE from E. coli and other putative cytoplasmic/structural proteins in other species (YciE and YciF do not hit each other)
– STM3767 homologous to a 4-hydroxy-2-oxoglutarate aldolase and several hypothetical proteins
– STM4192 homologous to a nucleoprotein/polynucleotide-associated enzyme, hypothetical protein YaiL from E. coli, and hypotheticals (YaiL not in literature)
Analysis of TRVI-Microarray Proteins
• SseJ and YciE show up
• fruF is part of the phosphoenolpyruvate: fructose phosphotransferase system
• STM1181 is a putative flagellar basal body component
MEME MAST Analysis
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein (STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
• Domains 1, 3, and 5 look to be important for SPI2 secretion
• The other domains are important for small, related subsets of proteins
S. typhimurium Search Results Summary
Hmmer search of aligned sequences:
Only the input sequences (+ 2 theoretically secreted proteins) were returned. SPI1 and SPI2 effectors both have significant e-values from a combined matrix.
TRVI search allowing for gaps and substitutions:
56 hits returned. Possible interesting hits include SseI, 5 LysR family proteins, 5 putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane proteins, and 3 flagellar proteins. 4 proteins (FruF, SseJ, YciE, and a putative flagellar protein) were also identified in a DNA microarray screen under SPI2 inducing conditions with cholesterol.
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, Putative inner membrane protein (STM1698)
Domain 4: YfeC, Putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
Primary Structure Conclusions
• The best lead may be YciE, a putative cytoplasmic protein found with two different search methods
• The methods did not give the same output
• Hypothetical proteins found in the literature such as SipD, SptP (SPI1) and SpiC, SrfJ, SseB,C,D (SPI2) were not found
• Not all proteins that go out via SPI2 have the WEK(I/M)XXFF motif
• There is not a clear SPI1 motif
Secondary Structure Prediction
• Psipred structure prediction server used
• Predictions made by two feed-forward neural networks based on PSI-BLAST output
• N-terminal motif (MEME 3): random coil in all SPI2 proteins
• First SPI2 motif at aa 31-38 (MEME 1): examples are SseJ, SifA, SifB(+F), SlrP(+F), SseI, SspH1(+F)
• Second SPI2 motif at aa 105-120 (no MEME): entirely random coil except for a small segment of SspH2
Alpha-helical Wheel (SifA,SifB)
WEK(I/M)XXFF is the conserved motif among SPI2 effectors from aa 34-41 (positions 1, 2, 3, 4, 7). All show this profile but SseJ (position 7 is polar; still a hydrophobic face).
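A helical wheel places successive residues 100 degrees apart (an ideal alpha helix has about 3.6 residues per turn). The sketch below computes those angles for the motif written with I at position 4 and X for the variable positions; the `HYDROPHOBIC` set is a common but simplified classification and is an assumption here, not something taken from the slides.

```python
HYDROPHOBIC = set("AILMFWVC")   # simplified classification (assumption)


def helical_wheel_angles(segment, degrees_per_residue=100.0):
    """Return (position, residue, angle in degrees) around an ideal alpha helix."""
    return [
        (i + 1, aa, (i * degrees_per_residue) % 360.0)
        for i, aa in enumerate(segment)
    ]


wheel = helical_wheel_angles("WEKIXXFF")
for position, aa, angle in wheel:
    print(position, aa, angle)

# Angles of the residues classified as hydrophobic in this sketch.
hydrophobic_angles = [angle for _, aa, angle in wheel if aa in HYDROPHOBIC]
print(hydrophobic_angles)
```

In this sketch W, I and the two F residues land at 0, 300, 240 and 340 degrees, that is, within a 120-degree arc on one side of the wheel.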
Secondary Structure Conclusion
• A hydrophobic face on the alpha helix containing the conserved motif may be at least in part responsible for the translocation signal
• Other seemingly important domains do not have secondary structure (other than random coils)
• I have not looked at the SPI1 effectors nor the putative cytoplasmic proteins in this regard
3D Structure Prediction and Comparison: Ab initio
• Prediction based solely upon the primary amino acid sequence of the protein
• Rosetta has done fairly well at CASP competitions
– David Baker at U. of Washington
• Accuracy of predictions still in question
3D Prediction and Comparison: Homology Modeling
• BLAST the protein of interest against proteins in the Brookhaven Protein Data Bank (PDB)
• If there is significant homology (approx. 30% sequence identity), then a model for the protein of interest can be determined based on the known structure(s) of the other protein(s)
• This model can be compared to other known or predicted models to determine similarity
• The main flaw is that if there is not a sequence with significant homology that has been crystallized, this method cannot be used
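The "approx. 30%" rule of thumb can be checked on a pairwise alignment with a short percent-identity function. This is a sketch under one stated convention (identical positions divided by positions where neither sequence has a gap); other denominators are also in use, and the example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0


identity = percent_identity("ACDE-F", "ACGEHF")   # hypothetical alignment
print(identity)
print(identity >= 30.0)   # would pass the rough homology-modeling threshold
```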
Results of Swiss-Model Homology Search of all Putative and Known Effectors
• Only full-length SspH1, SspH2 and SopE had enough homology to get structures
• Only SopE gave me a result when I submitted the first 150 amino acids
• The catalytic domain of SopE has been crystallized, but the first 77 amino acids are missing
• Only the Leucine-rich repeat region of SspH1 and SspH2 could be modeled (amino acids 158 and higher)
Tertiary Structure Examples
SspH1 homology-modeled to YopM. Homology starts at amino acid 158. Geno3D2 used.
Catalytic domain of SopE (starts at aa 77) and Cdc42
Future Directions
• Do a similar primary structure analysis but expanding to also include hypothetical proteins from the literature (19 such proteins)
• Study the different classes of proteins known to form the needle, form the translocon and act as chaperones
• Do secondary structure analysis on the known SPI1 proteins and on the putative cytoplasmic proteins just identified
• Try the Rosetta program