![Page 1: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/1.jpg)
CS177 Lecture 13 Review/Summary of the Madej
lectures
Tom Madej 12.06.04
![Page 2: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/2.jpg)
Overview
• Basic biology.
• Protein/DNA sequence comparison.
• Protein structure comparison/classification.
• NCBI databases overview.
• Miscellaneous topics.
![Page 3: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/3.jpg)
![Page 4: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/4.jpg)
Lodish et al. Molecular Cell Biology, W.H. Freeman 2000
![Page 5: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/5.jpg)
Protein/DNA sequence comparison
• What is the meaning of a sequence alignment?
• Scoring methods; amino acid substitution matrices, PSSMs.
• Basic computational methods; e.g. BLAST.
• Know how to run PSI-BLAST, interpret the results.
![Page 6: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/6.jpg)
Homology
“… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.”
E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003
![Page 7: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/7.jpg)
![Page 8: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/8.jpg)
A simple phylogenetic tree…
![Page 9: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/9.jpg)
Human hemoglobin and more distantly related globins
• Human and horse
• Human and fish
• Human and insect
• Human and bacteria
![Page 10: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/10.jpg)
Alignment notation: different notations for the same alignment!
VISDWNMPN-------MDGLECILVV----AANDGPMPQTRE
VISDWnm---pnMDGLECILVVaandgpmPQTRE
![Page 11: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/11.jpg)
Computing sequence alignments
• You must be able to recognize the “answer” (correct alignment) when you see it (scoring system).
• You must be able to find the answer; i.e. compute it efficiently.
![Page 12: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/12.jpg)
Scoring and computing alignments
• “Position independent” amino acid substitution tables; e.g. BLOSUM62.
• Global alignment algorithms such as Smith-Waterman (dynamic programming); or fast heuristics such as BLAST.
![Page 13: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/13.jpg)
![Page 14: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/14.jpg)
Score this alignment:
VISDWnm---pnMDGLECILVVaandgpmPQTRE
Use: BLOSUM62 matrix; gap opening penalty 10;gap extension penalty 1
(-1 + 4 – 2 – 3 – 3) –10 – 1*11 + (-2 + 0 – 2 – 2 + 5) = -27
![Page 15: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/15.jpg)
BLAST (Basic Local Alignment Search Tool)
• Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman.
• Method of choice for database searches.
• Statistical theory for significance of results (extreme value distribution).
• Heuristic; does not guarantee optimal results.
• Many variants, e.g. PHI-, PSI-, RPS-BLAST.
![Page 16: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/16.jpg)
Why database searches?
• Gene finding.
• Assigning likely function to a gene.
• Identifying regulatory elements.
• Understanding genome evolution.
• Assisting in sequence assembly.
• Finding relations between genes.
![Page 17: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/17.jpg)
Issues in database searches
• Speed.
• Relevance of the search results (selectivity).
• Recovering all information of interest (sensitivity).– The results depend on the search parameters, e.g. gap
penalty, scoring matrix.– Sometimes searches with more than one matrix should be
performed.
![Page 18: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/18.jpg)
E-values, P-values
• E-value, Expectation value; this is the expected number of hits of at least the given score, that you would expect by random chance for the search database.
• P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database.
• E-values are easier to interpret than P-values.
• If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value.
![Page 19: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/19.jpg)
PSI-BLAST
• Position Specific Iterated BLAST
• As a first step runs a (regular) BLAST.
• Hits that cross the threshold are used to construct a position specific score matrix (PSSM).
• A new search is done using the PSSM to find more remotely related sequences.
• The last two steps are iterated until convergence.
![Page 20: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/20.jpg)
PSSM (Position Specific Score Matrix)
• One column per residue in the query sequence.
• Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column.
• There are difficulties; e.g. pseudo-counts are needed if there are not a lot of sequences, the sequences must be weighted to compensate for redundancy.
![Page 21: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/21.jpg)
Two key advantages of PSSMs
• More sensitive scoring because of improved estimates of probabilities for a.a.’s at specific positions.
• Describes the important motifs that occur in the protein family and therefore enhances the selectivity.
![Page 22: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/22.jpg)
Position Specific Substitution Rates
Active site serineWeakly conserved serine
![Page 23: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/23.jpg)
Position Specific Score Matrix (PSSM)
A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 D -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3
Active site nucleophile
Serine scored differently in these two positions
![Page 24: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/24.jpg)
PSI-BLAST key points
• The first PSSM is constructed from all hits that cross the significance threshold using “standard” BLAST.
• The search is then carried out with the PSSM to draw in new significant hits.
• If new hits are found then a new PSSM is constructed; these last two steps are iterated.
• The computation terminates upon “convergence”, i.e. when no new sequences are found to cross the significance threshold.
![Page 25: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/25.jpg)
Protein structure comparison/classification
• Protein secondary structure elements.
• Supersecondary structures (simple structure motifs).
• Folds and domains.
• Comparing structures (VAST).
• Superfolds.
• Fold classification (SCOP).
• Conserved Domain Database (CDD).
![Page 26: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/26.jpg)
α-helix (3chy)
backbone atoms with sidechains
![Page 27: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/27.jpg)
Parallel β-strands (3chy)
![Page 28: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/28.jpg)
Anti-parallel β-strands (1hbq)
![Page 29: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/29.jpg)
Higher level organization
• A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions.
• Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C.
![Page 30: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/30.jpg)
Supersecondary structures
• β-hairpin
• α-hairpin
• βαβ-unit
• β4 Greek key
• βα Greek key
![Page 31: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/31.jpg)
Supersecondary structure: simple units
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
![Page 32: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/32.jpg)
Supersecondary structure: Greek key motifs
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
![Page 33: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/33.jpg)
Protein folds
• There is a continuum of similarity!
• Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing.
• Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).
![Page 34: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/34.jpg)
Vector Alignment Search Tool (VAST)
• Fast structure comparison based on representing SSEs by vectors.
• A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value).
• VAST structure neighbor lists useful for recognizing structural similarity.
![Page 35: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/35.jpg)
Superfolds (Orengo, Jones, Thornton)
• Distribution of fold types is highly non-uniform.
• There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of “isolated” fold types.
• Superfolds are characterized by a wide range of sequence diversity and spanning a range of non-similar functions.
• It is a research question as to the evolutionary relationships of the superfolds, i.e. do they arise by divergent or convergent evolution?
![Page 36: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/36.jpg)
Superfolds and examples
• Globin 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin
• α-up-down 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3
• Trefoil 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor
• TIM barrel 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco
• OB fold 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit
• α/β doubly-wound 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY
• Immunoglobulin 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin
• UB αβ roll 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G
• Jelly roll 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin
• Plaitfold (Split αβ sandwich) 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier
![Page 37: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/37.jpg)
SCOP (Structural Classification of Proteins)
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Levels of the SCOP hierarchy:– Family: clear evolutionary relationship– Superfamily: probable common evolutionary origin– Fold: major structural similarity
![Page 38: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/38.jpg)
![Page 39: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/39.jpg)
Bioinformatics databases
• Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc.
• Other specialty databases available on the internet can also be very useful, of course!
![Page 40: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/40.jpg)
Genomes
Taxonomy
Links Between and Within Nodes
PubMed abstracts
Nucleotide sequences
Protein sequences
3-D Structure
3 -D Structures
Word weight
VAST
BLASTBLAST
Phylogeny
ComputationalComputational
Computational
Computational
![Page 41: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/41.jpg)
Entrez queries
• Be able to formulate queries using index terms (Preview/Index), and limits.
![Page 42: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/42.jpg)
![Page 43: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/43.jpg)
Exercises!
• How many protein structures are there that include DNA and are from bacteria?
• In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000?
• Notice that the results are not 100% accurate!
• In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse?
![Page 44: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/44.jpg)
P53 tumor suppressor protein
• Li-Fraumeni syndrome; only one functional copy of p53 predisposes to cancer.
• Mutations in p53 are found in most tumor types.
• p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein cdk2. This prevents the cell from progressing thru the cell cycle.
![Page 45: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/45.jpg)
G. Giglia-Mari, A. Sarasi, Hum. Mutat. (2003) 21 217-228.
![Page 46: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/46.jpg)
Exercise!
• Use Cn3D to investigate the binding of p53 to DNA.
• Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).
![Page 47: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/47.jpg)
Miscellaneous topics
• BLAST a sequence against a genome; locate hits on chromosomes with map viewer.
• Obtain genomic sequence with map viewer.
• Spidey to predict intron/exon structure.
• How sequence variations can affect protein structure/function.
![Page 48: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/48.jpg)
“EST exercise” summary
• BLAST the EST (or other DNA seq) against the genome.
• From the BLAST output you can get the genomic coordinates of any nucleotide differences.
• Use map viewer to locate the hit on a chromosome; assume the hit is in the region of a gene.
• By following the gene link you can get an accession for mRNA.• By using the “dl” link you can get an accession for the genomic
sequence.
• Use “spidey” with the mRNA and genomic sequence to locate changed residues in the protein.
![Page 49: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/49.jpg)
“EST exercise” summary (cont.)
• From the gene report you can follow the protein link, and then “Blink”.
• From the BLAST link page you can get to CDD and related structures.
• Since you know where are the changed residues you can use the structures to study what effect the changes might have on the function of the protein.
![Page 50: CS177 Lecture 13 Review/Summary of the Madej lectures](https://reader035.vdocuments.mx/reader035/viewer/2022062801/56814457550346895db0f2b8/html5/thumbnails/50.jpg)
Gene variants that can affect protein function
• Mutation to a stop codon; truncates the protein product!
• Insertion/deletion of multiple bases; changes the sequence of amino acid residues.
• Single point change could alter folding properties of the protein.
• Single point change could affect the active site of the protein.
• Single point change could affect an interaction site with another molecule.