homology and sequence alignment
![Page 1: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/1.jpg)
1
Homology and sequence alignment.
![Page 2: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/2.jpg)
Homology
Homology = similarity between objects due to a common ancestry
Hund = Dog, Schwein = Pig
![Page 3: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/3.jpg)
3
Sequence homology
VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG
Similarity between sequences as a result of common ancestry.
![Page 4: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/4.jpg)
4
Sequence alignment
Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
![Page 5: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/5.jpg)
5
Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
![Page 6: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/6.jpg)
6
Sequence alignment
If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
(Tree figure: a common ancestor with two leaves, VLSPAV-WAKV and VLSEAVLWAKV)
![Page 7: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/7.jpg)
7
Perfect match
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
(Tree figure with leaves VLSPAV-WAKV and VLSEAVLWAKV)
A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).
![Page 8: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/8.jpg)
8
A substitution
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
(Tree figure with leaves VLSPAV-WAKV and VLSEAVLWAKV)
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).
![Page 9: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/9.jpg)
9
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.
(Tree figure: ancestor VLSEAVLWAKV; leaves VLSPAV-WAKV and VLSEAVLWAKV)
![Page 10: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/10.jpg)
10
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.
(Tree figure: ancestor VLSEAVWAKV; leaves VLSPAV-WAKV and VLSEAVLWAKV, with the L marked as inserted)
![Page 11: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/11.jpg)
11
Indel
VLSPAV-WAKV
VLSEAVLWAKV
Deletion? Insertion?
Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.
![Page 12: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/12.jpg)
12
Global vs. Local
• Global alignment – finds the best alignment across the entire two sequences.
• Local alignment – finds regions of similarity in parts of the sequences.
Global alignment: forces alignment in regions which differ
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment: returns only regions of good alignment
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
![Page 13: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/13.jpg)
13
Global alignment
PTK2 protein tyrosine kinase 2 of human and rhesus monkey
![Page 14: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/14.jpg)
14
Proteins are composed of domains
Domain B
Protein tyrosine kinase domain
Domain A
Human PTK2 :
![Page 15: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/15.jpg)
15
Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed.
Domain X
Protein tyrosine kinase domain
Domain A
![Page 16: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/16.jpg)
16
Domain X
Protein tyrosine kinase domain
Domain B
Protein tyrosine kinase domain
Domain A
Leukocyte TK
PTK2
The sequence similarity is restricted to a single domain
![Page 17: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/17.jpg)
17
Global alignment of PTK and LTK
![Page 18: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/18.jpg)
18
Local alignment of PTK and LTK
![Page 19: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/19.jpg)
19
Conclusions
Use global alignment when the two sequences share the same overall sequence arrangement.
Use local alignment to detect regions of similarity.
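The difference between the two modes can be sketched with the classic dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). This is a minimal sketch, not the method used to produce the slides: the function names and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions chosen for illustration, and only the score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning ALL of a with ALL of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,     # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (local behaviour).
            curr.append(max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only a central "ACGT" region (made-up example).
x, y = "TTTTACGTAAAA", "GGGGACGTCCCC"
print(global_score(x, y), local_score(x, y))
```

The global score is dragged down by the dissimilar flanks, while the local score reflects only the shared region, which mirrors the PTK2 vs. leukocyte TK example.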
![Page 20: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/20.jpg)
20
How alignments are computed
![Page 21: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/21.jpg)
21
Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
![Page 22: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/22.jpg)
22
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches
![Page 23: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/23.jpg)
23
Choosing an alignment for a pair of sequences
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Many different alignments are possible for 2 sequences:
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
![Page 24: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/24.jpg)
24
Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score → better alignment
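The naïve scoring scheme can be checked with a few lines of code. The sketch below scores an already-aligned pair of rows column by column, using the slide's values (+1 match, -2 mismatch, -1 gap); the function name `score_alignment` is made up for illustration.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores for two aligned rows of equal length ('-' = gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


# The two candidate alignments of AAGCTGAATTCGAA and AGGCTCATTTCTGA.
print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scheme the first alignment scores higher and would be preferred.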
![Page 25: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/25.jpg)
25
Alignment scoring - scoring of sequence similarity:
Assumes independence between positions: each position is considered separately
Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)
Total score = sum of position scores
Can be positive or negative
![Page 26: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/26.jpg)
26
Scoring system
• In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary
• Different scoring systems → different alignments
• We want a good scoring system…
![Page 27: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/27.jpg)
27
DNA scoring matrices
Can take into account biological phenomena such as:
• Transitions vs. transversions (transitions, A↔G and C↔T, typically occur more often than transversions)
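A DNA scoring function that distinguishes transitions from transversions could look like the sketch below. The numeric values (+1, -1, -2) are illustrative assumptions, not values taken from any published matrix; the only point is that a transition (purine↔purine or pyrimidine↔pyrimidine) is penalised less than a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (assumed illustrative values)."""
    if x == y:
        return match
    # Both purines or both pyrimidines: a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


print(substitution_score("A", "G"))  # transition
print(substitution_score("A", "C"))  # transversion
```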
![Page 28: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/28.jpg)
28
Amino-acid scoring matrices
• Take into account physico-chemical properties
![Page 29: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/29.jpg)
29
Scoring gaps (I)
In advanced algorithms, two gaps of one amino acid each are scored differently from one gap of two amino acids. This is achieved by charging a penalty for each gap that is opened, plus a smaller penalty for each position by which the gap is extended.
Gap extension penalty < Gap opening penalty
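One common way to express this is an affine gap penalty. The sketch below assumes the convention that the opening penalty is paid for the first gapped position and the extension penalty for every further position; other conventions exist, and the values -10 and -1 are placeholders rather than recommended settings.

```python
import re


def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Score of ONE contiguous gap spanning `length` positions."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def gap_score(aligned_row, gap_open=-10, gap_extend=-1):
    """Total gap score of one aligned row: every run of '-' is one gap."""
    return sum(affine_gap_cost(len(m.group()), gap_open, gap_extend)
               for m in re.finditer(r"-+", aligned_row))


print(gap_score("AC--GT"))  # one gap of two positions
print(gap_score("A-C-GT"))  # two gaps of one position each
```

With these values a single gap of two positions costs less than two separate gaps of one position, which is the behaviour the slide describes.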
![Page 30: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/30.jpg)
30
Homology versus chance similarity
How to check if the score is significant?
A. Take the two sequences → compute their alignment score.
B. Take one sequence, randomly shuffle it, and compute its score against the second sequence. Repeat 100,000 times.
If the score in A is in the top 5% of the scores in B, the similarity is significant.
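The shuffling procedure can be sketched as follows. Several things here are assumptions made for illustration: a simple global-alignment score with +1/-1/-1 stands in for whatever scoring is actually used, the default number of shuffles is reduced from 100,000 to keep the example fast, and the +1 correction in the p-value is one common convention rather than part of the slide.

```python
import random


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch), score only."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_p_value(a, b, n_shuffles=1000, seed=0):
    """Fraction of shuffled versions of b that score at least as well as b."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = nw_score(a, b)          # step A: score of the real pair
    letters = list(b)
    at_least = 0
    for _ in range(n_shuffles):        # step B: scores of shuffled sequences
        rng.shuffle(letters)
        if nw_score(a, "".join(letters)) >= observed:
            at_least += 1
    # +1 in numerator and denominator avoids reporting a p-value of exactly 0.
    return observed, (at_least + 1) / (n_shuffles + 1)


score, p = shuffle_p_value("ACGTTGCAAGCTTGACCAGT", "ACGTTGCAAGCTTGACCAGT", 200)
print(score, p, "significant" if p <= 0.05 else "not significant")
```

Shuffling keeps the composition and length of the sequence, so the test asks whether the order of residues, and not just their frequencies, explains the score.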
![Page 31: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/31.jpg)
31
How close?
• Rule of thumb:
• Proteins are homologous if they are at least 25% identical (length >100)
• DNA sequences are homologous if they are at least 70% identical
![Page 32: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/32.jpg)
32
Twilight zone
• < 25% identity in proteins – may be homologous and may not be….
• (Note that 5% identity will be obtained completely by chance!)
![Page 33: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/33.jpg)
33
Searching a sequence database
Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs
The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database
![Page 34: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/34.jpg)
34
Some terminology
• Query sequence - the sequence with which we are searching
• Hit – a sequence found in the database, suspected as homologous
![Page 35: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/35.jpg)
35
Query sequence: DNA or protein?
• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.
• Which is preferable?
![Page 36: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/36.jpg)
36
Protein is better!
• Selection (and hence conservation) works (mostly) at the protein level:
CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
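The example can be verified directly: the two DNA sequences differ at almost every position, yet translate to the same dipeptide. The sketch below uses a deliberately tiny codon table containing only the four codons in the example; `translate` and `identity` are hypothetical helper names.

```python
# Only the codons needed for this example (standard genetic code).
CODONS = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a coding sequence codon by codon."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1, dna2 = "CTTTCA", "TTGAGT"
print(translate(dna1), translate(dna2))               # same protein
print(identity(dna1, dna2))                           # DNA-level identity
print(identity(translate(dna1), translate(dna2)))     # protein-level identity
```

At the DNA level only one of six positions matches, while at the protein level the two sequences are identical.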
![Page 37: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/37.jpg)
37
Query type
• Nucleotides: a four letter alphabet
• Amino acids: a twenty letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity
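These two figures follow from a simple calculation if every letter is assumed to be equally frequent and positions are independent: the chance that two random letters agree is the sum of the squared letter frequencies, i.e. 1/4 for DNA and 1/20 for proteins. A minimal sketch (real sequences have unequal letter frequencies, so actual values differ somewhat):

```python
def expected_identity(freqs):
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in freqs)


dna = [1 / 4] * 4        # assumed uniform nucleotide frequencies
protein = [1 / 20] * 20  # assumed uniform amino-acid frequencies
print(expected_identity(dna), expected_identity(protein))
```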
![Page 38: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/38.jpg)
38
Conclusion
The amino-acid sequence is often preferable for homology search
![Page 39: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/39.jpg)
39
How do we search a database?
• If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search.
• 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
![Page 40: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/40.jpg)
40
Conclusion
• Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow
![Page 41: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/41.jpg)
41
Heuristic
• Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives a reasonably good one) while running much faster than the exact algorithm
![Page 42: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/42.jpg)
42
BLAST
![Page 43: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/43.jpg)
43
BLAST
• BLAST - Basic Local Alignment Search Tool
• A heuristic for searching a database for similar sequences
![Page 44: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/44.jpg)
44
DNA or Protein
• All types of searches are possible
Query: DNA, Protein
Database: DNA, Protein
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database
Translated databases:
trEMBL
GenPept
![Page 45: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/45.jpg)
45
BLAST - underlying hypothesis
• The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences
![Page 46: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/46.jpg)
46
How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be searched quickly
WTDFGYPAILKGGTAC
WTD, TDF, DFG, FGY, GYP, …
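The word look-up table can be sketched as a dictionary from each word to the places where it occurs. This is a simplified illustration of the idea, not BLAST's actual data structure; the function names and the single-sequence "database" are assumptions.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w=3):
    """Map every word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


index = build_index({"seq1": "WTDFGYPAILKGGTAC"})
print(words("WTDFGYP"))   # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
print(index["DFG"])       # where the word DFG occurs in the database
```

Looking up a query word is then a single dictionary access instead of a scan over the whole database.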
![Page 47: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/47.jpg)
47
BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words
• Search the database for consecutive neighboring words
![Page 48: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/48.jpg)
48
Neighbor words
• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level
GFB
GFC (20)
GPC (11)
WAC (5)
![Page 49: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/49.jpg)
49
E-value
• The number of times we would theoretically find an alignment with a score ≥ Y when aligning a random sequence against a random database
Theoretically, we could trust any result with an E-value ≤ 1
In practice, BLAST uses estimations:
E-values of 10^-4 and lower indicate a significant homology.
E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
E-values between 10^-2 and 1 do not indicate a good homology.
![Page 50: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/50.jpg)
Web servers for pairwise alignment
![Page 51: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/51.jpg)
BLAST 2 sequences (bl2Seq) at NCBI
Produces the local alignment of two given sequences using BLAST (Basic Local Alignment Search Tool) engine for local alignment
• Does not use an exact algorithm but a heuristic
![Page 52: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/52.jpg)
Back to NCBI
![Page 53: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/53.jpg)
BLAST – bl2seq
![Page 54: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/54.jpg)
Bl2Seq - query
blastn: nucleotide
blastp: protein
![Page 55: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/55.jpg)
Bl2seq results
![Page 56: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/56.jpg)
Bl2seq results
(Figure labels: Match, Dissimilarity, Gaps, Similarity, Low complexity)
![Page 57: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/57.jpg)
BLAST – programs
Query: DNA Protein
Database: DNA Protein
![Page 58: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/58.jpg)
BLAST – Blastp
![Page 59: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/59.jpg)
Blastp - results
![Page 60: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/60.jpg)
Blastp – results (cont’d)
![Page 61: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/61.jpg)
Blast scores:
• Bit score – A score for the alignment according to the number of similarities, identities, etc.
• Expect value (E-value) – The number of alignments with the same or a better score one can “expect” to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog
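The two scores are linked: in the standard BLAST statistics, the E-value for a bit score S' is approximately m × n × 2^(-S'), where m and n are the (effective) lengths of the query and the database. The sketch below applies that relation with plain lengths; BLAST's own length corrections are not reproduced, and the function name and the example numbers are assumptions for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value for a bit score in a search space of m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Made-up example: a 100-residue query against a 10^7-residue database.
for bits in (20, 30, 40, 50):
    print(bits, expect_value(bits, 100, 10**7))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.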
![Page 62: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/62.jpg)
Blastp – acquiring sequences
![Page 63: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/63.jpg)
Blastp – acquiring sequences (cont’d)
![Page 64: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/64.jpg)
64
Multiple sequence alignment
Similar to pairwise alignment, BUT n sequences are aligned instead of just 2
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
![Page 65: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/65.jpg)
65
Multiple sequence alignment
MSA = Multiple Sequence Alignment
Each row represents an individual sequence
Each column represents the ‘same’ position
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
![Page 66: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/66.jpg)
66
Conserved positions
• Columns in which all the sequences contain the same amino acids or nucleotides
• Important for the function or structure
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
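Conserved columns can be found by checking each column of the alignment for a single shared character. A minimal sketch, applied to the four aligned sequences on this slide (the function name is made up; columns consisting only of gaps are not counted as conserved):

```python
def conserved_columns(rows):
    """0-based indices of columns where every row has the same residue."""
    return [i for i, col in enumerate(zip(*rows))
            if len(set(col)) == 1 and col[0] != "-"]


msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGSSSNIGS--ITVNWYQQLPG",
    "LRLSCTGSGFIFSS--YAMYWYQQAPG",
    "LSLTCTGSGTSFDD-QYYSTWYQQPPG",
]
cols = conserved_columns(msa)
print(cols)
print("".join(msa[0][i] for i in cols))  # the conserved residues
```

For these rows the conserved positions form the blocks CTGS, WYQQ and PG.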
![Page 67: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/67.jpg)
67
Consensus sequence
A T C T T G T
A A C T T G T
A A C T T C T
A A C T T G T
A consensus sequence holds the most frequent character of the alignment at each column
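Computing a consensus is a per-column majority vote. The sketch below uses the four sequences from this slide; ties, which do not occur here, would be resolved in favour of the character seen first in the column, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT", "AACTTGT"]))  # AACTTGT
```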
![Page 68: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/68.jpg)
68
Profile = PSSM = Position Specific Score Matrix
A T C T T G
A A C T T G
A A C T T C
    1     2     3     4     5     6
A   1     0.67  0     0     0     0
C   0     0     1     0     0     0.33
G   0     0     0     0     0     0.67
T   0     0.33  0     1     1     0
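The table above can be reproduced by counting, for each column, how often each character appears. This sketch computes plain frequencies, as on the slide; profiles used in practice usually convert such frequencies to log-odds scores and add pseudocounts, which is not shown here.

```python
def frequency_profile(rows, alphabet="ACGT"):
    """Per-column frequency of each character in an alignment."""
    n = len(rows)
    return {ch: [col.count(ch) / n for col in zip(*rows)] for ch in alphabet}


profile = frequency_profile(["ATCTTG", "AACTTG", "AACTTC"])
for ch, freqs in profile.items():
    print(ch, [round(f, 2) for f in freqs])
```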
![Page 69: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/69.jpg)
69
Alignment methods
No efficient exact solution is available for MSA, so all practical methods are heuristics:
• Progressive/hierarchical alignment (Clustal)
• Iterative alignment (MAFFT, MUSCLE)
![Page 70: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/70.jpg)
70
Progressive alignment
Sequences: A, B, C, D, E
First step:
Compute the pairwise alignments for all against all (10 pairwise alignments for 5 sequences).
The similarities are converted to distances and stored in a table
    A   B   C   D   E
A
B   8
C  15  17
D  16  14  10
E  32  31  31  32
![Page 71: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/71.jpg)
71
Second step:
Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
(Guide tree figure with leaves A, D, C, B, E)
    A   B   C   D   E
A
B   8
C  15  17
D  16  14  10
E  32  31  31  32
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
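How a distance table turns into a guide tree can be sketched with simple average-linkage clustering: repeatedly join the two closest clusters. This is an assumption made for illustration; the slides do not say which clustering method is used, and real programs use their own variants. The distances are those in the table above.

```python
from itertools import combinations


def leaves(tree):
    """Leaf names of a tree given as nested 2-tuples."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def average_distance(t1, t2, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def guide_tree(names, dist):
    """Join the two closest clusters until a single tree remains."""
    trees = list(names)
    while len(trees) > 1:
        t1, t2 = min(combinations(trees, 2),
                     key=lambda p: average_distance(p[0], p[1], dist))
        trees = [t for t in trees if t not in (t1, t2)]
        trees.append((t1, t2))
    return trees[0]


table = {("A", "B"): 8, ("A", "C"): 15, ("B", "C"): 17, ("A", "D"): 16,
         ("B", "D"): 14, ("C", "D"): 10, ("A", "E"): 32, ("B", "E"): 31,
         ("C", "E"): 31, ("D", "E"): 32}
dist = {frozenset(pair): d for pair, d in table.items()}
print(guide_tree("ABCDE", dist))
```

With these distances A joins B first, then C joins D, then the two pairs are joined, and E is added last; that join order is the order in which the progressive alignment would proceed.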
![Page 72: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/72.jpg)
72
Third step:
1. Align the most similar (neighboring) pairs
(Guide tree figure with leaves A, D, C, B, E; the aligned items are labeled "sequence")
![Page 73: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/73.jpg)
73
Third step:
2. Align pairs of pairs
(Guide tree figure with leaves A, D, C, B, E; labels: sequence, profile)
![Page 74: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/74.jpg)
74
Third step:
(Guide tree figure with leaves A, D, C, B, E; labels: sequence, profile)
Main disadvantages:
• Sub-optimal tree topology
• Misalignments resulting from globally aligning pairs of sequences.
![Page 75: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/75.jpg)
75
Iterative alignment
(Figure: for sequences A to E, a cycle through pairwise distance table, guide tree, and MSA)
Iterate until the MSA does not change (convergence)
![Page 76: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/76.jpg)
76
Case study: Using homology searching
• The human kinome
![Page 77: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/77.jpg)
77
Kinases and phosphatases
![Page 78: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/78.jpg)
78
Multi-tasking enzymes
• Signal transduction
• Metabolism
• Transcription
• Cell-cycle
• Differentiation
• Function of nervous and immune system
• … and more
![Page 79: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/79.jpg)
79
How many kinases in the human genome?
• 1950’s, discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase
• 1970’s, advent of cloning and sequencing produced a speculation that the vertebrate genome encodes as many as 1,001 kinases
![Page 80: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/80.jpg)
80
How many kinases in the human genome?
• 2001 – human genome sequence …
• As well – databases of GenBank, Swiss-Prot, and dbEST
• How can we find out how many kinases are out there?
![Page 81: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/81.jpg)
81
The human kinome
• In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:
1. Search and cross-reference all these databases for all kinases
2. Characterize all found kinases
![Page 82: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/82.jpg)
82
ePKs and aPKs
ePKs: Eukaryotic protein kinases (majority). Sequence homology of the catalytic domain; additional regulatory domains are non-homologous.
aPKs: Atypical protein kinases. No sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs.
![Page 83: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/83.jpg)
83
The search
• Several profiles were built, based on the catalytic domain of:
(a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain
(b) each subfamily of known aPKs
• HMM-profile searches and PSI-BLAST searches were performed
![Page 84: Homology and sequence alignment](https://reader036.vdocuments.mx/reader036/viewer/2022062322/56815041550346895dbe3ffc/html5/thumbnails/84.jpg)
84
The results…
• 478 ePKs
• 40 aPKs
• Total of 518 kinases in the human genome (half of the prediction in the 1970’s)
[1.7% of human genes]