Michael Schroeder
BioTechnological Center, TU Dresden
Biotec
Sequence comparison and Phylogeny
based on Chapter 4
Lesk, Introduction to Bioinformatics
By Michael Schroeder, Biotec 2
Contents
• Motivation
• Sequence comparison and alignments
• Dot plots
• Dynamic programming
• Substitution matrices
• Dynamic programming: local and global alignments and gaps
• BLAST
• Significance of alignments
• Multiple sequence alignments
• Phylogenetic trees
Motivation
• Where do we come from? Recent Africa vs. multi-regional hypothesis
• In 1999, encephalitis caused by the West Nile virus broke out in New York. How did the virus come to New York?
• How did the nucleus get into eukaryotic cells?
• To answer such questions we need sequence comparison and phylogenetic trees
Sequence
SNLIPSPPTYNSAHDYISWESFSNVSYYTRILP
SHLIDSPPTYNADYGYKSWEAFSNLSYYTRALP
*.** ******: : * ***:***:***** **

MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVN
--------------MLARALLLCAVLALSHTAN
              :* :* * . .*
Sequence Similarity Searches
• Sequence similarity can be due to a common evolutionary ancestor, e.g. globin genes in chimpanzees and humans
• … or to a common function, e.g. the v-sis oncogene in simian sarcoma virus, which leads to cancer in monkeys, and the seemingly unrelated growth-stimulating protein PDGF, which stimulates cell growth (first success of the similarity idea, 1983)
• In general: if an unknown sequence is found, deduce its function/structure indirectly by finding similar sequences whose function/structure is known
• Assumption: evolution changes sequences "slowly", often maintaining the main features of a sequence's function/structure
Sequence alignment
• Substitutions, insertions and deletions can be interpreted in evolutionary terms
• But: distinguish chance similarity from a real biological relationship
[Figure: tree of mutated sequences; node labels CCGTAA, CCGTAT, TCGTAG, TAGTAC, TCGTAC, TCGTAA, TTGTAA]
Evolution
• Convergent evolution: the same sequence evolves from different ancestors
• Back mutation: a sequence mutates back to a previous sequence
[Figure: tree of mutated sequences; node labels CCGTAA, CCGTAT, TCGTAG, TAGTAC, TCGTAC, TCGTAA, TAGTAC, CCGTAA, TAGTAA]
Similarity vs. Homology
• Any two sequences can be similar
• Sequences are homologous if they evolved from a common ancestor
• Homologous sequences:
  – Orthologs (separated by speciation): similar biological function
  – Paralogs: different biological function (after gene duplication), e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein
• Assumption: similarity is an indicator of homology
• Note: the altered function of the expressed protein determines whether the organism survives to reproduce and hence passes on the altered gene
Sequence alignments
Given two or more sequences, we wish to
• Measure their similarity
• Determine the residue-residue correspondences
• Observe patterns of conservation and variability
• Infer evolutionary relationships
What is the best alignment?
Uninformative:
  -------gctgaacg
  ctataatc-------
Without gaps:
  gctgaacg
  ctataatc
With gaps:
  gctga-a--cg
  --ct-ataatc
Another one:
  gctg-aa-cg
  -ctataatc-
• Formally: the best alignments have only a minimal number of mismatches (insertions, deletions, replacements)
• We need a method to systematically explore and compute alignments
Scores for an alignment
• Sequence identity: percentage of matches
• Score each match, mismatch, gap opening and gap extension, e.g.
    a t t g
    a c - -
  (column 1: match, column 2: mismatch, column 3: gap opening, column 4: gap extension)
• Example: match +1, mismatch -1, gap opening -3, gap extension -1
Uninformative: 0%, score = -19
  -------gctgaacg
  ctataatc-------
Without gaps: 25%, score = -4
  gctgaacg
  ctataatc
With gaps: 0%, score = -19
  gctga-a--cg
  --ct-ataatc
Another one: 50%, score = -8
  gctg-aa-cg
  -ctataatc-
Scores for an alignment
• The same alignments under a different scheme: match +2, mismatch -1, gap opening -1, gap extension -1
Uninformative: 0%, score = -15
  -------gctgaacg
  ctataatc-------
Without gaps: 25%, score = -2
  gctgaacg
  ctataatc
With gaps: 0%, score = -11
  gctga-a--cg
  --ct-ataatc
Another one: 50%, score = 5
  gctg-aa-cg
  -ctataatc-
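The scores on these two slides can be reproduced with a few lines of Python. This is only a sketch: the function name `score_alignment` is made up for illustration, penalties are passed as negative numbers, and it assumes that the first column of a gap pays the opening penalty and every further column of the same gap pays the extension penalty (the reading under which the slide's numbers come out).

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_extend):
    """Return (percent identity, score) for two gapped strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    matches = 0
    previous_gap_row = None  # row that carried a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Assumption: opening penalty for the first gap column, extension afterwards.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            previous_gap_row = None
            if a == b:
                matches += 1
                score += match
            else:
                score += mismatch
    return 100 * matches / len(top), score


# The four alignments of gctgaacg and ctataatc from the slides.
alignments = [
    ("-------gctgaacg", "ctataatc-------"),
    ("gctgaacg", "ctataatc"),
    ("gctga-a--cg", "--ct-ataatc"),
    ("gctg-aa-cg", "-ctataatc-"),
]
for top, bottom in alignments:
    print(score_alignment(top, bottom, 1, -1, -3, -1),
          score_alignment(top, bottom, 2, -1, -1, -1))
```

Changing the scheme changes which alignment wins, which is why the choice of penalties matters.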
Dot plots
• A convenient way of comparing two sequences visually
• Use a matrix: put one sequence on the x-axis and one on the y-axis
• Cells with identical characters are filled with a '1', cells with non-identical characters with a '0' (simplest scheme; weights could be used instead)
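A minimal sketch of this simplest scheme in Python; the names `dot_plot` and `render` are illustrative only and not taken from the slides.

```python
def dot_plot(seq_x, seq_y):
    """Matrix with one row per character of seq_y and one column per character of seq_x."""
    return [[1 if a == b else 0 for a in seq_x] for b in seq_y]


def render(matrix, seq_x, seq_y):
    """Text rendering: '*' for a 1, blank for a 0."""
    lines = ["  " + seq_x]
    for label, row in zip(seq_y, matrix):
        lines.append(label + " " + "".join("*" if cell else " " for cell in row).rstrip())
    return "\n".join(lines)


matrix = dot_plot("DOROTHYCROWFOOTHODGKIN", "DOROTHYHODGKIN")
print(render(matrix, "DOROTHYCROWFOOTHODGKIN", "DOROTHYHODGKIN"))
```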
[Figure: dot plot of DOROTHYCROWFOOTHODGKIN against DOROTHYHODGKIN, shown first as an empty grid and then with the matching letters marked]
Interpreting dot plots
• What do identical sequences look like?
• What do unrelated sequences look like?
• What do distantly related sequences look like?
• What does a reversed sequence look like? Relevant for the detection of stems in RNA structures
• What does a palindrome look like? Relevant for restriction enzymes
• What do repeats look like?
• What do a protein with domains A and B and another one with domains B and C look like?
Dot plot for identical sequences
[Figure: dot plot of DOROTHYHODGKIN against itself]
Dotplot for unrelated sequences
[Figure: dot plot of DOROTHYHODGKIN against OTTODIETER]
Dotplot for distantly related sequences
[Figure: dot plot of DOROTHYHODGKIN against TIMOTHYJENKIN]
Dotplot for reverse sequences
• Relevant for identifying stems in RNA structures
• Plot the sequence against its reverse complement
[Figure: dot plot of DOROTHYHODGKIN against its reverse, NIKGDOHYHTOROD]
Palindromes and restriction enzymes
• Madam, I'm Adam
• Able was I ere I saw Elba (supposedly said by Napoleon)
• Doc, note: I dissent. A fast never prevents a fatness. I diet on cod.
• Because DNA is double-stranded and the strands run antiparallel, a palindrome is defined as a double-stranded DNA sequence that reads the same 5' to 3' on both strands
• The HindIII cutting site:
  – 5'-AAGCTT-3'
  – 3'-TTCGAA-5'
• The EcoRI cutting site:
  – 5'-GAATTC-3'
  – 3'-CTTAAG-5'
Dotplot of a Palindrome
[Figure: dot plot of MADAM against itself]
Dotplot of repeats
[Figure: dot plot of TWENTYONETWENTYTWO against TWENTYTWOTWENTYONE]
Dotplot of Repeats/Palindrome
[Figure: dot plot of MADAMIMADAM against itself]
Dotplot for shared domain
[Figure: dot plot of DOROTHYHODGKIN against DOROTHYMILLER]
Result: dot plot
  dorothycrowfoothodgkin
d *                *
o  * *     *  **  *
r   *     *
o  * *     *  **  *
t     *         *
h      *         *
y       *
h      *         *
o  * *     *  **  *
d *                *
g                   *
k                    *
i                     *
n                      *
Dotplots
• Window size 15
• A dot is drawn if there are at least 6 matches in the window
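A windowed dot plot can be sketched as follows. The function name `windowed_dot_plot` is hypothetical, and the sketch assumes that the window runs along the diagonal starting at each cell and that "6 matches" means "at least 6".

```python
def windowed_dot_plot(seq_x, seq_y, window=15, min_matches=6):
    """Return the set of (row, column) cells that receive a dot.

    A dot is placed at (i, j) if at least `min_matches` of the `window`
    diagonal positions starting at seq_y[i] / seq_x[j] are identical.
    """
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            hits = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if hits >= min_matches:
                dots.add((i, j))
    return dots


# With window=1 and min_matches=1 this reduces to the simple dot plot.
print(sorted(windowed_dot_plot("MADAM", "MADAM", window=1, min_matches=1)))
```

Larger windows with a threshold suppress isolated chance matches, so that only diagonals supported by several matches remain visible.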
Window size 15; a dot is drawn if there are at least 6 matches in the window
[Figure: dot plot of papain and caricain, two proteases from papaya]
>gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64
Length = 216
Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)

Query: 1   IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60
           +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS
Sbjct: 1   LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60

Query: 61  YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
           +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G
Sbjct: 61  HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120

Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176
           LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS
Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180

Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212
           WGT WGE GYIRIKR GNS GVCGLY SS+YP KN
Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216

  1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs
 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns
181 wgtawgekgy irikrapgns pgvcglykss yyptkn
Window size 15; a dot is drawn if there are at least 6 matches in the window
[Figure: dot plot of papain and cruzain, a protease from Trypanosoma cruzi]
>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine-Fluoromethylketone
Length = 215
Score = 121 bits (303), Expect = 3e-28
Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%)

Query: 2   PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61
           P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+
Sbjct: 2   PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61

Query: 62  GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115
           GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q
Sbjct: 62  GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121

Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171
           Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y
Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178

Query: 172 LIKNSWGTGWGENGYIRIKRGT 193
           +IKNSW T WGE GYIRI +G+
Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200
Window size 15; a dot is drawn if there are at least 6 matches in the window
[Figure: dot plot of papain and cathepsin, a human protease]
>gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X
Length = 242
Score = 52.0 bits (123), Expect = 2e-07
Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%)

Query: 1   IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51
           +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q
Sbjct: 1   LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60

Query: 52  ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103
           ++DC C GG S Q+GI Y+ + C + K +
Sbjct: 61  NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119

Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157
           A + + +V Y + + AN P+S + A + Y GGI+ +
Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178

Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199
           +H V+ G+ G Y +++NSWG WGE G++RI K G G Y +
Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229
Window size 5; a dot is drawn if there are at least 2 matches in the window
[Figure: dot plot of papain and cathepsin, a human protease]
Window size 1; a dot is drawn if there is 1 match in the window
[Figure: dot plot of papain and cathepsin, a human protease]
Dynamic programming
From Dotplots to Alignments
Obvious best alignment:
  DOROTHYCROWFOOTHODGKIN
  DOROTHY--------HODGKIN
[Figure: dot plot of DOROTHYCROWFOOTHODGKIN against DOROTHYHODGKIN]
From Dotplots to Alignments
• Find the "best" path from the top left corner to the bottom right
• Moving "east" corresponds to a "-" in the second sequence
• Moving "south" corresponds to a "-" in the first sequence
• Moving "southeast" corresponds to a match (if the characters are the same) or a mismatch (otherwise)
• Can we automate this?
From Dotplots to Alignments
Algorithm (Dynamic Programming):
• Insert a row 0 and a column 0 initialised with 0
• Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
• Consider
  – The value of the cell north
  – The value of the cell west
  – The value of the cell northwest if the row/column characters mismatch
  – 1 + the value of the cell northwest if the row/column characters match
• Put down the maximum of these values as the value for the current cell
• Trace back the path with the highest values from the bottom right to the top left and output the alignment
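The steps above can be sketched in Python as follows. The function names are illustrative; the traceback returns one of possibly several optimal alignments, so it need not coincide with the alignment shown on any particular slide.

```python
def lcs_table(rows, cols):
    """Fill the table: row 0 and column 0 are 0, every other cell is the
    maximum of north, west and northwest (+1 if the characters match)."""
    n, m = len(rows), len(cols)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            bonus = 1 if rows[i - 1] == cols[j - 1] else 0
            s[i][j] = max(s[i - 1][j], s[i][j - 1], s[i - 1][j - 1] + bonus)
    return s


def traceback(s, rows, cols):
    """Walk from the bottom right to the top left and return (column line, row line)."""
    i, j = len(rows), len(cols)
    col_line, row_line = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and rows[i - 1] == cols[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            col_line.append(cols[j - 1]); row_line.append(rows[i - 1])  # northwest: match
            i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j]:
            col_line.append("-"); row_line.append(rows[i - 1])          # north: gap in column sequence
            i -= 1
        else:
            col_line.append(cols[j - 1]); row_line.append("-")          # west: gap in row sequence
            j -= 1
    return "".join(reversed(col_line)), "".join(reversed(row_line))


table = lcs_table("ATCTGAT", "TGCATA")
print(table[-1][-1])
print(*traceback(table, "ATCTGAT", "TGCATA"), sep="\n")
```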
From Dotplots to Alignments
Example: ATCTGAT (rows) against TGCATA (columns)
• Insert a row 0 and a column 0 initialised with 0

           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

• Consider
  – Value north
  – Value west
  – Value northwest if the row/column characters mismatch
  – 1 + value northwest if the row/column characters match
• Put down the maximum of these values for the current cell

Row 1 filled in:
1 A   0  0  0  0  1  1  1

Complete table:
           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4
Reading the Alignment
           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4

One alignment:
  -tgcat-a-
  at-c-tgat

Reading the Alignment: there is more than one possibility
  ---tgcata
  atctg-at-
Formally: Longest Common Subsequence (LCS)
• What is the length s(V,W) of the longest common subsequence of two sequences V = v_1 … v_n and W = w_1 … w_m?
• Find sequences of indices 1 ≤ i_1 < … < i_k ≤ n and 1 ≤ j_1 < … < j_k ≤ m such that v_{i_t} = w_{j_t} for 1 ≤ t ≤ k
• How? Dynamic programming: s_{i,0} = s_{0,j} = 0 for all 0 ≤ i ≤ n and 0 ≤ j ≤ m, and
\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} \]
• Then s(V,W) = s_{n,m} is the length of the LCS
Example LCS
Initialisation: s_{i,0} = s_{0,j} = 0 for all 0 ≤ i ≤ n and 0 ≤ j ≤ m

           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

Computing each cell:
\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} \]

After row 1:
1 A   0  0  0  0  1  1  1

Complete table:
           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4
LCS Algorithm
LCS(V,W)
  For i = 0 to n
    s_{i,0} = 0
  For j = 0 to m
    s_{0,j} = 0
  For i = 1 to n
    For j = 1 to m
      If v_i = w_j and s_{i-1,j-1} + 1 ≥ s_{i-1,j} and s_{i-1,j-1} + 1 ≥ s_{i,j-1} Then
        s_{i,j} = s_{i-1,j-1} + 1
        b_{i,j} = North West
      Else if s_{i-1,j} ≥ s_{i,j-1} Then
        s_{i,j} = s_{i-1,j}
        b_{i,j} = North
      Else
        s_{i,j} = s_{i,j-1}
        b_{i,j} = West
  Return s and b

Complexity: LCS has quadratic complexity, O(n m)
Printing the alignment of LCS
PRINT-LCS(b,V,i,j)
  If i = 0 or j = 0 Then Return
  If b_{i,j} = North West Then
    PRINT-LCS(b,V,i-1,j-1)
    Print v_i
  Else if b_{i,j} = North Then
    PRINT-LCS(b,V,i-1,j)
  Else
    PRINT-LCS(b,V,i,j-1)
Rewards/Penalties
• We can use different schemes:
  – -1 for insert/delete/mismatch
  – +1 for match
• … Consider
  – -1 + the value of the cell north
  – -1 + the value of the cell west
  – -1 + the value of the cell northwest if the row/column characters mismatch
  – +1 + the value of the cell northwest if the row/column characters match
• Put down the maximum of these values as the value for the current cell
Reading the Alignment
           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0 -1 -1 -1  1  0  1
2 T   0  1  0 -1  0  2  1
3 C   0  0  0  1  0  1  1
4 T   0  1  0  0  0  1  0
5 G   0  0  2  1  0  0  0
6 A   0 -1  1  1  2  1  1
7 T   0  1  0  0  1  3  2

  ---tgcata
  atctg-at-
Rewards/Penalties
Let's refine the schemes:
• Transition mutations are more common: purine<->purine (a<->g), pyrimidine<->pyrimidine (t<->c)
• Transversions (purine<->pyrimidine) are less common
• Use a substitution matrix to rate mismatches:
  – -2 for insert/delete
  – Mismatch/match according to the substitution matrix
• … Consider
  – -2 + the value of the cell north
  – -2 + the value of the cell west
  – The corresponding value of the substitution matrix + the value of the cell northwest
• Put down the maximum of these values as the value for the current cell
Substitution matrix:
      A   T   G   C
A     2  -2   0  -2
T    -2   2  -2   0
G     0  -2   2  -2
C    -2   0  -2   2
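Both reward schemes fit one small routine that takes a pair-scoring function and a gap penalty. The sketch below is illustrative only: `fill_table`, `plus_minus_one` and `transition_aware` are made-up names, and row 0 and column 0 are kept at 0 as in the slides' tables.

```python
def fill_table(rows, cols, pair_score, gap):
    """Table with row 0 / column 0 set to 0; each other cell is the maximum of
    north - gap, west - gap and northwest + pair_score(row char, column char)."""
    n, m = len(rows), len(cols)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j] - gap,
                s[i][j - 1] - gap,
                s[i - 1][j - 1] + pair_score(rows[i - 1], cols[j - 1]),
            )
    return s


def plus_minus_one(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


PURINES = {"A", "G"}


def transition_aware(a, b):
    """2 for a match, 0 for a transition, -2 for a transversion."""
    if a == b:
        return 2
    return 0 if (a in PURINES) == (b in PURINES) else -2


for row in fill_table("ATCTGAT", "TGCATA", transition_aware, gap=2):
    print(row)
```

Passing the scoring function as a parameter keeps the table-filling loop unchanged when the substitution matrix is swapped.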
Reading the Alignment
           T  G  C  A  T  A
      0  1  2  3  4  5  6
0     0  0  0  0  0  0  0
1 A   0 -2  0 -2  2  0  2
2 T   0  2  0  0  0  4  2
3 C   0  0  0  2  0  2  2
4 T   0  2  0  0  0  2  0
5 G   0  0  4  2  0  0  2
6 A   0 -2  2  2  4  2  2
7 T   0  2  0  2  2  6  4

  ---tgcata
  atctg-at-
Substitution matrices
How to derive a substitution matrix for amino acids?
Amino acids can be classified by physicochemical properties
[Figure: diagram grouping the amino acids into hydrophobic, polar, acidic (D, E), basic (K, R) and aromatic residues]
PAM 250 matrix
Cys 12
Ser 0 2
Thr -2 1 3
Pro -3 1 0 6
Ala -2 1 1 1 2
Gly -3 1 0 -1 1 5
Asn -4 1 0 -1 0 0 2
Asp -5 0 0 -1 0 1 2 4
Glu -5 0 0 -1 0 0 1 3 4
Gln -5 -1 -1 0 0 -1 1 2 2 4
His -3 -1 -1 0 -1 -2 2 1 1 3 6
Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
Lys -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
Met -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
Ile -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
Leu -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
Val -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
Phe -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
    C S T P A G N D E Q H R K M I L V F Y W

> 0: likely mutation; 0: random mutation; < 0: unlikely mutation
Cys 12 0 -2 -3 -2 -3 -4 -5 -5 -5 -3 -4 -5 -5 -2 -6 -2 -4 0 -8
Ser 0 2 1 1 1 1 1 0 0 -1 -1 0 0 -2 -1 -3 -1 -3 -3 -2
Thr -2 1 3 0 1 0 0 0 0 -1 -1 -1 0 -1 0 -2 0 -3 -3 -5
Pro -3 1 0 6 1 -1 -1 -1 -1 0 0 0 -1 -2 -2 -3 -1 -5 -5 -6
Ala -2 1 1 1 2 1 0 0 0 0 -1 -2 -1 -1 -1 -2 0 -4 -3 -6
Gly -3 1 0 -1 1 5 0 1 0 -1 -2 -3 -2 -3 -3 -4 -1 -5 -5 -7
Asn -4 1 0 -1 0 0 2 2 1 1 2 0 1 -2 -2 -3 -2 -4 -2 -4
Asp -5 0 0 -1 0 1 2 4 3 2 1 -1 0 -3 -2 -4 -2 -6 -4 -7
Glu -5 0 0 -1 0 0 1 3 4 2 1 -1 0 -2 -2 -3 -2 -5 -4 -7
Gln -5 -1 -1 0 0 -1 1 2 2 4 3 1 1 -1 -2 -2 -2 -5 -4 -5
His -3 -1 -1 0 -1 -2 2 1 1 3 6 2 0 -2 -2 -2 -2 -2 0 -3
Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 3 0 -2 -3 -2 -4 -4 2
Lys -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 0 -2 -3 -2 -5 -4 -3
Met -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 2 4 2 0 -2 -4
Ile -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 2 4 1 -1 -5
Leu -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 2 2 -1 -2
Val -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 -1 -2 -6
Phe -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 7 0
Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 0
Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
C S T P A G N D E Q H R K M I L V F Y W
Average -2.8 -0.5 -0.7 -1.2 -0.9 -1.6 -0.7 -1.1 -1.1 -0.8 -0.3 -0.7 -0.9 -0.8 -0.8 -1.4 -0.8 -1.9 -1.5 -3.1
StDev 4 1.5 1.7 2.6 1.9 2.7 2.1 3 2.8 2.6 2.3 2.6 2.5 2.6 2.4 3 2.3 4.1 3.8 5.4
PAM 250: Interpretation
• Immutable:
  – Cysteine (avg = -2.8): known to have several unique, indispensable functions
    · attachment site of the heme group in cytochrome and of iron-sulphur (FeS) in ferredoxins
    · cross-links in proteins such as chymotrypsin or ribonuclease
    · seldom without a unique function
  – Glycine (avg = -1.6): small size may be advantageous
• Mutable:
  – Serine often functions in the active site, but can be easily replaced
• Self-alignment:
  – Tryptophan with itself scores very high, as W occurs rarely
Point Accepted Mutations PAM
• Substitution matrix using an explicit evolutionary model of how amino acids change over time
• Use the parsimony method to determine the frequency of mutations
• Entry in the PAM matrix: likelihood ratio for residues a and b: probability that a-b is a mutation / probability that a-b is chance
• PAM x: two sequences V, W have an evolutionary distance of x PAM if a series of accepted point mutations (and no insertions/deletions) converts V into W, averaging x point mutations per 100 residues
• Mutations here = mutations in the DNA
• Because of silent mutations and back mutations, x can be > 100
• PAM 250 is the most commonly used
PAM and Sequence Similarity
PAM          0   30   80  110  200  250
% identity 100   75   60   50   25   20
PAM
• Dayhoff, Eck, Park: A model of evolutionary change in proteins, 1978
• Accepted point mutation = substitution of an amino acid accepted by natural selection
• Assumption: X replacing Y is as likely as Y replacing X
• Used cytochrome c, hemoglobin, myoglobin, virus coat proteins, chymotrypsinogen, glyceraldehyde 3-phosphate dehydrogenase, clupeine, insulin, ferredoxin
• Sequences which are too distantly related have been omitted, as they are more likely to contain multiple mutations per site
PAM: Step 1
Step 1: Construct a multiple alignment
Example:
  ACGCTAFKI
  GCGCTAFKI
  ACGCTAFKL
  GCGCTGFKI
  GCGCTLFKI
  ASGCTAFKL
  ACACTAFKL
PAM: Step 2
Create a phylogenetic tree (parsimony method)

                      ACGCTAFKI
              A->G /             \ I->L
          GCGCTAFKI               ACGCTAFKL
     A->G /     \ A->L       C->S /     \ G->A
  GCGCTGFKI   GCGCTLFKI   ASGCTAFKL   ACACTAFKL
PAM: Step 3
Note the following variables:
• Residue frequency r_i is the number of occurrences of amino acid i in the sequences, e.g. r_A = 10 and r_G = 10
• Number of residues r is the overall number of amino acids in all sequences, r = Σ_i r_i, e.g. r = 63
• Probability of residue i is p_i = r_i / r
• Probability of amino acid A is p_A = 10/63 = 0.16
PAM: Step 3
• Substitution frequency: s_{i,j} is the number of substitutions involving amino acids i and j (i.e. the number of i→j and j→i), e.g. s_{A,G} = 3
• Substitutability: s_i is the number of substitutions in the tree involving amino acid i, s_i = Σ_{j≠i} s_{i,j}, e.g. s_A = s_{A,G} + s_{A,L} + s_{A,C} + s_{A,S} + … = 3 + 1 + 0 + 0 + … = 4
• Number of substitutions: s = Σ_i s_i; s is twice the number of substitutions, here s = 12
PAM: Step 4
• Relative mutability m_i: probability that residue i will mutate
• Relative mutability depends on
  – s_i: m_i should increase with increasing s_i, since i taking part in many mutations indicates high mutability
  – p_i: m_i should decrease with increasing p_i, since a frequent residue i takes part in many mutations simply because it occurs often
PAM: Step 4
Relative mutability
• Probability that an arbitrary mutation contains i is s_i / (0.5 s)
• Probability that a mutation is from i is 0.5 s_i / (0.5 s) = s_i / s
• Among 100 residues, there are 100 p_i occurrences of i
• Hence, relative mutability is defined as m_i = 1/(100 p_i) × s_i / s
• Example: m_A = 1/(100 × (10/63)) × 4/12 = 0.021
PAM: Step 5
Compute the mutation probability
• Probability that i mutates is m_i
• Probability that i mutates into j (given that it mutates) is s_{i,j} / s_i
• M_{i,j} = m_i × s_{i,j} / s_i, where i ≠ j
• M_{i,i} = 1 - m_i (probability that i does not mutate)
PAM: Step 6
Finally, the entry in the PAM matrix:
\[ R_{i,j} = \log \left( \frac{M_{i,j}}{p_j} \right) \]
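Steps 3 to 6 can be traced on the toy example with a short script. This is a sketch only: all function names are made up, the substitution list is read off the example tree by hand, and the base-10 logarithm is an assumption (the slide just says "log").

```python
from collections import Counter
from math import log10

sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
# Substitutions on the edges of the example tree.
substitutions = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]

residue_counts = Counter("".join(sequences))              # r_i
r = sum(residue_counts.values())                          # r
p = {a: count / r for a, count in residue_counts.items()} # p_i

pair_counts = Counter(frozenset(pair) for pair in substitutions)


def s_pair(i, j):
    """s_{i,j}: substitutions involving i and j, in either direction."""
    return pair_counts[frozenset((i, j))]


def s_single(i):
    """s_i: substitutions involving i."""
    return sum(count for pair, count in pair_counts.items() if i in pair)


s_total = sum(s_single(a) for a in residue_counts)        # s, twice the number of substitutions


def mutability(i):
    """m_i = 1/(100 p_i) * s_i / s."""
    return s_single(i) / (100 * p[i] * s_total)


def mutation_probability(i, j):
    """M_{i,j}; residues that never mutate in the tree get probability 0."""
    if i == j:
        return 1 - mutability(i)
    if s_single(i) == 0:
        return 0.0
    return mutability(i) * s_pair(i, j) / s_single(i)


def log_odds(i, j):
    """R_{i,j} = log(M_{i,j} / p_j); only defined where M_{i,j} > 0."""
    return log10(mutation_probability(i, j) / p[j])


print(round(mutability("A"), 3), round(log_odds("A", "G"), 2))
```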
BLOSUM
• Different approach from PAM
• BLOcks SUbstitution Matrix (based on the BLOCKS database)
• Generation of BLOSUM x:
  – Group highly similar sequences and replace them by a representative sequence
  – Only consider sequences with no more than x % similarity
  – Align the sequences (no gaps)
  – For any pair of amino acids a, b and for all columns c of the alignment, let q(a,b) be the relative frequency of co-occurrences of a and b in all columns c
  – Let p(a) be the overall probability of a occurring
  – The BLOSUM entry for a, b is log_2( q(a,b) / (p(a) p(b)) )
• BLOSUM 50 and BLOSUM 62 are widely used
LCS Algorithm (Longest Common Subsequence) Revisited
Algorithm (Dynamic Programming) with Substitution Matrix:
• Insert a row 0 and a column 0 initialised with 0
• Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
• Consider
  – The value of the cell north
  – The value of the cell west
  – s + the value of the cell northwest, where s is the value in the substitution matrix for the residues in the row/column
• Put down the maximum of these values as the value for the current cell
• Trace back the path with the highest values from the bottom right to the top left and output the alignment
LCS Revisited: Formally
• What is the score s(V,W) of the best alignment of two sequences V = v_1 … v_n and W = w_1 … w_m?
• Find sequences of indices 1 ≤ i_1 < … < i_k ≤ n and 1 ≤ j_1 < … < j_k ≤ m that pair v_{i_t} with w_{j_t} for 1 ≤ t ≤ k
• How? Dynamic programming: s_{i,0} = s_{0,j} = 0 for all 0 ≤ i ≤ n and 0 ≤ j ≤ m, and
\[ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i \text{ and } w_j \text{ in the substitution matrix} \end{cases} \]
• Then s(V,W) = s_{n,m} is the score of the best alignment (the length of the LCS if t is 1 for matches and 0 otherwise)
Dynamic programming revisited: local and global alignments and gaps
Evolution and Alignments
Alignments can be interpreted in evolutionary terms:
• Identical letters are aligned. Interpretation: part of the same ancestral sequence and not changed
• Non-identical letters are aligned (substitution). Interpretation: mutation
• Gaps. Interpretation: insertions and deletions (indels)
Evolution and Alignments
Specific problems when aligning DNA:
• Frame shift:
  – DNA triplets code for amino acids
  – An indel of one nucleotide shifts the whole sequence of triplets
  – Thus it may have a global effect and change all encoded amino acids
• Silent mutation:
  – A substitution in the DNA leaves the encoded amino acid unchanged
• Nonsense mutation:
  – Substitution to a stop codon
Local and Global Alignments
• The global alignment (Needleman-Wunsch) algorithm finds the overall best alignment
  – Example: members of a protein family, e.g. globins, are very conserved and have the same length in different organisms from fruit fly to humans
• The local alignment (Smith-Waterman) algorithm finds the locally best alignment; it is the most widely used, as
  – e.g. genes from different organisms retain similar exons, but may have different introns
  – e.g. the homeobox gene, which regulates embryonic development, occurs in many species, but is very different apart from one region called the homeodomain
  – e.g. proteins share some domains, but not all
Local Alignment
• LCS s(V,W) computes the globally best alignment
• Often it is better to maximise locally, i.e. to compute the maximal s(v_i … v_i', w_j … w_j') over all substrings of V and W
• Can we adapt the algorithm?
• Global alignment = longest path in matrix s from (0,0) to (n,m)
• Local alignment = longest path in matrix s from any (i,j) to any (i',j')
• Modify the definition of s by adding an edge of weight 0 from the source to every other vertex, creating a free "jump" to any starting position (i,j)
Local Alignment
Modify the definition of s as follows: s_{i,0} = s_{0,j} = 0 for all 0 ≤ i ≤ n and 0 ≤ j ≤ m, and
\[ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i, w_j \text{ in the substitution matrix} \end{cases} \]
• Then s(V,W) = max { s_{i,j} } is the score of the best local alignment
• This computes the longest path in the edit graph
• Several local alignments may have biological significance (consider e.g. two multi-domain proteins whose domains are re-ordered)
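A minimal sketch of the local recurrence in Python. The function name is illustrative, and the `gap` parameter is an addition for convenience: `gap=0` corresponds to the recurrence exactly as written on the slide, where moving north or west costs nothing.

```python
def local_alignment_score(v, w, pair_score, gap=0):
    """Return (best score, (i, j)), where (i, j) is the first cell holding the maximum.

    Every cell is the maximum of 0, north - gap, west - gap and
    northwest + pair_score; the 0 is the free "jump" to any start position.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,
                s[i - 1][j] - gap,
                s[i][j - 1] - gap,
                s[i - 1][j - 1] + pair_score(v[i - 1], w[j - 1]),
            )
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    return best, best_cell


def simple(a, b):
    """+1 for a match, -1 for a mismatch (an arbitrary example scheme)."""
    return 1 if a == b else -1


print(local_alignment_score("TTTACGTTT", "GGGACGGGG", simple, gap=1))
```

Tracing back from the best cell until a 0 is reached would recover the local alignment itself.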
Aligning with Gap Penalties
• A gap is a sequence of spaces in an alignment
• So far, we considered only insertions and deletions of single nucleotides or amino acids, creating alignments with many gaps
• So far, the score of a gap of length l is l
• Because insertion/deletion of monomers is an evolutionarily slow process, large numbers of gaps do not make sense
• Instead, whole substrings will be deleted or inserted
• We can generalise the score of a gap to a score function A + B l, where A is the penalty to open the gap and B is the penalty to extend the gap
Aligning with Gap Penalties
• High gap penalties result in shorter, lower-scoring alignments with fewer gaps
• Lower gap penalties give higher-scoring, longer alignments with more gaps
• The gap opening penalty A mainly influences the number of gaps
• The gap extension penalty B mainly influences the length of gaps
• E.g. if interested in close relationships, choose A and B above the default values; for distant relationships, decrease the default values
Aligning with Gap Penalties
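Adapt the definition of s as follows:
\[ s^{del}_{i,j} = \max \begin{cases} s^{del}_{i-1,j} - B \\ s_{i-1,j} - (A + B) \end{cases} \]
\[ s^{ins}_{i,j} = \max \begin{cases} s^{ins}_{i,j-1} - B \\ s_{i,j-1} - (A + B) \end{cases} \]
\[ s_{i,j} = \max \begin{cases} 0 \\ s^{del}_{i,j} \\ s^{ins}_{i,j} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i, w_j \text{ in the substitution matrix} \end{cases} \]
Then s(V,W) = max { s_{i,j} } is the score of the best local alignment with gap penalties A and B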
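The three recurrences translate almost line by line into Python. This is a sketch under stated assumptions: the function name is made up, the gap tables start at minus infinity so that a gap can only begin from a scored cell, and the example scoring (match +2, mismatch -3) is arbitrary.

```python
NEG_INF = float("-inf")


def local_affine_score(v, w, pair_score, gap_open, gap_extend):
    """Best local alignment score where a gap of length l costs gap_open + gap_extend * l."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    s_del = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap reached from the north
    s_ins = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap reached from the west
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s_del[i][j] = max(s_del[i - 1][j] - gap_extend,
                              s[i - 1][j] - (gap_open + gap_extend))
            s_ins[i][j] = max(s_ins[i][j - 1] - gap_extend,
                              s[i][j - 1] - (gap_open + gap_extend))
            s[i][j] = max(0, s_del[i][j], s_ins[i][j],
                          s[i - 1][j - 1] + pair_score(v[i - 1], w[j - 1]))
            best = max(best, s[i][j])
    return best


def example_score(a, b):
    """Arbitrary example scheme: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


# ACGT--ACGT against ACGTTTACGT: 8 matches (16) minus one gap of length 2 (2 + 1*2 = 4).
print(local_affine_score("ACGTACGT", "ACGTTTACGT", example_score, gap_open=2, gap_extend=1))
```

With a much larger opening penalty the gapped alignment no longer pays off and the best local alignment becomes the ungapped stretch TACGT.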
FASTA and BLAST
Motivation
• As in dot plots, the underlying data structure for dynamic programming is a table
• Given two sequences of length n, dynamic programming takes time proportional to n^2
• Given a database with m sequences, comparing a query sequence to the whole database takes time proportional to m n^2
• What does this mean?
  – Imagine you need to fill in the tables by hand and it takes 10 seconds to fill in one cell
  – Assume there are 1,000,000 sequences, each 100 amino acids long
  – How long does it take?
• 1,000,000 × 100 × 100 × 10 sec = 10^11 sec = 27,777,778 h = 1,157,407 days ≈ 3170 years
• Even if a computer does not take 10 sec, but just 0.1 ms to fill in one cell, it would still take 12 days
• We cannot do anything about the database size, but can we do something about the table size?
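The arithmetic can be checked in a few lines; the variable names are illustrative and a year is taken as 365 days.

```python
sequences = 1_000_000      # database size
length = 100               # residues per sequence (query and database entries alike)
cells = sequences * length * length          # table cells to fill in: 10^10

seconds_by_hand = cells * 10                 # 10 s per cell
hours = seconds_by_hand / 3600
days = hours / 24
years = days / 365

computer_days = cells * 0.1e-3 / 86400       # 0.1 ms per cell

print(f"{seconds_by_hand:.0e} s = {hours:,.0f} h = {days:,.0f} days = {years:,.0f} years")
print(f"at 0.1 ms per cell: {computer_days:.1f} days")
```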
An idea: Prune the search space
Another idea
• Did we formulate the problem correctly?
• Do we need the alignments for all sequences in the database?
• No, only for "reasonable" hits: introduce a threshold
• A "reasonable" alignment will contain short stretches of perfect matches
• Find these first, then extend and connect them as well as possible
FASTA and BLAST
• FASTA and BLAST are faster than dynamic programming (5 times and 50 times, respectively)
• Underlying idea for a heuristic: high-scoring alignments will contain short stretches of identical letters, called words
• FASTA and BLAST first search for matches of words of a given length and score threshold:
  – BLAST for words of length 3 for proteins and 11 for DNA
  – FASTA for words of length 2 for proteins and 6 for DNA
• Next, matches are extended to local (BLAST) and global (FASTA) alignments
FASTA and BLAST
More formally: If the strings $V = v_1..v_m$ and $W = w_1..w_m$ match with at most k mismatches, then they share a p-tuple for $p = \lfloor m/(k+1) \rfloor$, i.e. $v_i..v_{i+p-1} = w_j..w_{j+p-1}$ for some $1 \le i,j \le m-p+1$
FILTRATION ALGORITHM, which detects all matching words of length m with up to k mismatches:
Potential match detection: Find all matches of p-tuples of V, W (can be done in linear time by inserting them into a hash table)
Potential match verification: Verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of the sequences is reached
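The two filtration steps might be sketched as follows. This is an illustrative sketch, not the FASTA or BLAST implementation: the function name `filtration_matches` is made up, and the verification step is simplified to counting mismatches in every length-m window that contains the shared p-tuple, instead of extending left and right.

```python
from collections import defaultdict

def filtration_matches(v, w, m, k):
    """Return all (a, b) with v[a:a+m] and w[b:b+m] differing in at most k positions."""
    p = m // (k + 1)                       # length of the guaranteed shared p-tuple
    index = defaultdict(list)              # hash table: p-tuple -> start positions in w
    for j in range(len(w) - p + 1):
        index[w[j:j + p]].append(j)

    hits = set()
    for i in range(len(v) - p + 1):        # potential match detection
        for j in index.get(v[i:i + p], []):
            # potential match verification (simplified): check every window of
            # length m in which the shared p-tuple sits at the same offset
            for off in range(m - p + 1):
                a, b = i - off, j - off
                if a < 0 or b < 0 or a + m > len(v) or b + m > len(w):
                    continue
                mismatches = sum(x != y for x, y in zip(v[a:a + m], w[b:b + m]))
                if mismatches <= k:
                    hits.add((a, b))
    return sorted(hits)

print(filtration_matches("GATTACA", "CATTAGA", 7, 2))  # [(0, 0)]
```

With k = 1 the same pair of strings is rejected, because they differ in two positions.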
Example for BLAST
Search SWISSPROT for Immunoglobulin: SWISS_PROT:C79A_HUMAN P11912
Example for BLAST
Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for P11912
Database: All non-redundant SwissProt sequences
1,292,592 sequences; 412,925,052 total letters
Example for BLAST: Distribution of Hits
Example for BLAST: Top Hits
Sequences producing significant alignments:                        Score  E-Value
gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...       473  e-133
gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...       312  3e-85
gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...       278  5e-75
gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...        55  1e-07
gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM             38  0.019
gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...        37  0.042
gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...        36  0.052
gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...        36  0.064
gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...        35  0.080
gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211             35  0.095
gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN        34  0.26
gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...        33  0.28
gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J...        33  0.30
gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...        33  0.30
gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...        33  0.30
gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...        33  0.33
gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...        33  0.36
gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...        33  0.38
gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...        33  0.41
Example for BLAST: Alignment
>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query:   1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
           MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct:   1 MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query:  61 NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
           N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct:  56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
           TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
           +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220
Example for BLAST: Lineage Report
root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase)
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
How good is an alignment?
Be careful: Fitch/Smith found 17 alignments for alpha- and beta-chains in chicken haemoglobins
Only one is the correct one (according to the structure)
Given an alignment, how good is it?
Percentage of matching residues, i.e. number of matches divided by the length of the shortest sequence
Advantage: independent of sequence length
E.g.
AT-C-TGAT
-TGCAT-A-
4/6 = 66.67%
More general: also consider gaps, extensions,…
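A minimal sketch of the percentage-of-matches measure, assuming the two aligned rows are given as equal-length strings with "-" for gaps. The name `percent_identity` is made up for illustration.

```python
def percent_identity(row_a, row_b):
    """Matches divided by the length of the shorter ungapped sequence, in percent."""
    assert len(row_a) == len(row_b), "aligned rows must have equal length"
    # a column counts as a match only if both rows show the same residue
    matches = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    shorter = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return 100.0 * matches / shorter

print(round(percent_identity("AT-C-TGAT", "-TGCAT-A-"), 2))  # 66.67
```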
Blast Raw Score
R = a I + b X - c O - d G, where
I is the number of identities in the alignment and a is the reward for each identity
X is the number of mismatches in the alignment and b is the “reward” for each mismatch
O is the number of gaps and c is the penalty for each gap
G is the number of “-” characters in the alignment and d is the penalty for each
The values for a,b,c,d appear at the bottom of a Blast report. For BLASTn they are a=1, b=-3, c=5, d=2
Example
Query:   1 atgctctggccacggcacttgcgga
           |||||||||||||||  |||| |||
Sbjct: 107 atgctctggccacggatcttgtgga

           tcccagggtgatctgtgcacctgcgata  53
           |||||   |||| |||||||||||||||
           tccca---tgatatgtgcacctgcgata  156

R = 1 x 46 + (-3) x 4 - 5 x 1 - 2 x 3 = 23
So, given the scores: how significant is the alignment?
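The raw score of the example can be recomputed from the two aligned rows. This is a sketch under the BLASTn values quoted above (a=1, b=-3, c=5, d=2); the function name `blast_raw_score` is made up, and a run of adjacent gap columns is counted as one gap regardless of which row carries it.

```python
def blast_raw_score(query_row, sbjct_row, a=1, b=-3, c=5, d=2):
    """R = a*I + b*X - c*O - d*G for two aligned rows using '-' as gap character."""
    identities = mismatches = gap_runs = gap_chars = 0
    in_gap = False
    for q, s in zip(query_row, sbjct_row):
        if q == "-" or s == "-":
            gap_chars += 1            # G: every '-' column
            if not in_gap:
                gap_runs += 1         # O: a new gap opens
            in_gap = True
        else:
            in_gap = False
            if q == s:
                identities += 1       # I
            else:
                mismatches += 1       # X
    return a * identities + b * mismatches - c * gap_runs - d * gap_chars

query = "atgctctggccacggcacttgcgga" + "tcccagggtgatctgtgcacctgcgata"
sbjct = "atgctctggccacggatcttgtgga" + "tccca---tgatatgtgcacctgcgata"
print(blast_raw_score(query, sbjct))  # 23
```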
Significance of an alignment
Significance of an alignment needs to be defined with respect to a control population
Pairwise alignment: How can we get a control population?
Generate sequences randomly? Not a good model of real sequences
Chop up both sequences and randomly reassemble them
Database search: How can we get a control population?
Control = whole database
Align the sequence to the control population and see how good the result is in comparison
This is captured by Z-scores, P-values and E-values
Z-score
Z-score normalises the score S: Let m be the mean of the population and std its standard deviation, then Z-score = (S – m) / std
A Z-score of 0 is no better than average, hence the score might have occurred by chance
The higher the Z-score the better
P-value
P-value: probability of obtaining a score ≥ S
Range: 0 ≤ P ≤ 1
Let m be the number of sequences in the control population with score ≥ S
Let p be the size of the control population
Then P-value = m / p
Rule of thumb:
P ≤ 10^-100: exact match
10^-100 ≤ P ≤ 10^-50: nearly identical (SNPs)
10^-50 ≤ P ≤ 10^-10: homology certain
10^-5 ≤ P ≤ 10^-1: usually distant relative
P > 10^-1: probably insignificant
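Z-score and P-value can both be computed directly from a list of control scores. The sketch below assumes the control scores are already available; the numbers in `control` are made up for illustration, and the population standard deviation is used.

```python
from statistics import mean, pstdev

def z_score(score, control_scores):
    """(S - mean) / std over the control population."""
    return (score - mean(control_scores)) / pstdev(control_scores)

def p_value(score, control_scores):
    """Fraction of the control population scoring at least as high as S."""
    at_least = sum(s >= score for s in control_scores)
    return at_least / len(control_scores)

control = [10, 12, 11, 9, 13, 10, 11, 12, 10, 12]   # hypothetical control scores

print(z_score(11, control))   # 0.0, no better than average
print(p_value(12, control))   # 0.4
print(p_value(23, control))   # 0.0, no control sequence scores this high
```

With such a small control population the P-value cannot resolve anything below 1/10, which is why real searches use the whole database as control.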
E-values
E-value also takes the database into account
E-value = expected frequency of a score ≥ S
Range: 0 ≤ E ≤ m, where m is the size of the database
Relationship to P: E = m P
E-values are calculated from the bit score, the length of the query, and the size of the database
BLAST Bit score
The bit score normalizes the raw score R to make scores under different settings comparable
The bit score S is obtained from the raw score R as follows: $S = (\lambda R - \ln K) / \ln 2$, where $\lambda = 1.37$ and $K = 0.711$
Example: S = ( 1.37 x 23 - ln(0.711) ) / ln(2) = 46
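The conversion is a one-liner. A sketch using the lambda and K quoted on this slide as defaults; the function name `bit_score` is made up.

```python
import math

def bit_score(raw_score, lam=1.37, K=0.711):
    """Bit score S = (lambda * R - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

print(round(bit_score(23)))  # 46
```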
E-value
The E-value is then calculated as follows: $E = m \times n \times 2^{-S}$, where
m is the effective length of the query
n is the effective length of the database
S is the bit score
(effective length takes into account that an alignment cannot start at the end of a sequence)
Example: m = 34 (19 nucleotides fewer than the 53 submitted), n = 5,854,611,841
Result: E = 0.003
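The example can be checked numerically. A sketch only; `e_value` is a made-up name and the effective lengths are taken as given.

```python
def e_value(query_len, db_len, bit_score):
    """E = m * n * 2^(-S) with effective lengths m, n and bit score S."""
    return query_len * db_len * 2.0 ** (-bit_score)

print(round(e_value(34, 5_854_611_841, 46), 3))  # 0.003
```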
Precision and Recall
How good are BLAST and FASTA?
True positives, tp = hits which are biologically meaningful
False positives, fp = hits which are not biologically meaningful
True negatives, tn = non-hits which are not biologically meaningful
False negatives, fn = non-hits which are biologically meaningful
Minimise fp and fn
Recall: tp/(tp+fn) (meaningful hits / all meaningful)
Precision: tp/(tp+fp) (meaningful hits / all hits)
But: since no objective data is available, it is difficult to judge BLAST and FASTA’s sensitivity and specificity
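If a reference set of biologically meaningful sequences were available, precision and recall could be computed as below. This is a sketch; the identifiers in the usage lines are placeholders, not real database entries.

```python
def precision_recall(hits, meaningful):
    """Precision tp/(tp+fp) and recall tp/(tp+fn) for a set of reported hits."""
    hits, meaningful = set(hits), set(meaningful)
    tp = len(hits & meaningful)     # reported and meaningful
    fp = len(hits - meaningful)     # reported but not meaningful
    fn = len(meaningful - hits)     # meaningful but missed
    precision = tp / (tp + fp) if hits else 0.0
    recall = tp / (tp + fn) if meaningful else 0.0
    return precision, recall

precision, recall = precision_recall({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"})
print(precision, recall)  # 0.5 and 2/3
```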
Multiple Sequence Alignments
Multiple Sequence Alignment
Align more than two sequences
Choice of sequences:
If too closely related, then largely redundant
If very distantly related, then difficult to generate a good alignment
Additionally use colour for residues with similar properties:
Yellow: Small polar (Gly, Ala, Ser, Thr)
Green: Hydrophobic (Cys, Val, Ile, Leu, Pro, Phe, Tyr, Met, Trp)
Magenta: Polar (Asn, Gln, His)
Red: Negatively charged (Asp, Glu)
Blue: Positively charged (Lys, Arg)
Thioredoxins: WCGPC[K or R] motif
Thioredoxins: Gly/Pro = turn
Thioredoxins: every second hydrophobic = beta strand
Thioredoxins: ca. every 4th hydrophobic = alpha helix
Profiles, PSI-Blast, HMM
Profiles
Derive a profile from a multiple sequence alignment
Useful to:
Align distantly related sequences
Find conserved regions, which may indicate an active site
Classify subfamilies within homologues
How can a profile be used to search?
Insist on profile (such as WCGPC)? Too strict
Use the frequency distribution of the profile…
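Consider frequencies
[Table: residue counts for alignment positions 25 to 30 over the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y; the column layout was lost in extraction. Nonzero counts per position as extracted: 25: 13, 2, 1; 26: 16; 27: 16; 28: 3, 5, 1, 7; 29: 16; 30: 1, 7, 1, 2, 4, 1]
Score for VDFSAS = 13+16+16+7+16+7
Score for ADATAA = 1+16+0+1+16+0
Not good to pick up distant relationships
Better: combine with substitution matrix
Result: position specific substitution matrix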
PSI-BLAST
The globin family of proteins (oxygen transport) occurs in many species
The proteins have the same function and structure
But there are pairs of members of the family sharing less than 10% identical residues
PSI-BLAST idea: score via intermediaries may be better than score from direct comparison
[Figure: three sequences A, B, C; A and B share 50% identity, B and C share 50%, A and C only 10%]
PSI-BLAST
1. BLAST
2. Collect top hits
3. Build multiple sequence alignment from significant local matches
4. Build profile
5. Re-probe database with profile
6. Go back to 2.
PSI-BLAST
But beware of PSI-BLAST: False positives propagate and spread through iterations
If protein A consists of domains D and E, protein B of domains E and F, and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain
Hidden Markov Model
Procedure to generate sequences
State transition system with three types of states:
Deletion
Insertion
Match, which emits residues
Follow probability distribution for successor state
Train model on multiple sequence alignment
[Figure: state diagram with a start state, an end state, and three columns of match, ins and del states]
Summary
Evolutionary model: Indels and substitutions
Homology vs. similarity
Dot plots: Easy visual exploration, but not scalable
Dynamic programming: Local, global, gaps
Substitution matrices (PAM, BLOSUM)
BLAST and FASTA
Scores and significance
Multiple Sequence Alignments
Profiles, PSI-BLAST, HMM
Phylogeny
Motivation
How did the nucleus get into the eucaryotic cells?
Where do we come from? Recent-Africa vs. multi-regional hypothesis
In 1999, encephalitis caused by the West Nile virus broke out in New York. How did the virus come to New York?
How did the nucleus get into the eucaryotic cells?
Simple experiment: BLAST classes of genes with related functions in yeast (a eucaryote) against Bacteria and against Archaea
And count the number of significant hits
How did the nucleus get into the eucaryotic cells?
Mitochondria and energy metabolism: Significantly more hits in Bacteria
Cell organisation: Significantly more hits in Archaea
Fundamental result without any experiment!
[Figure legend: Blue = Bacteria, Grey = Archaea]
Phylogeny
Taxonomists aim to classify and group organisms
E.g. Aristotle, De Partibus Animalium:
Ought we, for instance, to begin by discussing each separate species – man, lion, ox, and the like – taking each kind in hand independently of the rest, or ought we rather to deal first with the attributes which they have in common in virtue of some common element of their nature, and proceed from this as a basis for consideration of them separately other
Schools of Taxonomists
Goal: create taxonomy
Approach: Phenotype, Phylogeny
3 schools:
Phenotype only
Evolutionary taxonomists: Phenotype (+ Phylogeny)
Cladists: Phylogeny (+ Phenotype)
Practical Application: West Nile virus in NY
West Nile virus occurs mainly in Africa
Transmitted by insects and birds
How did the virus get to NY in 1999?
Hundreds of DNA samples taken
All 99.8% identical, hence a single entry to NY!
A phylogenetic tree allows us to deduce the origin
Example: West Nile virus in NY
How can the trees be constructed?
Three Methods to Generate Phylogenetic Trees
Distance-based: Hierarchical clustering
Character-based: Parsimony, Maximum likelihood
Distance-based Approach
Single Alignment
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata
Score: 46 matches, 4 mismatches, 1 gap, 3 gap extensions, e.g. Score = 46x1 - 4x1 - 1x2 - 3x1 = 37
Approach:
Define distance between two sequences, e.g. percentage of mismatches in their alignment
Construct tree, which groups sequences with minimal distances iteratively together
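One possible way to turn an alignment into a distance is sketched below. It assumes that gap columns are ignored and only aligned residue pairs are compared; the slide does not fix this detail, and `mismatch_distance` is a made-up name.

```python
def mismatch_distance(row_a, row_b):
    """Fraction of mismatching columns among the ungapped columns of an alignment."""
    # assumption: columns containing a gap are skipped
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)

seq1 = "atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata"
seq2 = "atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata"
print(mismatch_distance(seq1, seq2))  # 4 mismatches in 50 ungapped columns = 0.08
```

Computing this for every pair of sequences yields the distance matrix that hierarchical clustering starts from.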
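Hierarchical Clustering
Initial distance matrix:
       1    2    3    4    5
1      0    2    6   10    9
2           0    5    9    8
3                0    4    5
4                     0    3
5                          0
After merging 1 and 2:
       (1,2)   3    4    5
(1,2)    0     5    9    8
3              0    4    5
4                   0    3
5                        0
After merging 4 and 5:
       (1,2)   3   (4,5)
(1,2)    0     5     8
3              0     4
(4,5)                0
After merging 3 and (4,5):
           (1,2)  (3,(4,5))
(1,2)        0        5
(3,(4,5))             0
[Figure: dendrogram over the leaves 1 to 5, height axis from 0 to 5]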
Hierarchical Clustering
Given: a distance matrix D = (d_ij) with 1 ≤ i,j ≤ n
Result: a binary tree of clusters
Init:
  ToDo := {}
  For all i in { 1,…, n } do
    Let t_i be a tree without children, i.e. a leaf
    ToDo := ToDo ∪ { t_i }
Main loop:
  While |ToDo| > 1 do
    Find i,j with i ≠ j such that d_ij is minimal
    Add a new column and row labelled k := (i,j) to D
    For all indices h of D apart from k,i,j do
      d_h,k = d_k,h := min { d_h,i , d_h,j } // min = single linkage
    Let t_k be a new tree with children t_i and t_j
    ToDo := ( ToDo ∪ { t_k } ) - { t_i, t_j }
    Remove columns and rows i,j from D
Complexity: O(n^2)
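The procedure can be tried out on a small example. The sketch below is a plain, unoptimised version that rescans all cluster pairs in every iteration; the function name `cluster`, the nested-tuple tree representation and the 5 x 5 example distances are assumptions for illustration.

```python
def cluster(labels, d, linkage=min):
    """Agglomerative clustering; returns (tree as nested tuples, list of merges).

    labels: leaf labels; d: dict mapping a pair (x, y) of labels to its distance.
    linkage: function of two distances, min = single linkage, max = complete linkage.
    """
    dist = {frozenset(pair): value for pair, value in d.items()}
    todo = list(labels)
    merges = []
    while len(todo) > 1:
        # find the two closest clusters among all current pairs
        i, j = min(
            ((a, b) for n, a in enumerate(todo) for b in todo[n + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        k = (i, j)                                  # new cluster with children i and j
        merges.append((k, dist[frozenset((i, j))]))
        for h in todo:                              # distances from k to all others
            if h != i and h != j:
                dist[frozenset((h, k))] = linkage(
                    dist[frozenset((h, i))], dist[frozenset((h, j))]
                )
        todo = [t for t in todo if t != i and t != j] + [k]
    return todo[0], merges

d = {(1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9, (2, 3): 5,
     (2, 4): 9, (2, 5): 8, (3, 4): 4, (3, 5): 5, (4, 5): 3}
tree, merges = cluster([1, 2, 3, 4, 5], d)
print(tree)                        # ((1, 2), (3, (4, 5)))
print([h for _, h in merges])      # merge distances 2, 3, 4, 5
```

Passing `linkage=max` switches the same loop to complete linkage.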
Hierarchical Clustering: How to define distance between clusters?
Distance matrix:
    A  B  C
A   0  1  2
B      0  1
C         0
Single linkage: d_h,k = d_k,h := min { d_h,i , d_h,j }
Example: Distance (A,B) to C is 1
Complete linkage: d_h,k = d_k,h := max { d_h,i , d_h,j }
Example: Distance (A,B) to C is 2
Average linkage: d_h,k = d_k,h := 0.5 d_h,i + 0.5 d_h,j
Example: Distance (A,B) to C is 1.5
[Figure: two dendrograms over the leaves A, B, C]
Are dendrograms always the same independent of the linkage method?
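The three linkage rules differ only in how the two old distances are combined. A tiny sketch for the A, B, C example, where (A,B) has been merged and d(A,C) = 2, d(B,C) = 1; the dictionary name `linkages` is made up.

```python
# Each linkage rule maps the distances d(h,i) and d(h,j) to the new d(h,k).
linkages = {
    "single": min,
    "complete": max,
    "average": lambda x, y: 0.5 * x + 0.5 * y,
}

d_ac, d_bc = 2, 1   # distances of A and B to C from the matrix above
result = {name: rule(d_ac, d_bc) for name, rule in linkages.items()}
print(result)       # {'single': 1, 'complete': 2, 'average': 1.5}
```

Since the merged cluster ends up at a different distance under each rule, later merges, and hence the dendrogram, can differ between linkage methods.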
Parsimony method
Approach: Generate the “smallest” tree containing all the sequences as leaves
Seq  1 2 3 4 5 6
a    G G G G G G
b    G G G A G T
c    G G A T A G
d    G A T C A T
[Figure: tree with leaves a GGGGGG, b GGGAGT, c GGATAG, d GATCAT; branch labels as extracted: 3 G->A, 4 G->T, 5 G->A, 2 G->A, 3 T->A, 4 G->A, 4 T->C, 6 G->T, 6 G->T]
Parsimony
Generate smallest tree
Informative vs. non-informative sites
Build pairs with fewest possible substitutions
Example:
Seq  1 2 3 4 5 6
a    G G G G G G
b    G G G A G T
c    G G A T A G
d    G A T C A T
3 possible trees: ((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
Sites 1,2,3,4 are not informative
Sites 5,6 are informative
5: ((a,b),(c,d))
6: ((a,c),(b,d))
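Informative sites can be found mechanically. The sketch below uses the usual criterion that a site is informative if at least two different characters each occur in at least two sequences; the name `informative_sites` is made up.

```python
from collections import Counter

def informative_sites(seqs):
    """Return the 1-based positions of parsimony-informative columns."""
    sites = []
    for pos, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        # informative: at least two characters that each appear at least twice
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

seqs = {"a": "GGGGGG", "b": "GGGAGT", "c": "GGATAG", "d": "GATCAT"}
print(informative_sites(seqs))  # [5, 6]
```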
Maximum likelihood
Assigns quantitative probabilities to mutation events
Reconstructs ancestors for all nodes in the tree
Assigns branch lengths based on probabilities of the mutational events
For each possible tree topology, the assumed substitution rates are varied to find the parameters that give the highest likelihood of producing the observed data
Problems
Character-based methods tend to be better (based on paleontological data)
All make assumptions:
No back mutations
Same evolutionary rate
Assessing Quality: Bootstrapping
Given a tree obtained from one of the methods above
Generate multiple alignment
For a number of iterations:
Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
Generate tree for the new sequences
Compare this new tree with the given tree
For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased
Bootstrap value = percentage of trees containing the same cluster
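The resampling loop might look as follows. This is a sketch: `cluster_fn` is a hypothetical callback that builds a tree from an alignment with any of the methods above and returns its clusters as a set of frozensets of sequence names; it is not defined here.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; alignment maps name -> row."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(row[c] for c in cols) for name, row in alignment.items()}

def bootstrap_support(alignment, cluster_fn, reference_clusters, iterations=100, seed=0):
    """Percentage of resampled trees that contain each cluster of the given tree."""
    rng = random.Random(seed)
    counts = {c: 0 for c in reference_clusters}
    for _ in range(iterations):
        new_clusters = cluster_fn(bootstrap_alignment(alignment, rng))
        for c in reference_clusters:
            if c in new_clusters:
                counts[c] += 1
    return {c: 100.0 * n / iterations for c, n in counts.items()}
```

Every row is resampled with the same column indices, so each new column is still a column of the original alignment.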
Where do we come from?
Recent-Africa hypothesis: Homo sapiens came out of Africa 100,000-200,000 years ago
Multi-regional hypothesis: Ancestors of Homo sapiens left Africa ca. 2,000,000 years ago
Which one’s right?
Experiment
Mitochondrial DNA from 53 humans in different regions sequenced
Outgroup = mitochondrial DNA of chimpanzee
A nice phylogeny (Nature 2004)
Nature October 2004 Volume 431 No. 7012
Why Mitochondria?
Simple genetic structure:
No repetitions
No pseudogenes
No introns
No recombination
Molecular Clock
Based on genetic and paleontological data, the most recent common ancestor (mrca) of chimpanzee and Homo sapiens dates back 5,000,000 years
Molecular clock: 1.7 x 10^-8 nucleotide changes per site and year
Assumption: equal distribution, no silent mutations
Diversity in Africa: 3.7 x 10^-3 nucleotide changes per site
Diversity outside Africa: 1.7 x 10^-3 nucleotide changes per site
Estimated expansion 1925 generations ago = ca. 40,000 years
Mrca of all humans: 171,500 +/- 50,000 years ago
Mrca of African and non-African: 52,000 +/- 27,500 years ago
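The clock arithmetic can be sketched as below. Two inputs are assumptions that are not on the slide: a generation time of 20 years, and a human-chimpanzee distance of 0.17 changes per site, chosen so that the quoted rate reproduces the 5,000,000 year split. The function names are made up.

```python
def divergence_time(distance_per_site, rate_per_site_year):
    """Years since two lineages split; changes accumulate along both lineages."""
    return distance_per_site / (2 * rate_per_site_year)

def generations_to_years(generations, years_per_generation=20):
    """Convert generations to years (assumed generation time: 20 years)."""
    return generations * years_per_generation

print(generations_to_years(1925))       # 38500, i.e. ca. 40,000 years
print(divergence_time(0.17, 1.7e-8))    # about 5,000,000 years
```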
Experiment supports recent-Africa hypothesis
Summary
Schools of taxonomists
Assumptions made
Methods: Distance-based, Character-based