Evaluation of the usefulness of Markov models for various applications in bioinformatics
Jacek Leluk, Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University
Markov models in the identification and localization of coding sequences in the genome
Identification of coding regions in the genome
Methods based on reference coding DNA, using:
  occurrence of oligonucleotides: codon usage, amino acid usage, codon preference, hexamer usage
  biases in the occupancy of codon positions: codon prototype
  dependencies in the occupancy of adjacent positions: Markov models
Methods independent of reference coding DNA, using:
  biases in the occupancy of codon positions: position asymmetry
  periodic correlation between nucleotide positions: periodic asymmetry index, average mutual information, Fourier spectra
Markov Models
Methods requiring reference coding DNA
Biases in the occupancy of consecutive adjacent positions
In Markov models, the probability that a given nucleotide occurs at a particular codon position depends on the nucleotide(s) immediately preceding it in the sequence.
The simplest example is the first-order Markov model. It is based on the probability of encountering each of the 4 nucleotides at each of the three codon positions, conditional on the nucleotide that precedes that position. The method uses three 4x4 transition matrices (F1, F2 and F3), one for each of the three codon positions.
Markov models of order 1 to 5 are used.
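The first-order model described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the training sequences are taken to be in-frame coding DNA over A, C, G, T, and the function names, the pseudocount and the toy training sequence are hypothetical, not taken from the presentation.

```python
import math

NUCLEOTIDES = "ACGT"


def train_transition_matrices(coding_seqs, pseudocount=1.0):
    """Estimate F1, F2, F3 from in-frame coding sequences.

    matrices[k][prev][cur] approximates P(cur at codon position k+1 | prev),
    where prev is the nucleotide immediately before cur.
    """
    counts = [
        {p: {c: pseudocount for c in NUCLEOTIDES} for p in NUCLEOTIDES}
        for _ in range(3)
    ]
    for seq in coding_seqs:
        for i in range(1, len(seq)):
            prev, cur = seq[i - 1], seq[i]
            if prev in NUCLEOTIDES and cur in NUCLEOTIDES:
                counts[i % 3][prev][cur] += 1  # i % 3 = codon position of cur
    matrices = []
    for table in counts:
        matrices.append({
            prev: {cur: n / sum(row.values()) for cur, n in row.items()}
            for prev, row in table.items()
        })
    return matrices


def log_likelihood(seq, matrices, frame=0):
    """Log-probability of seq, assuming seq[frame] is a first codon position."""
    total = 0.0
    for i in range(1, len(seq)):
        total += math.log(matrices[(i - frame) % 3][seq[i - 1]][seq[i]])
    return total


training = ["ATGGCG" * 3]  # toy data, purely hypothetical
F = train_transition_matrices(training)
in_frame = log_likelihood(training[0], F, frame=0)
shifted = log_likelihood(training[0], F, frame=1)
print(in_frame > shifted)
```

Scoring a candidate region in each of the three frames and comparing the scores is one simple way such matrices could be used to flag coding potential. A real gene finder would also need a model of non-coding DNA for comparison.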
Genetic conditioning of the amino acid replacement probabilities and spectrum in
molecular evolution
Do amino acids have a pedigree?
or...
Do they contain information about their history (genealogy)?
or...
Can amino acid mutational replacements be described as Markovian processes?
The Markov model assumes that the probability of substituting amino acid AA1 with AA2 is the same regardless of which residue (AAx or AAy) AA1 itself arose from.
The statistical algorithms currently in use are based on a Markovian model of amino acid replacement (they directly use stochastic matrices of replacement frequency indices).
AAx → AA1 → AA2, with probability Pa for the step AA1 → AA2
AAy → AA1 → AA2, with probability Pb for the step AA1 → AA2
Pa = Pb
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -5  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
PAM250 matrix of amino acid replacements
Why is tryptophan the most conserved residue here?
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
BLOSUM62 matrix of amino acid replacements
Matrix      Arg/Lys score
PAM250      3
BLOSUM62    2
BLOSUM35    2
BLOSUM45    3
BLOSUM100   3
Arg/Lys replacement according to the statistical interpretation using stochastic matrix indices
Figure: Diagram of genetic relationships between amino acids (legend: bases A, G, C, U; codon positions 1, 2, 3)
Figure: Diagram of amino acid genetic relationships (legend: bases A, G, C, U; codon positions 1, 2, 3)
CAA UAA GAA AAA
CAG UAG GAG AAG
CAC UAC GAC AAC
CAU UAU GAU AAU
CGA UGA GGA AGA
CGG UGG GGG AGG
CGC UGC GGC AGC
CGU UGU GGU AGU
CCA UCA GCA ACA
CCG UCG GCG ACG
CCC UCC GCC ACC
CCU UCU GCU ACU
CUA UUA GUA AUA
CUG UUG GUG AUG
CUC UUC GUC AUC
CUU UUU GUU AUU
Diagram of codon genetic relationships
Figure: Arginine-to-lysine mutational conversion pathways for arginines of different origin
Codons shown in the diagram: Met AUG, Arg AGG, Lys AAG, Pro CCC, Asn AAC, Gln CAG, His CAC, Ser AGC, Arg CGG, Arg CGC; one pathway is marked "?"
Possible single-point-mutational processing of serine with respect to its origin
Trp UGG → Ser UCG → Ala, Thr, Pro, Trp, Ser, Leu, stop (UAG)
Asn AAU → Ser AGU → Thr, Ser, Ile, Asn, Arg, Cys, Gly
Amino acid mutational substitution based on single transitions/transversions is NOT a Markovian process
Theoretical proof: the conversion pathways of arginine into lysine, glutamine and serine, for arginines arising from codons that encoded different amino acids
Possible codons for arginine: AGA AGG CGA CGG CGC CGT
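Which arginine codons can reach lysine, glutamine or serine by a single point mutation can be checked with a short Python sketch. It assumes the standard genetic code written in the DNA alphabet (T instead of U); the helper names are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbours(codon):
    """Yield the 9 codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def one_step_targets(codon):
    """Set of residues (one-letter codes) reachable by one point mutation."""
    return {CODON_TABLE[n] for n in neighbours(codon)}


ARG_CODONS = ["AGA", "AGG", "CGA", "CGG", "CGC", "CGT"]
for codon in ARG_CODONS:
    reach = one_step_targets(codon)
    print(codon, {aa: (aa in reach) for aa in "KQS"})
```

Under these assumptions only AGA and AGG have lysine as a one-step neighbour, only CGA and CGG have glutamine, and CGA and CGG have no one-step route to serine, so the six arginine codons are not interchangeable starting points.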
Conversion of arginine into lysine
Codons shown in the pathway diagram: Gln CAR, Ser AGY, His CAY, Leu CTR, Met ATG; Arg AGR, Arg AGG, Arg CGR, Arg CGY; Lys AAR, Lys AAG (R = purine, Y = pyrimidine)
Conversion of arginine into serine
Codons shown in the pathway diagram: Met ATG, Leu CTR, His CAY; Arg AGR, Arg AGG, Arg CGR, Arg CGY; Ser AGY
Conversion of arginine into glutamine
Codons shown in the pathway diagram: Lys AAG, His CAY, Met ATG, Leu CTR; Arg AGG, Arg CGG, Arg CGR, Arg CGY; Gln CAG, Gln CAR
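The dependence on history in these pathways can be made numerical with a small Python sketch. The model is an assumption made for illustration, not the author's method: every one of the 9 single-nucleotide changes of a codon is taken as equally likely, the codons of the ancestral residue are weighted equally, and the standard genetic code is written with T instead of U. The function name `prob_next` is hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbours(codon):
    """Yield the 9 codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def prob_next(origin, current, target):
    """P(next residue = target | current residue, residue it arose from)."""
    # Weight each codon of `current` by how many one-step routes lead to it
    # from the codons of `origin`.
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa == origin:
            for n in neighbours(codon):
                if CODON_TABLE[n] == current:
                    weights[n] = weights.get(n, 0) + 1
    total = sum(weights.values())
    if total == 0:
        raise ValueError("current residue is not one step away from origin")
    prob = Fraction(0)
    for codon, w in weights.items():
        hits = sum(1 for n in neighbours(codon) if CODON_TABLE[n] == target)
        prob += Fraction(w, total) * Fraction(hits, 9)
    return prob


pa = prob_next("M", "R", "K")  # Met -> Arg -> Lys
pb = prob_next("H", "R", "K")  # His -> Arg -> Lys
print(pa, pb)  # 1/9 0
```

In this toy model an arginine derived from Met (ATG to AGG) can become lysine in one further step, while an arginine derived from His (CAY to CGY) cannot, so Pa differs from Pb even though both residues are arginine.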
Then...
The probability of replacing one amino acid with another depends significantly on which amino acids occupied that position in the past.
There is a high risk that commonly used algorithms applying stochastic data matrices (MDM, PAM, BLOSUM) lead to a wrong interpretation of the mutational processes occurring in proteins.
Figure: Genetic relationships between Arg and Met/Gln (diagram of amino acid genetic relationships with the Met, Arg and Gln positions marked)
Squash inhibitors (Cucurbitaceae)
1. RVMIG * 2. RVMIGS * 3. C 4. P 5. RKL 6. I 7. [LW][Y] 8. MNK 9. REKQP # 10. C 11. KSQT 12. [KSRHQTY][V] # 13. DN 14. RSDA 15. D 16. C 17. LFMP 18. ALTPGR 19. DEGQK 20. C 21. VITKR 22. C 23. LKGQMV 24. PKEQRSA # 25. NHEQDS– 26. [I][D]– 27. GE 28. YFIH 29. C 30. G
Bowman-Birk inhibitors
47. C 48. C 49. DRBSN 50. QHELZRSIFTK # 51. C 52. ASTKEMILRDVPF * 53. C 54. T 55. [KR][A] 56. S 57. NMIEKRDQ *# 58. P 59. P 60. QKZETI 61. C 62. [RHQS][V] # 63. C 64. STNVAEHR 65. DZBN 66. MILVTR * 67. R 68. L 69. NDE 70. SKTR 71. C 72. H 73. S 74. A 75. C 76. KSDEN 77. SLGRTFH 78. C
79. IAVLM 80. C 81. ATNR 82. LYFRK 83. S 84. YIEFMQDN 85. P 86. AGP 87. QKZM 88. C 89. FVRIHSQ # 90. C 91. VTBGLAYF 92. DB 93. [IMTV][Q] 94. TNBKAHD 95. DBNKT 96. FSY 97. C 98. [YH][T] 99. EAKPD 100. PSAK 101. C
Ovomucoid domains (Kazal type)
1. VILE 2. NDH 3. C 4. [STR][D] 5. LPKQE 6. YF 7. ALPKQ 8. SQTK 9. GTRS– 10. IVKNT 11. GVSTL 12. KRTQ– # 13. DGN– 14. G– 15. TNRKE– 16. STLAQP 17. WMLIV– 18. VTI 19. [A][R]– 20. C 21. PT 22. [RM][F] * 23. [NI][E] 24. [L][Y] 25. KSLQDV 26. [P][E] 27. [V][H] 28. C 29. GA 30. TS 31. DN 32. GS 33. SFV
34. T 35. Y 36. SDA 37. NS 38. [ED][R] 39. C 40. GSTF 41. ILF 42. C 43. [L][A][N] 44. [YH][A] 45. NY 46. RAILV 47. EQ 48. HQLS 49. GHRN 50. ATR 51. [NHST][E] 52. VIL 53. ESKAGN 54. [K][L] 55. ELKSRV 56. [YHS][K] 57. [DN][M] 58. GA 59. EKRA 60. C 61. RKE 62. PLQE 63. KERD 64. [ISV][H] 65. [VG][PT] 66. [MEK][PS]
Markers: * position where both Arg and Met occur; # position where both Arg and Gln occur
Arg-Met and Arg-Gln substitutions. "Two kinds" of arginine
PAM250 and BLOSUM62 scores for the replacements Arg-Lys, Lys-Gln, Lys-Glu, Arg-Gln and Arg-Glu
Replacement PAM250 BLOSUM62
Arg/Lys 3 2
Lys/Gln 1 1
Arg/Gln 1 1
Lys/Glu 0 1
Arg/Glu -1 0
Figure: Genetic relationships among Arg, Lys, Glu and Gln (diagram of amino acid genetic relationships with the Arg, Lys, Glu and Gln positions marked)
Squash inhibitors (Cucurbitaceae)
1. RVMIG 2. RVMIGS 3. C 4. P 5. RKL 6. I 7. [LW][Y] 8. MNK 9. REKQP 10. C 11. KSQT 12. [KSRHQTY][V] 13. DN 14. RSDA 15. D 16. C 17. LFMP 18. ALTPGR 19. DEGQK 20. C 21. VITKR 22. C 23. LKGQMV 24. PKEQRSA 25. NHEQDS– 26. [I][D]– 27. GE 28. YFIH 29. C 30. G
Bowman-Birk inhibitors
47. C 48. C 49. DRBSN 50. QHELZRSIFTK 51. C 52. ASTKEMILRDVPF 53. C 54. T 55. [KR][A] 56. S 57. NMIEKRDQ 58. P 59. P 60. QKZETI 61. C 62. [RHQS][V] 63. C 64. STNVAEHR ! 65. DZBN 66. MILVTR 67. R 68. L 69. NDE 70. SKTR 71. C 72. H 73. S 74. A 75. C 76. KSDEN 77. SLGRTFH 78. C
79. IAVLM 80. C 81. ATNR 82. LYFRK 83. S 84. YIEFMQDN 85. P 86. AGP 87. QKZM 88. C 89. FVRIHSQ 90. C 91. VTBGLAYF 92. DB 93. [IMTV][Q] 94. TNBKAHD 95. DBNKT 96. FSY 97. C 98. [YH][T] 99. EAKPD 100. PSAK 101. C
Ovomucoid domains (Kazal type)
1. VILE 2. NDH 3. C 4. [STR][D] 5. LPKQE 6. YF 7. ALPKQ 8. SQTK 9. GTRS– 10. IVKNT 11. GVSTL 12. KRTQ– 13. DGN– 14. G– 15. TNRKE– 16. STLAQP 17. WMLIV– 18. VTI 19. [A][R]– 20. C 21. PT 22. [RM][F] 23. [NI][E] 24. [L][Y] 25. KSLQDV 26. [P][E] 27. [V][H] 28. C 29. GA 30. TS 31. DN 32. GS 33. SFV
34. T 35. Y 36. SDA 37. NS 38. [ED][R] ! 39. C 40. GSTF 41. ILF 42. C 43. [L][A][N] 44. [YH][A] 45. NY 46. RAILV 47. EQ 48. HQLS 49. GHRN 50. ATR 51. [NHST][E] 52. VIL 53. ESKAGN 54. [K][L] 55. ELKSRV 56. [YHS][K] 57. [DN][M] 58. GA 59. EKRA 60. C 61. RKE 62. PLQE 63. KERD 64. [ISV][H] 65. [VG][PT] 66. [MEK][PS]
Marker: ! position where Arg and Glu occur without Lys or Gln
Arg-Glu and Lys-Glu substitutions (Arg/Lys/Gln/Glu replacements)
Multiple alignment of seven chicken
ovoinhibitor domains obtained with
Markovian and non-Markovian methods
What part of the codon contains information about the previous amino acid that occupied a given position of the protein sequence?
At most 2/3 of the entire codon.
Ala GCG
Val GUG
How long is the information about the codons of preceding amino acids stored?
Theoretically, the longest period is infinite.
The shortest storage period is 3 transitions/transversions.
Ala GCG → Val GUG → Met AUG → Ile AUA
Ser UCC → Ser UCU → Thr ACU → Ser AGU
Lys AAA → Asn AAC → Asp GAC → His CAC → Gln CAG → Glu GAG → Asp GAU → His CAU → Asn AAU → Lys AAG → Gln CAG → His CAC → Tyr UAU . . .
CONCLUSIONS
The analysis of genetic semihomology rules out the applicability of the Markov model to studies of protein variability at the amino acid level.
Amino acid codons do contain information about the "ancestral" amino acids whose codons were the starting point for the codon of the current residue.
This applies mainly to positions undergoing single-point mutations, the most basic mechanism of evolutionary variability.
Thank you for your attention!