Motif characterization: Position weight matrix (PWM), Perceptron and their applications
Xuhua Xia
http://dambe.bio.uottawa.ca
Motif characterization
• Input: site-specific sequences
• Approaches:
  – Consensus sequence (the chance of having NNN… increases with increasing number of sequences)
  – Frequency profile (the problem of mutation bias)
  – PWM, sequence logo, perceptron and Gibbs sampler (cannot detect column association)
  – Multiple correspondence analysis
Slide 2
         1234567890123
A4GALT   ATACCATGTCCAA
ACO2     ACAAAATGGCGCC
ACR      GGAGTATGGTTGA
ADM2     CCGCCATGGCCCG
...      ...
Consensus sequence
Slide 3
Sequences flanking the initiation codon of 508 CDSs:
         1234567890123
A4GALT   AUACCAUGUCCAA
ACO2     ACAAAAUGGCGCC
ACR      GGAGUAUGGUUGA
ADM2     CCGCCAUGGCCCG
....     .....

Site     A     C     G     U   Cons
1       75   173   171    89   N
2      105   216   144    43   N
3      199    70   212    27   N
4      124   236    83    65   N
5       60   276   143    29   N
6      502     6     0     0   A
7        0     0     0   508   U
8        0     0   508     0   G
9       98    71   274    65   N
10     141   227    89    51   N
11      49   154   221    84   N
12      89   144   196    79   N
13     121   151   134   102   N
Sum   1563  1724  2175  1142   N
Our objective is to determine whether the sites flanking AUG contribute to start codon recognition. The consensus sequence does not give us the answer.
Consensus sequence
Slide 4
Sequences flanking the initiation codon of 508 CDSs:
         1234567890123
A4GALT   AUACCAUGUCCAA
ACO2     ACAAAAUGGCGCC
ACR      GGAGUAUGGUUGA
ADM2     CCGCCAUGGCCCG
....     .....
Site | counts: A C G U | proportions: A C G U
1 75 173 171 89 0.1476 0.3406 0.3366 0.1752
2 105 216 144 43 0.2067 0.4252 0.2835 0.0846
3 199 70 212 27 0.3917 0.1378 0.4173 0.0531
4 124 236 83 65 0.2441 0.4646 0.1634 0.1280
5 60 276 143 29 0.1181 0.5433 0.2815 0.0571
6 502 6 0 0 0.9882 0.0118 0.0000 0.0000
7 0 0 0 508 0.0000 0.0000 0.0000 1.0000
8 0 0 508 0 0.0000 0.0000 1.0000 0.0000
9 98 71 274 65 0.1929 0.1398 0.5394 0.1280
10 141 227 89 51 0.2776 0.4469 0.1752 0.1004
11 49 154 221 84 0.0965 0.3031 0.4350 0.1654
12 89 144 196 79 0.1752 0.2835 0.3858 0.1555
13 121 151 134 102 0.2382 0.2972 0.2638 0.2008
Sum 1563 1724 2175 1142 0.2367 0.2611 0.3293 0.1729
Consensus: RCCaugGCGG, with R at position -3 and G at position +4.
Are the red numbers red herrings? The problem of mutation bias.
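With $f_{ij}$ the count of nucleotide $i$ at site $j$, $N$ the number of sequences and $L$ the number of sites:

$p_{ij} = \dfrac{f_{ij}}{N}$, e.g., $p_{A,1} = \dfrac{75}{508} = 0.1476$

$p_i = \dfrac{\sum_{j=1}^{L} f_{ij}}{NL}$, e.g., $p_A = \dfrac{1563}{508 \times 13} = 0.2367$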
What background frequencies to use as control?
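The two proportions defined on this slide can be checked numerically. The following is a minimal Python sketch using the count table shown above; the names `COUNTS`, `site_proportions` and `background_proportions` are illustrative and do not come from the slides. It pools all 13 sites to obtain the background, which is only one possible answer to the question of which background frequencies to use.

```python
# Site-specific counts f_ij copied from the slide
# (rows: sites 1-13; columns: A, C, G, U).
BASES = "ACGU"
COUNTS = [
    (75, 173, 171, 89),
    (105, 216, 144, 43),
    (199, 70, 212, 27),
    (124, 236, 83, 65),
    (60, 276, 143, 29),
    (502, 6, 0, 0),
    (0, 0, 0, 508),
    (0, 0, 508, 0),
    (98, 71, 274, 65),
    (141, 227, 89, 51),
    (49, 154, 221, 84),
    (89, 144, 196, 79),
    (121, 151, 134, 102),
]


def site_proportions(counts):
    """p_ij = f_ij / N for each site j (N = number of sequences)."""
    return [[f / sum(row) for f in row] for row in counts]


def background_proportions(counts):
    """p_i = sum_j f_ij / (N * L): nucleotide frequency pooled over all sites."""
    total = sum(sum(row) for row in counts)
    return [sum(row[i] for row in counts) / total for i in range(len(BASES))]


p_site = site_proportions(COUNTS)
p_bg = background_proportions(COUNTS)
print("p_A1 =", round(p_site[0][0], 4))
print("p_A  =", round(p_bg[0], 4))
```

The printed values can be compared with the 0.1476 and 0.2367 given in the worked examples.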
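S = ACGGTACCACGTT (sites 1 to 13)

$L(S \mid \text{Yes}) = p_{A,1}\, p_{C,2}\, p_{G,3}\, p_{G,4}\, p_{T,5}\, p_{A,6}\, p_{C,7}\, p_{C,8}\, p_{A,9}\, p_{C,10}\, p_{G,11}\, p_{T,12}\, p_{T,13}$

$L(S \mid \text{No}) = p_A\, p_C\, p_G\, p_G\, p_T\, p_A\, p_C\, p_C\, p_A\, p_C\, p_G\, p_T\, p_T = p_A^3\, p_C^4\, p_G^3\, p_T^3$

$\log_2 \dfrac{L_{\text{Yes}}}{L_{\text{No}}} = \log_2 \dfrac{p_{A,1}}{p_A} + \log_2 \dfrac{p_{C,2}}{p_C} + \dots + \log_2 \dfrac{p_{T,13}}{p_T}$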
Likelihood, odds ratio, log-odds, PWMS
Xuhua Xia Slide 6
Position weight matrix (PWM)
• Two major purposes of PWM
  – To characterize the sequence pattern (the motif)
  – To facilitate the computation of log-odds (or PWM score), e.g., computing the PWMS for ATACCATGTCCAA
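With $f_i = \sum_{j=1}^{L} f_{ij}$:

$PWM_{ij} = \log_2 \dfrac{p_{ij}}{p_i} = \log_2 \dfrac{f_{ij}/N}{f_i/(NL)} = \log_2 \dfrac{L f_{ij}}{f_i} = \log_2 L + \log_2 \dfrac{f_{ij}}{f_i}$

E.g., for A at site 1 ($f_{A,1} = 75$, $f_A = 1563$, $N = 508$, $L = 13$), with a small pseudocount of 0.0001 so that zero counts do not give $\log_2 0$: $PWM_{A,1} = -0.67845$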
Site        A        C        G        U      Std
1     -0.6784   0.3844   0.0329   0.0198   0.4453
2     -0.1939   0.7044  -0.2147  -1.0277   0.7076
3      0.7275  -0.9188   0.3426  -1.6968   1.1214
4      0.0457   0.8320  -1.0080  -0.4328   0.7786
5     -0.9996   1.0578  -0.2247  -1.5941   1.1452
6      2.0617  -4.4258  -9.5877  -9.5881   5.5288
7     -9.5879  -9.5878  -9.5877   2.5313   6.0595
8     -9.5879  -9.5878   1.6025  -9.5881   5.5952
9     -0.2933  -0.8984   0.7124  -0.4328   0.6782
10     0.2309   0.7760  -0.9075  -0.7821   0.8112
11    -1.2910   0.2167   0.4025  -0.0635   0.7626
12    -0.4320   0.1200   0.2295  -0.1519   0.2961
13     0.0105   0.1884  -0.3184   0.2163   0.2459
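As a rough illustration of how a PWM like this can be derived from the count table, here is a minimal Python sketch. The pseudocount scheme, p_ij = (f_ij + alpha * p_i) / (N + alpha), is an assumption made for this example (the slides do not state the exact form), and the function name `pwm_from_counts` and the default `alpha` are hypothetical. The resulting values are therefore close to, but not necessarily identical with, the tabulated ones.

```python
import math

BASES = "ACGU"
# Site-specific counts f_ij copied from the slides (sites 1-13; columns A, C, G, U).
COUNTS = [
    (75, 173, 171, 89), (105, 216, 144, 43), (199, 70, 212, 27),
    (124, 236, 83, 65), (60, 276, 143, 29), (502, 6, 0, 0),
    (0, 0, 0, 508), (0, 0, 508, 0), (98, 71, 274, 65),
    (141, 227, 89, 51), (49, 154, 221, 84), (89, 144, 196, 79),
    (121, 151, 134, 102),
]


def pwm_from_counts(counts, alpha=0.01):
    """Return one row per site holding log2(p_ij / p_i) for A, C, G, U.

    ASSUMPTION (not confirmed by the slides): the pseudocount alpha is
    spread according to the background, p_ij = (f_ij + alpha*p_i) / (N + alpha).
    alpha must be > 0 so that zero counts do not produce log2(0).
    """
    total = sum(sum(row) for row in counts)
    background = [sum(row[i] for row in counts) / total
                  for i in range(len(BASES))]
    pwm = []
    for row in counts:
        n = sum(row)  # number of sequences contributing to this site
        pwm.append([
            math.log2(((f + alpha * p) / (n + alpha)) / p)
            for f, p in zip(row, background)
        ])
    return pwm


pwm = pwm_from_counts(COUNTS)
for site, row in enumerate(pwm, start=1):
    print(f"{site:2d} " + " ".join(f"{x:9.4f}" for x in row))
```

Cells with zero counts come out strongly negative; how negative depends entirely on the assumed pseudocount.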
$PWMS = \sum_{j=1}^{L} PWM_{S_j,\,j} = \log_2 \dfrac{L_{\text{Yes}}}{L_{\text{No}}}$
RCCAUGG
PWMS = -0.6784 - 1.0277 + 0.7275 + … + 0.0105
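The sum above can be reproduced by looking up one PWM entry per site. The sketch below is a minimal Python example using the PWM values from the table on this slide; the function name `pwms` is illustrative, and mapping T to U is a convenience assumed here because the example sequence is written in DNA letters while the table uses RNA letters.

```python
BASES = "ACGU"
# PWM copied from the slide (rows: sites 1-13; columns: A, C, G, U).
PWM = [
    (-0.6784, 0.3844, 0.0329, 0.0198),
    (-0.1939, 0.7044, -0.2147, -1.0277),
    (0.7275, -0.9188, 0.3426, -1.6968),
    (0.0457, 0.8320, -1.0080, -0.4328),
    (-0.9996, 1.0578, -0.2247, -1.5941),
    (2.0617, -4.4258, -9.5877, -9.5881),
    (-9.5879, -9.5878, -9.5877, 2.5313),
    (-9.5879, -9.5878, 1.6025, -9.5881),
    (-0.2933, -0.8984, 0.7124, -0.4328),
    (0.2309, 0.7760, -0.9075, -0.7821),
    (-1.2910, 0.2167, 0.4025, -0.0635),
    (-0.4320, 0.1200, 0.2295, -0.1519),
    (0.0105, 0.1884, -0.3184, 0.2163),
]


def pwms(seq, pwm=PWM):
    """Sum the PWM entry of the observed nucleotide at each site."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA letters
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal the number of PWM sites")
    return sum(pwm[j][BASES.index(base)] for j, base in enumerate(seq))


score = pwms("ATACCATGTCCAA")
print(round(score, 4))
```

A positive total means the 13-mer looks more like the motif than like background; a negative total means the opposite.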
Slide 7
PWMS over sites
[Plot: PWMS (y-axis, -15 to 15) against Site (x-axis, 0 to 50)]
Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG on each side. PWMS computed with α = 0.01.
12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
------------- ------------- -------------
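Scanning amounts to sliding a 13-base window along the sequence and computing a PWMS for each window. Below is a minimal Python sketch that scans the sequence shown above with the PWM tabulated earlier in these slides; the names `scan`, `best_start` and `best_score` are illustrative. Because the tabulated PWM and the figure need not have been computed with the same pseudocount, the peak score from this sketch is not expected to equal the 12.3897 quoted in the caption, although the peak should fall on the same 13-mer.

```python
BASES = "ACGU"
# PWM copied from the earlier slide (rows: sites 1-13; columns: A, C, G, U).
PWM = [
    (-0.6784, 0.3844, 0.0329, 0.0198),
    (-0.1939, 0.7044, -0.2147, -1.0277),
    (0.7275, -0.9188, 0.3426, -1.6968),
    (0.0457, 0.8320, -1.0080, -0.4328),
    (-0.9996, 1.0578, -0.2247, -1.5941),
    (2.0617, -4.4258, -9.5877, -9.5881),
    (-9.5879, -9.5878, -9.5877, 2.5313),
    (-9.5879, -9.5878, 1.6025, -9.5881),
    (-0.2933, -0.8984, 0.7124, -0.4328),
    (0.2309, 0.7760, -0.9075, -0.7821),
    (-1.2910, 0.2167, 0.4025, -0.0635),
    (-0.4320, 0.1200, 0.2295, -0.1519),
    (0.0105, 0.1884, -0.3184, 0.2163),
]
SEQ = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
       "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")


def scan(seq, pwm=PWM):
    """Return a list of (start, PWMS) for every window; start is 0-based."""
    width = len(pwm)
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        total = sum(pwm[j][BASES.index(b)] for j, b in enumerate(window))
        results.append((start, total))
    return results


scores = scan(SEQ)
best_start, best_score = max(scores, key=lambda item: item[1])
print(best_start + 1, SEQ[best_start:best_start + len(PWM)], round(best_score, 4))
```

Windows whose sites 6 to 8 are not AUG pick up the large negative entries of the PWM, which is why the profile shows a single sharp peak.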
PWM: position weight matrix
• Also called position-specific scoring matrix (PSSM)
• Used in
  – Characterizing sequence motifs
    • Eukaryotic translation initiation consensus
    • Splicing sites
    • Branchpoint sites
    • Shine-Dalgarno sequences
  – Database searches
    • PHI-BLAST
    • PSI-BLAST
    • RPS-BLAST
Slide 8
Slide 9
BLAST Programs
Program            Database     Query        Typical uses
BLASTN/MEGABLAST   Nucleotide   Nucleotide   MEGABLAST has a longer word size than BLASTN
BLASTP             Protein      Protein      Query a protein/peptide against a protein database
BLASTX             Protein      Nucleotide   Translate a nucleotide sequence into a “protein” in six frames and search against a protein database
TBLASTN            Nucleotide   Protein      Unannotated nucleotide sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
TBLASTX            Nucleotide   Nucleotide   6-frame translation of both query and database
PHI-BLAST          Protein      Protein      Pattern-hit initiated BLAST
PSI-BLAST          Protein      Protein      Position-specific iterated BLAST
RPS-BLAST          Protein      Protein      Reverse PSI-BLAST
Yeast 5’ ss PWM
Slide 10
Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG∣GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.
Site | counts: A C G U | χ² | p | PWM: A C G U
−5 94 32 57 92 11.798 0.0081088 0.0641 −0.7117 0.0245 0.2792
−4 119 47 48 61 14.117 0.0027505 0.4032 −0.1599 −0.2225 −0.3115
−3 139 38 43 55 39.672 0.0000001 0.6268 −0.4651 −0.3805 −0.4601
−2 138 40 36 61 38.899 0.0000001 0.6164 −0.3915 −0.6355 −0.3115
−1 91 45 88 51 27.270 0.0000052 0.0174 −0.2223 0.6492 −0.5685
1 0 1 274 0 1060.426 0.0000004 −8.1042 −5.4675 2.2855 −8.1044
2 0 9 0 266 658.096 0.0000003 −8.1042 −2.5200 −8.1048 1.8081
3 268 1 2 4 522.754 0.0000003 1.5723 −5.4675 −4.6732 −4.1523
4 17 29 1 228 428.607 0.0000002 −2.3805 −0.8528 −5.5454 1.5859
5 2 0 272 1 1041.047 0.0000004 −5.2765 −8.1049 2.2750 −5.8967
6 10 8 2 255 583.545 0.0000003 −3.1271 −2.6862 −4.6732 1.7472
7 97 18 39 121 55.570 0.0000001 0.1092 −1.5351 −0.5206 0.6734
8 95 54 35 91 11.363 0.0099180 0.0793 0.0397 −0.6759 0.2635
9 123 45 34 73 22.172 0.0000601 0.4508 −0.2223 −0.7175 −0.0534
10 118 41 38 78 17.334 0.0006034 0.3911 −0.3560 −0.5579 0.0418
11 105 33 43 94 17.367 0.0005940 0.2232 −0.6676 −0.3805 0.3101
12 90 44 42 99 12.109 0.0070180 0.0015 −0.2546 −0.4142 0.3847
Ma and Xia 2011
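The per-site χ² test in Table 3 is a goodness-of-fit test of the four observed counts against the background frequencies. Here is a minimal Python sketch for site −5, using only the standard library; the function names are illustrative, and the closed-form tail probability applies only to 3 degrees of freedom (four nucleotide categories). Small differences from the tabulated χ² of 11.798 are to be expected, since the background frequencies used here are the rounded values quoted in the caption.

```python
import math


def chi_square_site(counts, background):
    """Goodness-of-fit chi-square of observed counts against expected N * p_i."""
    n = sum(counts)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(counts, background))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variate with exactly 3 d.f."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


# Background frequencies (A, C, G, U) as quoted in the Table 3 caption.
BACKGROUND = (0.3279, 0.1915, 0.2043, 0.2763)
# Observed counts at site -5 of the 5' ss (A, C, G, U), from Table 3.
site_minus5 = (94, 32, 57, 92)

chi2 = chi_square_site(site_minus5, BACKGROUND)
p_value = chi2_sf_df3(chi2)
print(round(chi2, 3), round(p_value, 4))
```

The same two functions can be applied row by row to the rest of the table.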
Yeast 3’ss PWM
Slide 11
Site | counts: A C G U | χ² | p | PWM: A C G U
-12    61   53   34  130     56.0  0        -0.5608   0.0033  -0.7205   0.7648
-11    70   47   20  141     83.2  0        -0.3649  -0.1682  -1.4696   0.8813
-10    79   42   12  145     99.9  0        -0.1926  -0.3285  -2.1802   0.9214
-9     38   30   23  187    219.4  0        -1.2308  -0.8067  -1.2731   1.2867
-8     51   42   27  158    121.6  0        -0.8149  -0.3285  -1.0470   1.0447
-7     91   33   28  126     53.8  0         0.0093  -0.6715  -0.9956   0.7200
-6     95   42   35  106     22.0  0.0001    0.0707  -0.3285  -0.6794   0.4722
-5     93   33   23  129     63.3  0         0.0403  -0.6715  -1.2731   0.7537
-4    136   25   38   79     43.3  0         0.5842  -1.0647  -0.5626   0.0517
-3     12  121    0  145    272.3  0        -2.8223   1.1862  -6.6480   0.9214
-2    277    1    0    0    563.7  0         1.6056  -5.1232  -6.6480  -6.6469
-1      0    0  278    0   1082.7  0        -6.6464  -6.6483   2.2900  -6.6469
1      93   37   73   75      9.7  0.0217    0.0403  -0.5089   0.3691  -0.0226
2      72   64   54   88      8.0  0.0466   -0.3248   0.2729  -0.0619   0.2059
3      90   54   48   86      2.5  0.4771   -0.0065   0.0300  -0.2299   0.1730
4      83   43   54   98      8.7  0.0337   -0.1221  -0.2950  -0.0619   0.3599
5      90   65   37   86     10.6  0.0140   -0.0065   0.2951  -0.6005   0.1730
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3’ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.
Ma and Xia 2011
PWMS as a proxy of splicing strength
Slide 13
                  5' ss                  3' ss
             NRG        RG          NRG        RG
PWMS Mean    8.8138     11.1978     5.3129     7.1762
PWMS Var.    31.5069    4.8646      13.3017    8.2077
N            44         202         49         229
t            -4.6346                -3.9257
p            0.0000                 0.0001
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, the non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, the recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
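The t statistics in Table 6 can be recovered from the summary statistics alone. The following is a minimal Python sketch of the equal-variance (pooled) two-sample t-test, with the means, variances and sample sizes copied from the table; the function name `pooled_t` is illustrative, and the p-value is left out because it would need a t-distribution routine that the standard library does not provide.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# NRG versus RG, values from Table 6.
t5, df5 = pooled_t(8.8138, 31.5069, 44, 11.1978, 4.8646, 202)  # 5' ss
t3, df3 = pooled_t(5.3129, 13.3017, 49, 7.1762, 8.2077, 229)   # 3' ss
print(round(t5, 4), df5)
print(round(t3, 4), df3)
```

The NRG variance at the 5' ss is several times the RG variance, which may be one reason the slide also mentions the Mann-Whitney test.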
[Plot: Splicing efficiency (y-axis, 0.9 to 1.12) against Gene expression (x-axis, 0 to 35)]
Highly expressed genes should have high splicing efficiency.
Lowly expressed genes could have their splicing sites drifting to low efficiency.
Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.
PWMS and Splicing Mechanisms
• Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution
• What does a strongly negative PWMS mean?
• 5’ ss:
  – HAC1: -8.8291
  – HFM1: -7.3825
  – HOP2: -7.8898
• 3’ ss:
  – HAC1: -4.4039
  – REC102: -3.4464
Slide 16
Slide 17
Perceptron
• The perceptron is one of the simplest artificial neural networks, invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).
• Perceptrons have been used in bioinformatics research since the 1980s:
  – The identification of translational initiation sites in E. coli (Stormo et al., 1982a).
  – Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991).
  – More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.
Slide 18
What perceptron does
• Positive sequences
  POS1 ACGT
  POS2 GCGC
• Negative sequences
  NEG1 AGCT
  NEG2 GGCC
• Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences
Slide 19
Definitions
Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.
Base 1 2 3 4
A 1 1 1 1
C 1 1 1 1
G 1 1 1 1
T 1 1 1 1
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC

$PS = \sum_{j=1}^{L} W_{S_j,\,j}$

$PS_{POS1} = W_{A,1} + W_{C,2} + W_{G,3} + W_{T,4} = 4$

$W_{S_j,j} = W_{S_j,j} + 1$, if S is from POS group and PS < 0
$W_{S_j,j} = W_{S_j,j} - 1$, if S is from NEG group and PS ≥ 0
No change in $W_{S_j,j}$ otherwise.

One may initialize all values to 0 instead of 1.
For amino acid sequences, the matrix would be 20 by 4.
Slide 20
Iterations and convergence
First round of the training process in the perceptron algorithm. Updated values are highlighted in bold.

(a) NEG1: AGCT, PS = 4, update
Base   1   2   3   4
A      0   1   1   1
C      1   1   0   1
G      1   0   1   1
T      1   1   1   0

(b) NEG2: GGCC, PS = 2, update
Base   1   2   3   4
A      0   1   1   1
C      1   1  -1   0
G      0  -1   1   1
T      1   1   1   0

(c) POS1: ACGT, PS = 2, no update
Base   1   2   3   4
A      0   1   1   1
C      1   1  -1   0
G      0  -1   1   1
T      1   1   1   0

(d) POS2: GCGC, PS = 2, no update
Base   1   2   3   4
A      0   1   1   1
C      1   1  -1   0
G      0  -1   1   1
T      1   1   1   0
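The training round shown above can be reproduced in a few lines. This is a minimal Python sketch of the update rule as given in these slides (add 1 for a positive sequence scoring below 0, subtract 1 for a negative sequence scoring 0 or more), visiting the negatives first as in the worked example; the names `score` and `train` and the `max_rounds` safety cap are illustrative choices, not part of the original.

```python
BASES = "ACGT"
POS = ["ACGT", "GCGC"]
NEG = ["AGCT", "GGCC"]


def score(w, seq):
    """Perceptron score PS: sum of W[base][site] over the sequence."""
    return sum(w[base][j] for j, base in enumerate(seq))


def train(pos, neg, init=1, max_rounds=100):
    """Repeat passes over the data until a full pass makes no update."""
    length = len(pos[0])
    w = {base: [init] * length for base in BASES}
    # Same visiting order as the worked example: negatives, then positives.
    examples = [(s, -1) for s in neg] + [(s, +1) for s in pos]
    for _ in range(max_rounds):  # cap is a safety net, not part of the rule
        changed = False
        for seq, label in examples:
            ps = score(w, seq)
            misclassified = (label == +1 and ps < 0) or (label == -1 and ps >= 0)
            if misclassified:
                for j, base in enumerate(seq):
                    w[base][j] += label  # +1 for POS, -1 for NEG
                changed = True
        if not changed:
            break
    return w


W = train(POS, NEG)
for base in BASES:
    print(base, W[base])
print("POS scores:", [score(W, s) for s in POS])
print("NEG scores:", [score(W, s) for s in NEG])
```

For this toy data set the second pass makes no update, so training stops with the matrix shown in panel (d).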
Slide 21
Post-processing
Base 1 2 3 4
A 0 1 1 1
C 1 1 -1 0
G 0 -1 1 1
T 1 1 1 0
Base 1 2 3 4
A 0 0 0 0
C 0 1 -1 0
G 0 -1 1 0
T 0 0 0 0
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC

What is the score for TAAA?

A W entry of 0 means either that there are no data for that cell or that the cell has no discriminant power.
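The slides do not spell out the post-processing rule, so the following Python sketch rests on an assumption: a cell is kept only if its base occurs at that site in exactly one of the two groups, and is set to 0 if the base occurs in both groups (no discriminant power) or in neither (no data). Under that assumption the sketch reproduces the second matrix on this slide; `post_process` and `score` are hypothetical helper names.

```python
BASES = "ACGT"
POS = ["ACGT", "GCGC"]
NEG = ["AGCT", "GGCC"]

# Weight matrix after training, copied from the slide (sites 1-4 per base).
W = {
    "A": [0, 1, 1, 1],
    "C": [1, 1, -1, 0],
    "G": [0, -1, 1, 1],
    "T": [1, 1, 1, 0],
}


def score(w, seq):
    """Perceptron score: sum of W[base][site] over the sequence."""
    return sum(w[base][j] for j, base in enumerate(seq))


def post_process(w, pos, neg):
    """ASSUMED rule: keep a weight only where the base is seen at that site
    in exactly one group; zero it where it is seen in both or in neither."""
    length = len(pos[0])
    cleaned = {base: [0] * length for base in w}
    for j in range(length):
        seen_pos = {s[j] for s in pos}
        seen_neg = {s[j] for s in neg}
        for base in seen_pos ^ seen_neg:  # symmetric difference
            cleaned[base][j] = w[base][j]
    return cleaned


W_clean = post_process(W, POS, NEG)
for base in BASES:
    print(base, W_clean[base])
print("TAAA before:", score(W, "TAAA"), "after:", score(W_clean, "TAAA"))
```

Before post-processing, TAAA scores as high as a positive sequence purely from untouched initial values; afterwards it scores 0, reflecting that the training data say nothing about it.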
Doublet perceptron
Slide 22
Positive (P) and negative (N) sequences, with their doublets at doublet sites 1-9:

      1234567890    1  2  3  4  5  6  7  8  9
P1    ACGUAUACGU    AC CG GU UA AU UA AC CG GU
P2    ACGUCUACGU    AC CG GU UC CU UA AC CG GU
P3    ACGUGUACGU    AC CG GU UG GU UA AC CG GU
P4    ACGUUAACGU    AC CG GU UU UA AA AC CG GU
P5    ACGUUCACGU    AC CG GU UU UC CA AC CG GU
P6    ACGUUGACGU    AC CG GU UU UG GA AC CG GU

N1    ACGUAAACGU    AC CG GU UA AA AA AC CG GU
N2    ACGUACACGU    AC CG GU UA AC CA AC CG GU
N3    ACGUAGACGU    AC CG GU UA AG GA AC CG GU
N4    ACGUCAACGU    AC CG GU UC CA AA AC CG GU
N5    ACGUCCACGU    AC CG GU UC CC CA AC CG GU
N6    ACGUCGACGU    AC CG GU UC CG GA AC CG GU
N7    ACGUGAACGU    AC CG GU UG GA AA AC CG GU
N8    ACGUGCACGU    AC CG GU UG GC CA AC CG GU
N9    ACGUGGACGU    AC CG GU UG GG GA AC CG GU
N10   ACGUUUACGU    AC CG GU UU UU UA AC CG GU
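The doublet encoding above is just every pair of adjacent nucleotides, so a sequence of length 10 gives 9 doublet sites. Here is a minimal Python sketch of that encoding using the sequences from this slide; the function name `doublets` is illustrative. It also checks the point of the example: at doublet site 5 the positive and negative groups share no doublet, even though each single nucleotide at sites 5 and 6 occurs in both groups.

```python
POS = ["ACGUAUACGU", "ACGUCUACGU", "ACGUGUACGU",
       "ACGUUAACGU", "ACGUUCACGU", "ACGUUGACGU"]
NEG = ["ACGUAAACGU", "ACGUACACGU", "ACGUAGACGU", "ACGUCAACGU",
       "ACGUCCACGU", "ACGUCGACGU", "ACGUGAACGU", "ACGUGCACGU",
       "ACGUGGACGU", "ACGUUUACGU"]


def doublets(seq):
    """Overlapping adjacent pairs: a sequence of length n gives n - 1 doublets."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


print(" ".join(doublets(POS[0])))

# Doublet site 5 (0-based index 4) covers nucleotide sites 5 and 6.
pos_site5 = {doublets(s)[4] for s in POS}
neg_site5 = {doublets(s)[4] for s in NEG}
print(sorted(pos_site5))
print(sorted(neg_site5))
print("shared doublets at site 5:", pos_site5 & neg_site5)

# Single nucleotides at site 5 occur in both groups, so a singlet matrix
# cannot separate the groups on that site alone.
print({s[4] for s in POS} & {s[4] for s in NEG})
```

Note that the number of cells grows from 4 to 16 per site under this encoding, which is why more training data are needed.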
Doublet Perceptron
Slide 23
Doublet    1    2    3      4      5      6    7    8    9
AA         0    0    0      0     -6   -4.3    0    0    0
AC         0    0    0      0     -4      0    0    0    0
AG         0    0    0      0     -2      0    0    0    0
AU         0    0    0      0   8.33      0    0    0    0
CA         0    0    0      0     -4     -1    0    0    0
CC         0    0    0      0     -1      0    0    0    0
CG         0    0    0      0     -1      0    0    0    0
CU         0    0    0      0      5      0    0    0    0
GA         0    0    0      0     -1   -0.7    0    0    0
GC         0    0    0      0     -1      0    0    0    0
GG         0    0    0      0     -1      0    0    0    0
GU         0    0    0      0   3.33      0    0    0    0
UA         0    0    0   -3.7   6.67   5.67    0    0    0
UC         0    0    0     -1      5      0    0    0    0
UG         0    0    0   0.33   3.33      0    0    0    0
UU         0    0    0      4    -11      0    0    0    0
A large amount of data is needed to avoid the problem of overfitting.
Gene/Motif Prediction
• Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology).
  – Cellular localization
  – Biological processes the gene (its product) participates in
  – The biological reaction
• Related terms:
  – Motif: e.g., RccAUGG
  – Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively
• Gene/Motif prediction methods
  – Position weight matrix
  – Perceptrons
  – Supervised learning
  – Hidden Markov Models (HMMs)
  – Neural networks (e.g., self-organizing map or SOM)