Motif characterization: Position weight matrix (PWM), Perceptron and their applications Xuhua Xia [email protected] http://dambe.bio.uottawa.ca


Post on 16-Dec-2015



Motif characterization
• Input: site-specific sequences
• Approaches:
  – Consensus sequence (the chance of having NNN… increases with increasing number of sequences)
  – Frequency profile (the problem of mutation bias)
  – PWM, sequence logo, perceptron and Gibbs sampler (cannot detect column association)
  – Multiple correspondence analysis

Slide 2

       1234567890123
A4GALT ATACCATGTCCAA
ACO2   ACAAAATGGCGCC
ACR    GGAGTATGGTTGA
ADM2   CCGCCATGGCCCG
...    ...

Consensus sequence

Slide 3

Sequences flanking the initiation codon of 508 CDSs:

       1234567890123
A4GALT AUACCAUGUCCAA
ACO2   ACAAAAUGGCGCC
ACR    GGAGUAUGGUUGA
ADM2   CCGCCAUGGCCCG
....   .....

Site     A     C     G     U   Cons
1       75   173   171    89   N
2      105   216   144    43   N
3      199    70   212    27   N
4      124   236    83    65   N
5       60   276   143    29   N
6      502     6     0     0   A
7        0     0     0   508   U
8        0     0   508     0   G
9       98    71   274    65   N
10     141   227    89    51   N
11      49   154   221    84   N
12      89   144   196    79   N
13     121   151   134   102   N
Sum   1563  1724  2175  1142   N

Our objective is to find if sites flanking AUG contribute to the start codon recognition. The consensus sequence does not give us the answer.
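As a minimal sketch of how such a count table and consensus could be produced, the snippet below tallies nucleotides per site for the four example sequences only. The slide does not say what rule assigns N, so the 0.9 frequency cutoff is an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter

# The four example sequences shown on the slide (13 aligned sites each).
seqs = ["AUACCAUGUCCAA", "ACAAAAUGGCGCC", "GGAGUAUGGUUGA", "CCGCCAUGGCCCG"]

def site_counts(aligned):
    """Return one Counter of nucleotides per aligned site."""
    return [Counter(column) for column in zip(*aligned)]

def consensus(aligned, threshold=0.9):
    """Report a nucleotide only where its frequency reaches `threshold`
    (an assumed cutoff, not taken from the slides); otherwise report 'N'."""
    n = len(aligned)
    out = []
    for counts in site_counts(aligned):
        base, freq = counts.most_common(1)[0]
        out.append(base if freq / n >= threshold else "N")
    return "".join(out)

cons = consensus(seqs)
print(cons)  # NNNNNAUGNNNNN
```

Even with only four sequences the flanking sites already collapse to N, which is the weakness of the consensus approach noted earlier.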

Consensus sequence

Slide 4

Sequences flanking the initiation codon of 508 CDSs:

       1234567890123
A4GALT AUACCAUGUCCAA
ACO2   ACAAAAUGGCGCC
ACR    GGAGUAUGGUUGA
ADM2   CCGCCAUGGCCCG
....   .....

Counts (A, C, G, U) and the corresponding proportions (p):

Site     A     C     G     U    p(A)    p(C)    p(G)    p(U)
1       75   173   171    89  0.1476  0.3406  0.3366  0.1752
2      105   216   144    43  0.2067  0.4252  0.2835  0.0846
3      199    70   212    27  0.3917  0.1378  0.4173  0.0531
4      124   236    83    65  0.2441  0.4646  0.1634  0.1280
5       60   276   143    29  0.1181  0.5433  0.2815  0.0571
6      502     6     0     0  0.9882  0.0118  0.0000  0.0000
7        0     0     0   508  0.0000  0.0000  0.0000  1.0000
8        0     0   508     0  0.0000  0.0000  1.0000  0.0000
9       98    71   274    65  0.1929  0.1398  0.5394  0.1280
10     141   227    89    51  0.2776  0.4469  0.1752  0.1004
11      49   154   221    84  0.0965  0.3031  0.4350  0.1654
12      89   144   196    79  0.1752  0.2835  0.3858  0.1555
13     121   151   134   102  0.2382  0.2972  0.2638  0.2008
Sum   1563  1724  2175  1142  0.2367  0.2611  0.3293  0.1729

RCCaugGCGG: the most frequent nucleotides at sites 3-12, with R (A or G) at position −3 and G at position +4.

Are the red numbers red herrings? The problem of mutation bias.

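$$p_{ij} = \frac{f_{ij}}{N}, \quad \text{e.g., } p_{A,1} = \frac{75}{508} = 0.1476$$

$$p_i = \frac{\sum_{j=1}^{L} f_{ij}}{NL}, \quad \text{e.g., } p_A = \frac{1563}{508 \times 13} = 0.2367$$

Here f_ij is the count of nucleotide i at site j, N is the number of sequences and L is the number of sites.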

What background frequencies to use as control?
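One possible answer, and the one the two formulas above describe, is to pool the counts over all 13 sites. The sketch below reproduces the two worked values for nucleotide A from the counts in the table; the variable names are illustrative only, and other controls (for example, frequencies from a different set of sequences) are equally possible.

```python
# Counts of A at sites 1-13, copied from the table (N = 508 sequences).
a_counts = [75, 105, 199, 124, 60, 502, 0, 0, 98, 141, 49, 89, 121]
N = 508
L = len(a_counts)

# p_ij: site-specific frequency of A at each site j.
p_A_site = [f / N for f in a_counts]

# p_i: background frequency of A pooled over all L sites.
p_A_background = sum(a_counts) / (N * L)

print(round(p_A_site[0], 4))      # 0.1476
print(round(p_A_background, 4))   # 0.2367
```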


Likelihood, odds ratio, log-odds, PWMS

S = ACGGTACCACGTT

$$L_{Yes} = p(S \mid Yes) = p_{A,1}\,p_{C,2}\,p_{G,3}\,p_{G,4}\,p_{T,5}\,p_{A,6}\,p_{C,7}\,p_{C,8}\,p_{A,9}\,p_{C,10}\,p_{G,11}\,p_{T,12}\,p_{T,13}$$

$$L_{No} = p(S \mid No) = p_A\,p_C\,p_G\,p_G\,p_T\,p_A\,p_C\,p_C\,p_A\,p_C\,p_G\,p_T\,p_T = p_A^{3}\,p_C^{4}\,p_G^{3}\,p_T^{3}$$

$$\log_2\frac{L_{Yes}}{L_{No}} = \log_2\frac{p_{A,1}}{p_A} + \log_2\frac{p_{C,2}}{p_C} + \cdots + \log_2\frac{p_{T,13}}{p_T}$$

Slide 6

Position weight matrix (PWM)
• Two major purposes of PWM
  – To characterize the sequence pattern (the motif)
  – To facilitate the computation of log-odds (or PWM score, PWMS), e.g., computing the PWMS for ATACCATGTCCAA

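$$PWM_{ij} = \log_2\frac{p_{ij}}{p_i} = \log_2\frac{f_{ij}/N}{f_i/(NL)} = \log_2\frac{f_{ij}\,L}{f_i}$$

With a small pseudocount α (0.0001 here) so that zero counts do not give log 0, and with f_i the total count of nucleotide i over all sites:

$$PWM_{ij} = \log_2 L + \log_2\frac{f_{ij} + \alpha f_i}{f_i + \alpha N L}$$

$$PWM_{A,1} = \log_2 13 + \log_2\frac{75 + 0.0001 \times 1563}{1563 + 0.0001 \times 508 \times 13} = -0.67845$$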

Site        A        C        G        U     Std
1     -0.6784   0.3844   0.0329   0.0198  0.4453
2     -0.1939   0.7044  -0.2147  -1.0277  0.7076
3      0.7275  -0.9188   0.3426  -1.6968  1.1214
4      0.0457   0.8320  -1.0080  -0.4328  0.7786
5     -0.9996   1.0578  -0.2247  -1.5941  1.1452
6      2.0617  -4.4258  -9.5877  -9.5881  5.5288
7     -9.5879  -9.5878  -9.5877   2.5313  6.0595
8     -9.5879  -9.5878   1.6025  -9.5881  5.5952
9     -0.2933  -0.8984   0.7124  -0.4328  0.6782
10     0.2309   0.7760  -0.9075  -0.7821  0.8112
11    -1.2910   0.2167   0.4025  -0.0635  0.7626
12    -0.4320   0.1200   0.2295  -0.1519  0.2961
13     0.0105   0.1884  -0.3184   0.2163  0.2459

$$PWMS = \sum_{j=1}^{L} PWM_{S_j,j} = \log_2\frac{L_{Yes}}{L_{No}}$$

RCCAUGG

For ATACCATGTCCAA: PWMS = −0.6784 − 1.0277 + 0.7275 + … + 0.0105
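The PWM table and the PWMS sum can be reproduced with a short sketch. The pseudocount form used here, log2 of L(f_ij + αf_i)/(f_i + αNL) with α = 0.0001, is read off the numeric example on the slide (75, 0.0001, 1563, 508, 13) and should be treated as an assumption about the general rule; `build_pwm` and `pwms` are hypothetical names.

```python
import math

# Site-specific counts (rows: sites 1-13; columns: A, C, G, U) for 508 CDSs.
COUNTS = [
    (75, 173, 171, 89), (105, 216, 144, 43), (199, 70, 212, 27),
    (124, 236, 83, 65), (60, 276, 143, 29), (502, 6, 0, 0),
    (0, 0, 0, 508), (0, 0, 508, 0), (98, 71, 274, 65),
    (141, 227, 89, 51), (49, 154, 221, 84), (89, 144, 196, 79),
    (121, 151, 134, 102),
]
BASES = "ACGU"

def build_pwm(counts, alpha=0.0001):
    """PWM[j][base] = log2(L * (f_ij + alpha * f_i) / (f_i + alpha * N * L)).
    The pseudocount placement is inferred from the slide's numeric example."""
    L = len(counts)
    N = sum(counts[0])  # number of sequences (every row sums to N)
    totals = [sum(row[k] for row in counts) for k in range(4)]  # f_i
    return [
        {b: math.log2(L * (row[k] + alpha * totals[k])
                      / (totals[k] + alpha * N * L))
         for k, b in enumerate(BASES)}
        for row in counts
    ]

def pwms(seq, pwm):
    """Sum the PWM entry of the observed nucleotide at each site."""
    seq = seq.upper().replace("T", "U")
    return sum(pwm[j][b] for j, b in enumerate(seq))

pwm = build_pwm(COUNTS)
score = pwms("ATACCATGTCCAA", pwm)
print(round(pwm[0]["A"], 4))  # -0.6784, as in the table
print(round(score, 2))
```

Under these assumptions the entries agree with the table to four decimals, and the score for ATACCATGTCCAA equals the sum of the thirteen tabulated entries the slide abbreviates with "…".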

Slide 7

[Figure: PWMS over sites. x-axis: Site (0 to 50); y-axis: PWMS (−15 to 15).]

Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG. PWMS computed with pseudocount α = 0.01.

12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
------------- ------------- -------------
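Scanning amounts to sliding a window of PWM width along the sequence and scoring each window. The sketch below shows only the mechanics: `toy_pwm` is a hypothetical 3-site matrix (+1 for A, U, G at sites 1-3, −1 otherwise), not the 13-site PWM behind the figure, so its scores are not comparable with the PWMS values in the caption.

```python
def scan(seq, pwm):
    """Score every window of len(pwm) along seq.
    Returns (start, score) pairs with 0-based window starts."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][seq[start + j]] for j in range(width)))
        for start in range(len(seq) - width + 1)
    ]

# Hypothetical toy matrix: rewards A, U, G at sites 1, 2, 3 respectively.
toy_pwm = [
    {b: (1.0 if b == target else -1.0) for b in "ACGU"}
    for target in "AUG"
]

# The 80-nt sequence shown above, split in two for readability.
seq = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
       "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")

scores = scan(seq, toy_pwm)
best_start, best_score = max(scores, key=lambda pair: pair[1])
print(best_start + 1, seq[best_start:best_start + 3], best_score)
```

Replacing `toy_pwm` with a full 13-site matrix gives one score per 13-mer, which is what the plotted curve represents.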

PWM: position weight matrix
• Also called position-specific scoring matrix (PSSM)
• Used in
  – Characterizing sequence motifs
    • Eukaryotic translation initiation consensus
    • Splicing sites
    • Branchpoint sites
    • Shine-Dalgarno sequences
  – Database searches
    • PHI-BLAST
    • PSI-BLAST
    • RPS-BLAST

Slide 8

Slide 9

BLAST Programs

Program           Database    Query       Typical uses
BLASTN/MEGABLAST  Nucleotide  Nucleotide  MEGABLAST has a longer word size than BLASTN
BLASTP            Protein     Protein     Query a protein/peptide against a protein database
BLASTX            Protein     Nucleotide  Translate a nucleotide sequence into “protein” in six frames and search against a protein database
TBLASTN           Nucleotide  Protein     Unannotated nucleotide sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
TBLASTX           Nucleotide  Nucleotide  Six-frame translation of both query and database
PHI-BLAST         Protein     Protein     Pattern-hit initiated BLAST
PSI-BLAST         Protein     Protein     Position-specific iterated BLAST
RPS-BLAST         Protein     Protein     Reverse PSI-BLAST

Yeast 5’ ss PWM

Slide 10

Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG|GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.

Site    A    C    G    U        χ²          p   PWM(A)   PWM(C)   PWM(G)   PWM(U)
−5     94   32   57   92    11.798  0.0081088   0.0641  −0.7117   0.0245   0.2792
−4    119   47   48   61    14.117  0.0027505   0.4032  −0.1599  −0.2225  −0.3115
−3    139   38   43   55    39.672  0.0000001   0.6268  −0.4651  −0.3805  −0.4601
−2    138   40   36   61    38.899  0.0000001   0.6164  −0.3915  −0.6355  −0.3115
−1     91   45   88   51    27.270  0.0000052   0.0174  −0.2223   0.6492  −0.5685
1       0    1  274    0  1060.426  0.0000004  −8.1042  −5.4675   2.2855  −8.1044
2       0    9    0  266   658.096  0.0000003  −8.1042  −2.5200  −8.1048   1.8081
3     268    1    2    4   522.754  0.0000003   1.5723  −5.4675  −4.6732  −4.1523
4      17   29    1  228   428.607  0.0000002  −2.3805  −0.8528  −5.5454   1.5859
5       2    0  272    1  1041.047  0.0000004  −5.2765  −8.1049   2.2750  −5.8967
6      10    8    2  255   583.545  0.0000003  −3.1271  −2.6862  −4.6732   1.7472
7      97   18   39  121    55.570  0.0000001   0.1092  −1.5351  −0.5206   0.6734
8      95   54   35   91    11.363  0.0099180   0.0793   0.0397  −0.6759   0.2635
9     123   45   34   73    22.172  0.0000601   0.4508  −0.2223  −0.7175  −0.0534
10    118   41   38   78    17.334  0.0006034   0.3911  −0.3560  −0.5579   0.0418
11    105   33   43   94    17.367  0.0005940   0.2232  −0.6676  −0.3805   0.3101
12     90   44   42   99    12.109  0.0070180   0.0015  −0.2546  −0.4142   0.3847

Ma and Xia 2011
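The per-site χ² test in Table 3 can be sketched as a Pearson goodness-of-fit test with 3 degrees of freedom. The snippet below uses the counts at site −5 and the background frequencies as printed in the caption; the closed-form tail probability is specific to 3 degrees of freedom, and the function names are hypothetical. With these rounded inputs the statistic comes out near 11.66 (p near 0.0087), close to but not identical with the tabulated 11.798 and 0.0081, presumably because the published background frequencies are rounded or because of details not shown on the slide.

```python
import math

def chi_square_site(observed, background):
    """Pearson chi-square of one site's counts against background proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, background))

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

background = (0.3279, 0.1915, 0.2043, 0.2763)  # A, C, G, U from the caption
site_minus5 = (94, 32, 57, 92)                 # counts at site -5

chi2 = chi_square_site(site_minus5, background)
p_value = chi2_sf_df3(chi2)
print(round(chi2, 2), round(p_value, 4))
```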

Yeast 3’ss PWM

Slide 11

Site    A    C    G    U      χ²       p   PWM(A)   PWM(C)   PWM(G)   PWM(U)
-12    61   53   34  130    56.0       0  -0.5608   0.0033  -0.7205   0.7648
-11    70   47   20  141    83.2       0  -0.3649  -0.1682  -1.4696   0.8813
-10    79   42   12  145    99.9       0  -0.1926  -0.3285  -2.1802   0.9214
-9     38   30   23  187   219.4       0  -1.2308  -0.8067  -1.2731   1.2867
-8     51   42   27  158   121.6       0  -0.8149  -0.3285  -1.0470   1.0447
-7     91   33   28  126    53.8       0   0.0093  -0.6715  -0.9956   0.7200
-6     95   42   35  106    22.0  0.0001   0.0707  -0.3285  -0.6794   0.4722
-5     93   33   23  129    63.3       0   0.0403  -0.6715  -1.2731   0.7537
-4    136   25   38   79    43.3       0   0.5842  -1.0647  -0.5626   0.0517
-3     12  121    0  145   272.3       0  -2.8223   1.1862  -6.6480   0.9214
-2    277    1    0    0   563.7       0   1.6056  -5.1232  -6.6480  -6.6469
-1      0    0  278    0  1082.7       0  -6.6464  -6.6483   2.2900  -6.6469
1      93   37   73   75     9.7  0.0217   0.0403  -0.5089   0.3691  -0.0226
2      72   64   54   88     8.0  0.0466  -0.3248   0.2729  -0.0619   0.2059
3      90   54   48   86     2.5  0.4771  -0.0065   0.0300  -0.2299   0.1730
4      83   43   54   98     8.7  0.0337  -0.1221  -0.2950  -0.0619   0.3599
5      90   65   37   86    10.6  0.0140  -0.0065   0.2951  -0.6005   0.1730

Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3’ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.

Ma and Xia 2011

Slide 12


PWMS as a proxy of splicing strength

Slide 13

             5' ss                3' ss
             NRG       RG         NRG       RG
PWMS Mean    8.8138    11.1978    5.3129    7.1762
PWMS Var.    31.5069   4.8646     13.3017   8.2077
N            44        202        49        229
t            -4.6346              -3.9257
p            0.0000               0.0001

Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG for non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
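The t statistics in Table 6 can be recovered from the summary statistics alone. A minimal sketch of the equal-variance two-sample t computation follows; the function name `pooled_t` is illustrative.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic assuming equal variances, from summary statistics."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# NRG versus RG, values copied from Table 6.
t_5ss = pooled_t(8.8138, 31.5069, 44, 11.1978, 4.8646, 202)
t_3ss = pooled_t(5.3129, 13.3017, 49, 7.1762, 8.2077, 229)
print(round(t_5ss, 2), round(t_3ss, 2))  # -4.63 -3.93
```

The degrees of freedom are n1 + n2 − 2, that is 244 for the 5' ss and 276 for the 3' ss comparison.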

[Figure: splicing efficiency (y-axis, 0.9 to 1.12) plotted against gene expression (x-axis, 0 to 35). Annotations: "Highly expressed genes should have high splicing efficiency." and "Lowly expressed genes could have their splicing sites drifting to low efficiency."]

Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.

PWMS and Gene Expression

Slide 15

PWMS and Splicing Mechanisms
• Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution
• What does a strongly negative PWMS mean?
• 5’ ss:
  – HAC1: -8.8291
  – HFM1: -7.3825
  – HOP2: -7.8898
• 3’ ss:
  – HAC1: -4.4039
  – REC102: -3.4464
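One way to read these values uses nothing beyond the definition of PWMS as a log-odds. If every site-specific frequency equals the background frequency, each PWM entry vanishes and so does every PWMS:

$$p_{ij} = p_i \;\Rightarrow\; PWM_{ij} = \log_2\frac{p_{ij}}{p_i} = \log_2 1 = 0 \;\Rightarrow\; PWMS = 0$$

Conversely, a PWMS can be turned back into an odds ratio. For the HAC1 5’ ss:

$$\frac{L_{Yes}}{L_{No}} = 2^{PWMS} = 2^{-8.8291} \approx 0.0022$$

Under the PWM model this sequence is therefore roughly 450 times more likely to arise from the background than from the splice-site pattern. What that implies biologically is the question the slide poses.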

Slide 16

Slide 17

Perceptron
• The perceptron is one of the simplest artificial neural networks, invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).
• The perceptron has been used in bioinformatics research since the 1980s:
  – The identification of translational initiation sites in E. coli (Stormo et al., 1982a).
  – Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991).
  – More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.

Slide 18

What perceptron does
• Positive sequences
  POS1 ACGT
  POS2 GCGC
• Negative sequences
  NEG1 AGCT
  NEG2 GGCC
• Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences

Slide 19

Definitions

Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.

Base 1 2 3 4

A 1 1 1 1

C 1 1 1 1

G 1 1 1 1

T 1 1 1 1

POS1 ACGT
POS2 GCGC

NEG1 AGCT
NEG2 GGCC

$$PS = \sum_{j=1}^{L} W_{S_j,j}$$

$$PS_{POS1} = W_{A,1} + W_{C,2} + W_{G,3} + W_{T,4} = 4$$

$$W_{S_j,j} = \begin{cases} W_{S_j,j} + 1, & \text{if } S \text{ is from POS group and } PS < 0 \\ W_{S_j,j} - 1, & \text{if } S \text{ is from NEG group and } PS \ge 0 \\ W_{S_j,j} \text{ (no change)}, & \text{otherwise} \end{cases}$$

One may initialize all values to 0 instead of 1.

For amino acid sequences, the matrix would be 20 by 4.

Slide 20

Iterations and convergence

First round of the training process in the perceptron algorithm. Updated values are highlighted in bold.

(a) NEG1: AGCT, PS = 4, update
Base  1   2   3   4
A     0   1   1   1
C     1   1   0   1
G     1   0   1   1
T     1   1   1   0

(b) NEG2: GGCC, PS = 2, update
Base  1   2   3   4
A     0   1   1   1
C     1   1  -1   0
G     0  -1   1   1
T     1   1   1   0

(c) POS1: ACGT, PS = 2, no update
Base  1   2   3   4
A     0   1   1   1
C     1   1  -1   0
G     0  -1   1   1
T     1   1   1   0

(d) POS2: GCGC, PS = 2, no update
Base  1   2   3   4
A     0   1   1   1
C     1   1  -1   0
G     0  -1   1   1
T     1   1   1   0
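The training round above can be sketched in a few lines. The update rule follows the definitions slide (add 1 for a POS sequence with PS < 0, subtract 1 for a NEG sequence with PS ≥ 0), and sequences are visited in the order shown (NEG1, NEG2, POS1, POS2). The function names and the `max_rounds` safeguard are assumptions added for illustration.

```python
BASES = "ACGT"

def perceptron_score(seq, w):
    """PS = sum over sites j of W[S_j][j]."""
    return sum(w[base][j] for j, base in enumerate(seq))

def train(pos, neg, length, init=1, max_rounds=100):
    """Repeat rounds of sequential updates until a full round changes nothing."""
    w = {b: [init] * length for b in BASES}
    labelled = [(s, False) for s in neg] + [(s, True) for s in pos]
    for _ in range(max_rounds):
        changed = False
        for seq, is_pos in labelled:
            ps = perceptron_score(seq, w)
            if is_pos and ps < 0:
                delta = 1       # positive sequence scored too low
            elif not is_pos and ps >= 0:
                delta = -1      # negative sequence scored too high
            else:
                continue
            for j, base in enumerate(seq):
                w[base][j] += delta
            changed = True
        if not changed:
            break
    return w

w = train(pos=["ACGT", "GCGC"], neg=["AGCT", "GGCC"], length=4)
for base in BASES:
    print(base, w[base])
```

For this example the second round makes no updates, so training stops with the matrix shown in panel (d).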

Slide 21

Post-processing

Base 1 2 3 4

A 0 1 1 1

C 1 1 -1 0

G 0 -1 1 1

T 1 1 1 0

Base 1 2 3 4

A 0 0 0 0

C 0 1 -1 0

G 0 -1 1 0

T 0 0 0 0

POS1 ACGT
POS2 GCGC

NEG1 AGCT
NEG2 GGCC

What is the score for TAAA?

An entry W_{i,j} = 0 means that either there is no data on that cell or the cell has no discriminant power.
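The slide does not spell out the post-processing rule. One rule that reproduces the second matrix from the first is to reset to 0 every cell whose base never occurs at that site in any training sequence; the sketch below applies that assumed rule and scores TAAA before and after. `zero_unobserved` and `score` are hypothetical names.

```python
# Trained matrix from the first table (rows: bases, columns: sites 1-4).
trained = {
    "A": [0, 1, 1, 1],
    "C": [1, 1, -1, 0],
    "G": [0, -1, 1, 1],
    "T": [1, 1, 1, 0],
}
training_seqs = ["ACGT", "GCGC", "AGCT", "GGCC"]

def zero_unobserved(w, seqs):
    """Assumed rule: a cell keeps its weight only if some training
    sequence has that base at that site; otherwise it is set to 0."""
    observed = {(base, j) for s in seqs for j, base in enumerate(s)}
    return {
        base: [v if (base, j) in observed else 0 for j, v in enumerate(row)]
        for base, row in w.items()
    }

def score(seq, w):
    """Sum the weight of the observed base at each site."""
    return sum(w[base][j] for j, base in enumerate(seq))

cleaned = zero_unobserved(trained, training_seqs)
print(score("TAAA", trained), score("TAAA", cleaned))  # 4 0
```

TAAA contains only base/site combinations absent from the training data, so its raw score of 4 merely reflects the initial values of 1, while the post-processed matrix gives 0.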

Doublet perceptron

Slide 22

    1234567890
P1  ACGUAUACGU
P2  ACGUCUACGU
P3  ACGUGUACGU
P4  ACGUUAACGU
P5  ACGUUCACGU
P6  ACGUUGACGU

N1  ACGUAAACGU
N2  ACGUACACGU
N3  ACGUAGACGU
N4  ACGUCAACGU
N5  ACGUCCACGU
N6  ACGUCGACGU
N7  ACGUGAACGU
N8  ACGUGCACGU
N9  ACGUGGACGU
N10 ACGUUUACGU

The same sequences recoded as overlapping doublets (sites 1-9):

    1  2  3  4  5  6  7  8  9
P1  AC CG GU UA AU UA AC CG GU
P2  AC CG GU UC CU UA AC CG GU
P3  AC CG GU UG GU UA AC CG GU
P4  AC CG GU UU UA AA AC CG GU
P5  AC CG GU UU UC CA AC CG GU
P6  AC CG GU UU UG GA AC CG GU

N1  AC CG GU UA AA AA AC CG GU
N2  AC CG GU UA AC CA AC CG GU
N3  AC CG GU UA AG GA AC CG GU
N4  AC CG GU UC CA AA AC CG GU
N5  AC CG GU UC CC CA AC CG GU
N6  AC CG GU UC CG GA AC CG GU
N7  AC CG GU UG GA AA AC CG GU
N8  AC CG GU UG GC CA AC CG GU
N9  AC CG GU UG GG GA AC CG GU
N10 AC CG GU UU UU UA AC CG GU
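The recoding into doublets is a sliding window of width 2. A minimal sketch, with `doublets` as an illustrative name:

```python
def doublets(seq):
    """Overlapping dinucleotides: a sequence of length n gives n - 1 doublets."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

p1 = doublets("ACGUAUACGU")
print(" ".join(p1))  # AC CG GU UA AU UA AC CG GU
```

A 10-nt sequence yields 9 doublet sites, so the weighting matrix of a doublet perceptron has 16 rows (one per dinucleotide) and 9 columns.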

Doublet Perceptron

Slide 23

Doublet  1  2  3     4      5     6  7  8  9
AA       0  0  0     0     -6  -4.3  0  0  0
AC       0  0  0     0     -4     0  0  0  0
AG       0  0  0     0     -2     0  0  0  0
AU       0  0  0     0   8.33     0  0  0  0
CA       0  0  0     0     -4    -1  0  0  0
CC       0  0  0     0     -1     0  0  0  0
CG       0  0  0     0     -1     0  0  0  0
CU       0  0  0     0      5     0  0  0  0
GA       0  0  0     0     -1  -0.7  0  0  0
GC       0  0  0     0     -1     0  0  0  0
GG       0  0  0     0     -1     0  0  0  0
GU       0  0  0     0   3.33     0  0  0  0
UA       0  0  0  -3.7   6.67  5.67  0  0  0
UC       0  0  0    -1      5     0  0  0  0
UG       0  0  0  0.33   3.33     0  0  0  0
UU       0  0  0     4    -11     0  0  0  0

A large amount of data is needed to avoid overfitting: the doublet matrix has 16 rows per site instead of 4.

Gene/Motif Prediction
• Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology).
  – Cellular localization
  – Biological processes the gene (its product) participates in
  – The biological reaction
• Related terms:
  – Motif: e.g., RccAUGG
  – Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively
• Gene/Motif prediction methods
  – Position weight matrix
  – Perceptrons
  – Supervised learning
  – Hidden Markov Models (HMMs)
  – Neural networks (e.g., self-organizing map or SOM)