Klaudia Walter, Wally Gilks, Lorenz Wernisch 12 th December 2006 HUMAN Modelling the Boundary of Highly Conserved Non-Coding DNA


Page 1

Klaudia Walter, Wally Gilks, Lorenz Wernisch

12th December 2006

HUMAN

Modelling the Boundary of Highly Conserved Non-Coding DNA

Page 2

Overview

• Background

– What are CNEs?

– A+T nucleotide frequency in and around CNEs

• Phylogenetic Model

– What is a phylogenetic tree model?

– Likelihood of a tree model

– Likelihood of the scaling of a tree

– Likelihood of CNE boundary

– Variable CNE boundaries for each species

Page 3

Motivation

• DNA sequences that are conserved between organisms are likely to have special functions.

• The Fugu genome represents a good model for finding conserved non-coding elements (CNEs) in the human genome.

• Are conserved regions different from their neighbouring sequences in the genome?

• Is it possible to define CNE boundaries better than with pairwise sequence alignment of Fugu and human?

Page 4

What are CNEs?

Page 5

Multiple Alignment of Mouse, Rat, Human and Fugu

Page 6

Fugu Genome

• The Fugu genome contains only about 400 Mb.

• That is only about an eighth of the size of the human genome.

• The gene repertoire is similar to that of human.

• Human and Fugu shared their last common ancestor about 450 million years ago.

(Brenner et al, 1993; Aparicio et al, 2002)

Page 7

Conserved Non-coding Elements (CNE)

• 1373 CNEs identified in human and Fugu

• 93 - 740 bp long; 68 - 98% identical

• Situated around developmental genes

• Can act over a distance of 1 Mb, e.g. Shh expression (Lettice et al, 2003; Nobrega et al, 2003; Kleinjan & van Heyningen, 2004)

• Likely to be fundamental to vertebrate life

(Dermitzakis et al, 2002, 2003; Margulies et al, 2003; Bejerano et al 2004a; Woolfe et al, 2005)

Page 8

Are vertebrate CNEs enhancers?

[Figure: conservation around the SOX21 gene in the Fugu / Mouse, Fugu / Human and Fugu / Rat comparisons. Legend: coding exon; conserved non-coding sequence. Labelled elements: 1, 4, 5, 8-10 and 19.]

Page 9

[Figure: sox21 gene, element 19. Labels: central nervous system, forebrain, eye.]

(Woolfe et al, 2005; McEwen et al, 2006)

Page 10

CNE Target

Model of duplication of cis-element and target gene

(Vavouri et al, 2006; McEwen et al, 2006)

Page 11

A+T base frequency in CNEs

Page 12

Position Specific Base Composition

Upstream flanking region | Conserved non-coding

ACTAGCCTCATCGTAGCGCAATTCTAGATGATAACA
TACCGAGTTCGGTAGGAGCTTAGTATGAGCATAACG
CGTGTGCTAGGTCACGGCGCAACATACTTATAGACT
ACGCCCTTGCACGATCCGGATATCATAGTCTTACAA

A = 0.00, C = 0.25, G = 0.50, T = 0.25

A = 0.50, C = 0.00, G = 0.25, T = 0.25
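The position-specific base composition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `column_frequencies` is hypothetical, and the slide's example sequence is read here as four rows of 36 bases, which is an assumption about the original layout.

```python
from collections import Counter

def column_frequencies(sequences, bases="ACGT"):
    """Return, for each alignment column, the relative frequency of each base."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts[b] for b in bases)
        profile.append({b: counts[b] / total for b in bases})
    return profile

# The slide's example sequence, split into four rows of 36 bases (assumed layout).
rows = [
    "ACTAGCCTCATCGTAGCGCAATTCTAGATGATAACA",
    "TACCGAGTTCGGTAGGAGCTTAGTATGAGCATAACG",
    "CGTGTGCTAGGTCACGGCGCAACATACTTATAGACT",
    "ACGCCCTTGCACGATCCGGATATCATAGTCTTACAA",
]
profile = column_frequencies(rows)
fifth_column = profile[4]    # bases G, G, T, C
last_column = profile[-1]    # bases A, G, T, A
```

Under this reading, the fifth and the last column reproduce the two frequency vectors shown on the slide. Averaging such profiles over many CNEs aligned at their boundaries would give position-specific A+T curves of the kind shown on the following slides.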

Page 13

A+T relative frequency across CNE boundaries in Fugu and human

(Walter et al, 2005)

Page 14

A+T relative frequency across 2000 genes in human chromosome 1

Genes were aligned at the start and the end.

Page 15

Distribution of Position Weight Matrix (PWM) Scores for CNEs and Random Sequences

A position weight matrix (PWM) is constructed by dividing the nucleotide probabilities by the expected background probabilities.

p(b,i) = probability of base b in position i

p(b) = background probability of base b

\[ S = \sum_{i=1}^{n} \log_2 \frac{p(b,i)}{p(b)} \]
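As a rough sketch of how such a score could be computed, the snippet below sums the base-2 log ratios of position-specific to background probabilities along a sequence. The function name `pwm_score` and the toy probability tables are illustrative assumptions, not values from the talk; zero probabilities would need pseudocounts, which are not handled here.

```python
import math

def pwm_score(sequence, position_probs, background):
    """Sum of log2(p(b, i) / p(b)) over positions i, with b the base at position i."""
    if len(sequence) != len(position_probs):
        raise ValueError("sequence and matrix must have the same length")
    return sum(
        math.log2(position_probs[i][b] / background[b])
        for i, b in enumerate(sequence)
    )

# Toy three-position matrix and a uniform background (made-up numbers).
background = {b: 0.25 for b in "ACGT"}
position_probs = [
    {"A": 0.5, "C": 0.125, "G": 0.125, "T": 0.25},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]
score_good = pwm_score("ACT", position_probs, background)   # 1 + 0 + 1
score_poor = pwm_score("CAG", position_probs, background)   # -1 + 0 + 0
```

Sequences that follow the matrix score above zero, while sequences that look like background score around or below zero, which is what separates the CNE and random-sequence score distributions.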

Page 16

Scores for Fugu CNEs

Page 17

Scores for Human CNEs

Page 18

The sequence logo for the 100 top scoring CNEs.

Page 19

What do CNEs do?

• Some CNEs enhance GFP (green fluorescent protein) expression in zebrafish embryos.

• The function of CNEs is still unknown.

• More lab experiments are necessary.

• Are CNEs defined well enough for experiments?

Page 20

Conservation pattern across CNE boundaries

1373 Fugu-human CNE pairs plus 100 bp flanking regions, aligned using the Needleman-Wunsch algorithm.

Page 21

A+T frequency in Fugu, Human, Worm and Fly

(Glazov et al, 2005; Vavouri et al, 2006 (submitted))

Page 22

Are CNE ends well defined?

• Different parameter settings produce different alignments.

• Even just different mismatch penalties change

– the alignments

– the A+T bias at the CNE boundaries
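The sensitivity to the mismatch penalty can be seen in a toy global alignment. The following is a minimal Needleman-Wunsch sketch with a linear gap penalty; the scoring values and the two-base example are illustrative assumptions and not the settings used for the CNE alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

mild = needleman_wunsch("AT", "AC", mismatch=-1)    # (0, "AT", "AC")
harsh = needleman_wunsch("AT", "AC", mismatch=-5)   # (-3, "A-T", "AC-")
```

With the mild penalty the mismatching bases are aligned to each other; with the harsh penalty two gaps become cheaper than one mismatch, so the aligned columns change. At the end of a conserved block the same effect moves the apparent boundary.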

Page 23

A+T frequency for Fugu CNEs using pairwise alignments with Human

Page 24

Phylogenetic Model

Page 25

Multiple sequence alignment

          5’ flanking   conserved
Human     ACAGTAT       ATCGTAAT
Mouse     ACCGTAT       ATCGTAAT
Chicken   AACGTAT       ATCGTAAT
Xenopus   CCACTAT       ATCGTAAT
Fugu      CGACTTA       ATCGTAAT
Length    300 bp        100 bp

The boundary lies between the two blocks.

Page 26

Phylogenetic tree model

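• Substitution rate matrix

– Continuous-time Markov process

• Tree topology

• Branch lengths

• Scaling of tree

\[ Q = \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix} \]

[Figure: tree with leaves H, M, C and F.]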

Page 27

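Matrix P(t) of substitution probabilities for branch length t

\[ P(t) = \exp(Qt) = \sum_{i=0}^{\infty} \frac{(Qt)^i}{i!} \]

Q should be diagonalizable. If Q is not symmetric, we need to find the eigensystem of a symmetric matrix S related to Q and convert the results to the eigensystem of Q.

Example:

\[ Q = \begin{pmatrix} \cdot & a\pi_C & b\pi_G & c\pi_T \\ a\pi_A & \cdot & d\pi_G & e\pi_T \\ b\pi_A & d\pi_C & \cdot & f\pi_T \\ c\pi_A & e\pi_C & f\pi_G & \cdot \end{pmatrix}, \qquad \pi = (\pi_A, \pi_C, \pi_G, \pi_T) \]

The diagonal entries are set so that each row sums to zero.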
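To make the series for P(t) concrete, here is a small pure-Python sketch that builds a rate matrix of the example form (off-diagonal entries equal to an exchangeability times the target base frequency, rows summing to zero) and sums a truncated power series. The function names, the parameter values and the number of terms are assumptions chosen for illustration; for long branches an eigendecomposition is the more robust route.

```python
import math

def gtr_rate_matrix(exch, pi):
    """Q with q_ij = s_ij * pi_j for i != j and rows summing to zero.

    exch: exchangeabilities (a, b, c, d, e, f) for the pairs AC, AG, AT, CG, CT, GT.
    pi:   base frequencies (pi_A, pi_C, pi_G, pi_T).
    """
    a, b, c, d, e, f = exch
    s = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    q = [[s[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

def mat_mul(x, y):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transition_matrix(q, t, terms=60):
    """P(t) = sum over i >= 0 of (Qt)^i / i!, truncated after `terms` terms."""
    qt = [[q[i][j] * t for j in range(4)] for i in range(4)]
    term = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    total = [row[:] for row in term]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, qt)]
        total = [[total[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return total

pi = (0.3, 0.2, 0.2, 0.3)                                  # made-up frequencies
q = gtr_rate_matrix((1.0, 2.0, 1.0, 1.0, 2.0, 1.0), pi)    # made-up exchangeabilities
p = transition_matrix(q, 0.5)

# Sanity check against the Jukes-Cantor closed form (all rates equal).
jc = transition_matrix(gtr_rate_matrix((1, 1, 1, 1, 1, 1), (0.25,) * 4), 0.5)
jc_same_expected = 0.25 + 0.75 * math.exp(-0.5)
```

Each row of `p` is a probability distribution over the base at the end of a branch of length 0.5, given the base at its start.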

Page 28

Estimating A+T frequency around Fugu CNE boundary

[Figure: relative A+T frequency.]

Page 29

[Figure: the tree over Mouse, Fugu, Xenopus, Chicken and Human, drawn twice: once labelled "Conserved" with scaling C and once labelled "Flanking" with scaling F.]

Phylogenetic tree with conserved and flanking scalings

Page 30

[Figure: scale against position, marking the flanking scaling F, the conserved scaling C and the boundary position.]

What is the optimal scaling?

Page 31

          5’ flanking    conserved
Human     ACA G TAT      ATCGTAAT
Mouse     ACC G TAT      ATCGTAAT
Chicken   AAC G TAT      ATCGTAAT
Xenopus   CCA C TAT      ATCGTAAT
Fugu      CGA C TTA      ATCGTAAT

Compute likelihood of scaling

Felsenstein’s algorithm: P(s | T, θ)

Page 32

Felsenstein’s algorithm

The “pruning” algorithm by Felsenstein (1973, 1981) uses dynamic programming to calculate the likelihood of a tree model, P(S | T, θ).

Recursion:

• If u is a leaf

– If x_u = a, then \( P(L_u \mid a) = 1 \)

– Otherwise, \( P(L_u \mid a) = 0 \)

• Otherwise

\[ P(L_u \mid a) = \left( \sum_b P(b \mid a, t_v)\, P(L_v \mid b) \right) \left( \sum_c P(c \mid a, t_w)\, P(L_w \mid c) \right) \]

[Figure: node u in state a with child nodes v (state b) and w (state c).]
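The recursion can be sketched for a single alignment column as follows. This is an illustration only: it swaps in the Jukes-Cantor model for the substitution probabilities, and the tree topology, branch lengths and uniform root distribution are hypothetical choices, not those of the talk. The `scale` argument multiplies every branch length, mimicking a scaling of the tree.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability P(b | a, t); an assumed, deliberately simple model."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial_likelihoods(node, column, scale=1.0):
    """Return {a: P(L_u | a)} for the subtree rooted at `node`.

    A leaf is a species name; an inner node is ((child_v, t_v), (child_w, t_w)).
    """
    if isinstance(node, str):
        # Leaf: 1 for the observed base, 0 for all others.
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    (v, t_v), (w, t_w) = node
    lv = partial_likelihoods(v, column, scale)
    lw = partial_likelihoods(w, column, scale)
    return {
        a: sum(jc_prob(a, b, scale * t_v) * lv[b] for b in BASES)
           * sum(jc_prob(a, c, scale * t_w) * lw[c] for c in BASES)
        for a in BASES
    }

def column_likelihood(tree, column, scale=1.0):
    """Likelihood of one column, with a uniform distribution at the root."""
    root = partial_likelihoods(tree, column, scale)
    return sum(0.25 * root[a] for a in BASES)

# Hypothetical topology and branch lengths for the five species.
mammals = (("Human", 0.1), ("Mouse", 0.1))
amniotes = ((mammals, 0.1), ("Chicken", 0.2))
tetrapods = ((amniotes, 0.1), ("Xenopus", 0.3))
tree = ((tetrapods, 0.1), ("Fugu", 0.4))

variable_column = {"Human": "G", "Mouse": "G", "Chicken": "G", "Xenopus": "C", "Fugu": "C"}
conserved_column = {name: "A" for name in variable_column}
lik_slow = column_likelihood(tree, conserved_column, scale=0.1)
lik_fast = column_likelihood(tree, conserved_column, scale=1.0)
```

A fully conserved column is more likely under a small scaling than under a large one, which is the signal a conserved versus flanking scaling is meant to pick up.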

Page 33

Likelihood of scaling

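• Calculate the likelihood P(S | T, θ) of the scaling vector θ by summing over the boundary b.

• Assume evolutionary independence of each position i in the multiple alignment S.

• P(S | T, θ_b) is calculated by Felsenstein’s algorithm.

\[ P(S \mid T, \theta) = \sum_{b=1}^{N} P(S \mid T, \theta_b)\, P(\theta_b) \]

Here θ_b denotes the scaling vector θ with the boundary placed at position b.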
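The sum over boundary positions can be organised with prefix and suffix sums of per-column log-likelihoods. The sketch below assumes those per-column values have already been computed (for instance by the pruning algorithm) under the flanking and the conserved scaling; the function names, the uniform prior over boundaries and the toy numbers are illustrative assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def boundary_log_likelihoods(log_flank, log_cons):
    """Log-likelihood of the alignment for each boundary b = 1..N.

    Columns before b use the flanking scaling; columns from b onwards use the
    conserved scaling. Columns are treated as independent.
    """
    n = len(log_flank)
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + log_cons[i]
    out, prefix = [], 0.0
    for b in range(1, n + 1):
        out.append(prefix + suffix[b - 1])   # b - 1 flanking columns, the rest conserved
        prefix += log_flank[b - 1]
    return out

def scaling_log_likelihood(log_flank, log_cons):
    """Log of the boundary-averaged likelihood, with a uniform prior on b (assumed)."""
    per_b = boundary_log_likelihoods(log_flank, log_cons)
    log_prior = -math.log(len(per_b))
    return log_sum_exp([v + log_prior for v in per_b])

def boundary_posterior(log_flank, log_cons):
    """Posterior over b under the same uniform prior."""
    per_b = boundary_log_likelihoods(log_flank, log_cons)
    total = log_sum_exp(per_b)
    return [math.exp(v - total) for v in per_b]

# Toy per-column likelihoods: two flanking-like columns, then two conserved-like ones.
log_flank = [math.log(x) for x in (0.2, 0.2, 0.05, 0.05)]
log_cons = [math.log(x) for x in (0.05, 0.05, 0.3, 0.3)]
likelihood = math.exp(scaling_log_likelihood(log_flank, log_cons))
posterior = boundary_posterior(log_flank, log_cons)
```

Working in log space avoids underflow when the alignment has hundreds of columns, and the same per-boundary terms give a posterior over the boundary position.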

Page 34

Model with common scaling and individual boundaries

\[ P(\theta \mid S_1, \ldots, S_n) \propto P(S_1, \ldots, S_n \mid \theta)\, P(\theta) = \prod_{i=1}^{n} P(S_i \mid \theta)\, P(\theta) \]

Probability of scaling θ given sequences S1, …, Sn
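One simple way to evaluate such a posterior is on a grid of candidate scalings. The sketch below assumes the log-likelihood of each CNE at each grid point is already available; the function name `grid_posterior` and the toy numbers are hypothetical.

```python
import math

def grid_posterior(log_liks, log_prior):
    """Posterior over a grid of scalings shared by all CNEs.

    log_liks[i][k] is the log-likelihood of CNE i at grid point k;
    log_prior[k] is the log prior of grid point k.
    """
    k = len(log_prior)
    log_post = [log_prior[j] + sum(row[j] for row in log_liks) for j in range(k)]
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]

# Two toy CNEs evaluated at three candidate scalings, with a uniform prior.
log_liks = [
    [math.log(x) for x in (0.1, 0.4, 0.2)],
    [math.log(x) for x in (0.2, 0.5, 0.1)],
]
log_prior = [math.log(1 / 3)] * 3
posterior = grid_posterior(log_liks, log_prior)
```

Because the CNEs are treated as independent given the scaling, their log-likelihoods simply add, and each additional CNE sharpens the posterior.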

Page 35

Likelihood of scaling over CNEs

Page 36

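Hierarchical model for (θ_C, θ_F)

\[ P(S_1, S_2, \ldots, S_n \mid \mu, \Sigma) = \prod_{i=1}^{n} P(S_i \mid \mu, \Sigma) \]

\[ P(S \mid \mu, \Sigma) = \int P(S \mid \theta_C, \theta_F)\, P(\theta_C, \theta_F \mid \mu, \Sigma)\, d\theta_C\, d\theta_F \]

[Figure: each sequence S_1, S_2, S_3, ..., S_n has its own scaling pair (θ_C, θ_F)_1, (θ_C, θ_F)_2, (θ_C, θ_F)_3, ..., (θ_C, θ_F)_n.]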

Page 37

[Figure: plot with axes for the conserved scaling C and the flanking scaling F.]

Multivariate log normal distribution for (C, F)

Page 38

Likelihood of boundary b

• The likelihood of the boundary is computed by summing over scalings θ.

• b and θ are independent.

• Prior on θ.

\[ P(S \mid b) = \sum_{\theta} P(S \mid b, \theta)\, P(\theta) \]

Page 39

Likelihood of boundary b

Page 40

Boundary shifts for phylogenetic model

Boundary shift         0 bp   ≤ 20 bp   ≤ 50 bp   ≤ 100 bp
Cumulative frequency   12%    40%       61%       80%

[Figure: density against position.]

Page 41

Relative conservation by position

Page 42

Model for variable boundary

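Branches (rows) by positions (columns), with one position set apart:

000000 0 11111111
000011 1 11111111
000011 1 11111111
000000 0 00111111
000000 0 00111111
000000 0 00111111
000000 1 11111111
000000 1 11111111

[Figure: tree over H, M, C, X and F with each branch labelled 0 or 1.]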

Page 43

Transitions

1. 0000 0001 0010 0011 ......... 1111

2. 0000 0001 0010 0011 ......... 1111

3. 0000 0001 0010 0011 ......... 1111

...... ...... ...... ...... ......
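The state space on this slide, one binary vector per alignment position with one bit per branch, can be enumerated directly. The sketch below uses four branches as on the slide. The rule that a bit may switch from 0 to 1 but not back along the alignment is an assumption read off the example matrix on the previous slide, where every branch row is a run of 0s followed by a run of 1s; the talk does not state the transition rules, and the names here are hypothetical.

```python
from itertools import product

def branch_states(n_branches):
    """All binary vectors with one bit per branch, as strings such as '0011'."""
    return ["".join(bits) for bits in product("01", repeat=n_branches)]

def allowed(prev, nxt):
    """Assumed constraint: a branch that has switched to 1 stays at 1."""
    return all(p <= q for p, q in zip(prev, nxt))

states = branch_states(4)
transitions = {s: [t for t in states if allowed(s, t)] for s in states}
```

Under this assumed constraint the all-zero state can move to any state, while the all-one state can only stay where it is, so each branch ends up with its own boundary position.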

Page 44

Variable boundary for CNE1031

Human     AGTAGTTTCC   ATGCCTGTCA
Mouse     AGGAGCCTCT   ATGCCTGTCA
Chicken   AGTAGTTTCC   ATGCCTGTCA
Xenopus   -GTTATATAC   ACGCCTGTCA
Fugu      AATAGTTCCC   ATGCCTGTCA
          10 bp        10 bp

Boundary shift = 154 bp

Page 45

Variable boundary for CNE1043

Human     TGATGTTGAA   TCATTTAAAA
Mouse     TGATGTGTAG   TCATTTAAAA
Chicken   TGACGTTCAG   TCAGTTAAAA
Xenopus   TGACACTCAA   TCATTTAAAT
Fugu      TGACGCGCAG   TCAGTTAAAT
          10 bp        10 bp

Boundary shift = 0 bp

Page 46

Variable boundary for CNE1037

Human     TA-GGCCATT   CTGATTTGTA
Mouse     TA-GGCCATT   CTGATTTGTA
Chicken   TA-GGCCATT   CTGATTTGTA
Xenopus   AA-GACCATA   CTGATTTTTT
Fugu      TGTGGTAGGT   CTGATTTGTA
          10 bp        10 bp

Boundary shift = 65 bp

Page 47

Conservation structure of CNEs

Page 48

Summary

• Statistical models for CNE boundaries that incorporate phylogenetic information.

• Aim is to define location of CNE boundaries more reliably than pairwise or multiple sequence alignments.

Page 49

Acknowledgments

Greg Elgar (Queen Mary College, University of London)

Irina Abnizova

Gayle McEwen (MRC Biostatistics Unit, Cambridge)

Krys Kelly

Brian Tom

Tanya Vavouri (QMUL & Sanger Institute, Hinxton)

Adam Woolfe (NHGRI, National Institutes of Health, US)

Yvonne Edwards (University College, University of London)

Martin Goodson

Page 50

References

• Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3(1).

• Walter K, Abnizova I, Elgar G, Gilks WR. Striking nucleotide frequency pattern at the borders of highly conserved vertebrate non-coding sequences. Trends Genet. 2005, 21(8):436-40.

• Vavouri T, McEwen GK, Woolfe A, Gilks WR, Elgar G. Defining a genomic radius for long-range enhancer action: duplicated conserved non-coding elements hold the key. Trends Genet. 2006, 22(1):5-10.

• McEwen GK, Woolfe A, Goode D, Vavouri T, Callaway H, Elgar G. Ancient duplicated conserved noncoding elements in vertebrates: a genomic and functional analysis. Genome Res. 2006, 16(4):451-65.

• Vavouri T, Walter K, Gilks WR, Lehner B, Elgar G. Parallel evolution of conserved noncoding elements that target a common set of developmental regulatory genes from worms to humans. Submitted 2006.

Page 51

Human CNE boundary

[Figure: A+T frequency against position in two panels, MegaBLAST and Phylogenetic.]

Page 52

Chicken CNE boundary

[Figure: A+T frequency against position in two panels, MegaBLAST and Phylogenetic.]

Page 53

Fugu CNE boundary

[Figure: A+T frequency against position in two panels, MegaBLAST and Phylogenetic.]

Page 54

From rate matrix Q to probability matrix P

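\[ p' = pQ = (p_A, p_C, p_G, p_T) \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix} \]

\[ p'_A = p_A q_{AA} + p_C q_{CA} + p_G q_{GA} + p_T q_{TA} = -p_A (q_{AC} + q_{AG} + q_{AT}) + p_C q_{CA} + p_G q_{GA} + p_T q_{TA} \]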

Page 55

P(t) of substitution probabilities (2)

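\[ S = \operatorname{diag}(\pi)^{1/2}\, Q\, \operatorname{diag}(\pi)^{-1/2} \]

is symmetric with \( \pi = (\pi_A, \pi_C, \pi_G, \pi_T) \).

S and Q have the same eigenvalues.

\[ \exp(St) = V \operatorname{diag}(\exp(\lambda t))\, V^T \]

\[ \exp(Qt) = \operatorname{diag}(\pi)^{-1/2} \exp(St) \operatorname{diag}(\pi)^{1/2} \]

\[ P(t) = \exp(Qt) \]

Here λ denotes the eigenvalues of S and the columns of V are the corresponding eigenvectors.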
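A short derivation may help to see why this works. It assumes that Q is reversible, with off-diagonal entries of the form q_ij = s_ij π_j for symmetric exchangeabilities s_ij = s_ji, and writes D = diag(π); these symbols are introduced here for illustration.

\[ S_{ij} = \pi_i^{1/2}\, q_{ij}\, \pi_j^{-1/2} = s_{ij} \sqrt{\pi_i \pi_j} = S_{ji} \quad (i \neq j) \]

\[ \exp(Qt) = \sum_{k=0}^{\infty} \frac{\left( D^{-1/2} S D^{1/2} \right)^k t^k}{k!} = D^{-1/2} \left( \sum_{k=0}^{\infty} \frac{(St)^k}{k!} \right) D^{1/2} = D^{-1/2} \exp(St)\, D^{1/2} \]

The inner factors D^{1/2} D^{-1/2} cancel in every power, so only the symmetric matrix S has to be diagonalized, and a symmetric matrix always has real eigenvalues and an orthogonal eigenvector matrix V.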