
Identifying conserved segments in rearranged and divergent genomes

Bob Mau, Aaron Darling, Nicole T. Perna

Presented by Aaron Darling

Comparing genomic architectures

Genome sequence and architecture comparison can lead to insight about organismal:

• Evolutionary forces
• Gene functions
• Phenotypes

Rearrangement, gene gain, loss, and duplication obfuscate homology

Structure of the bacterial chromosome

Origin of replication

Terminus

Replication proceeds simultaneously on each “replichore”

Breakpoints of inversions occur an equal distance from the origin to maintain replichore balance.

(Tillier and Collins 2000, Ajana et al. 2002)

We call such rearrangements “symmetric inversions”

Replichore size difference > 20% is selected against (Guijo et al. 2001)
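As a rough illustration of the 20% criterion, the sketch below measures the two origin-to-terminus arcs of a circular chromosome and reports their relative size difference. The function names and coordinates are hypothetical, and normalising the difference by the larger replichore is an assumption; the slide does not say how the percentage is defined.

```python
def replichore_lengths(genome_len, origin, terminus):
    """Return the lengths of the two origin-to-terminus arcs of a circular chromosome."""
    clockwise = (terminus - origin) % genome_len
    return clockwise, genome_len - clockwise


def replichore_imbalance(genome_len, origin, terminus):
    """Relative size difference of the replichores.

    Assumption: the difference is divided by the larger replichore.
    """
    a, b = replichore_lengths(genome_len, origin, terminus)
    return abs(a - b) / max(a, b)


# Hypothetical 1000-unit chromosome: origin at 0, terminus shifted to 600.
imbalance = replichore_imbalance(1000, 0, 600)
exceeds_threshold = imbalance > 0.20
```

A rearrangement proposed by an inversion-inference tool could be screened this way by recomputing the imbalance after the origin or terminus coordinate changes.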

A dot plot: Each dot is a pairwise (or n-way) local alignment

Goal: Identify local homologous (orthologous) segments

Blue: Same strand

Red: Opposite strand

Tools for segmental homology detection

GRIMM-Synteny (Pevzner et al. 2003, Bourque et al. 2004)
- cluster markers within a fixed distance

FISH (Vision et al. 2003)
- find statistically over-represented clusters of markers within a fixed distance

LineUp (Hampson et al. 2003)
- find collinear runs of markers among pairs of genomes, allowing degeneracy

Some alignment tools: Shuffle-LAGAN (Brudno et al. 2003), Mauve (Darling et al. 2004)

Small segments separated by lineage-specific regions may not be detected by methods based strictly on distance.

Key idea: use a combination of conserved marker order (collinearity) and alignment score

Finding conserved regions: A pseudo-Gibbs sampler method

Given: A set of M monotypic markers M

Do: Assign a posterior probability that any marker m ∈ M is part of a conserved region

Use MCMC methodology to sample the frequency of each marker’s inclusion in high-scoring configurations.

Use frequency as an estimate of “posterior probability”

Finding conserved regions: A pseudo-Gibbs sampler method

Define a configuration X as a vector of length M of binary random variables:

e.g. X = (X_1, X_2, …, X_M)

A configuration value x_j maps marker m_j to either signal (1) or noise (0)

e.g. x = (0,1,0,0,1,1,…,1,0)

There are 2^M possible configurations

Run a Markov chain of length N over configuration space: (X^1, X^2, …, X^N)

Sample possible marker configurations

Start with a random initial configuration, THEN:

Select a marker, sample whether it should be a 0 or 1 based on the current configuration

\[
\mathrm{Score}(m_j \mid \mathbf{x}) = \sum_{v=1}^{L_j} w_v x_v \;+\; w_j \;+\; \sum_{v=1}^{R_j} w_v x_v
\]

First sum: sum of scores for all collinear markers to the left

Middle term: score of marker j

Second sum: sum of scores for all collinear markers to the right

w_v is the score of marker v, x_v is the configuration value (0 or 1)
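The score described above (the marker's own score plus the summed scores of included collinear markers to its left and right) could be computed along the following lines. This is only a sketch for two genomes: treating a marker as collinear with marker j when it lies on the same side of j in both genomes, and ignoring strand, are simplifying assumptions not taken from the slides, and `pos_a`, `pos_b`, and `marker_score` are hypothetical names.

```python
def marker_score(j, x, w, pos_a, pos_b):
    """Score of marker j given configuration x.

    x      -- list of 0/1 configuration values, one per marker
    w      -- list of per-marker scores
    pos_a  -- marker positions in genome A
    pos_b  -- marker positions in genome B
    Assumption: marker v is collinear with j if it is left of j in both
    genomes or right of j in both genomes.
    """
    left = sum(
        w[v] * x[v]
        for v in range(len(w))
        if pos_a[v] < pos_a[j] and pos_b[v] < pos_b[j]
    )
    right = sum(
        w[v] * x[v]
        for v in range(len(w))
        if pos_a[v] > pos_a[j] and pos_b[v] > pos_b[j]
    )
    return left + w[j] + right


# Toy data: markers 0-2 are collinear, marker 3 is out of order in genome B.
w = [1.0, 2.0, 3.0, 4.0]
pos_a = [10, 20, 30, 40]
pos_b = [10, 20, 35, 5]
score_all_on = marker_score(1, [1, 1, 1, 1], w, pos_a, pos_b)
```

Markers set to noise (x_v = 0) contribute nothing, so the score of a marker rises and falls with the current state of its collinear neighbours.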

Transform LCB score to probability

The scale parameter c is used in tandem with the sigmoid to map a marker’s score to a probability:

\[
P(X_j^{n} = 1 \mid \mathbf{x}^{n-1}) = \frac{e^{\mathrm{Score}(m_j)/c}}{1 + e^{\mathrm{Score}(m_j)/c}}
\]
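To make the role of the scale parameter c concrete, here is a minimal sketch of the sigmoid transform. The function name `inclusion_probability` is hypothetical, and the two-branch form is only a numerical-stability choice for large scores, not something stated in the slides.

```python
import math


def inclusion_probability(score, c):
    """Map a marker score to a probability with a sigmoid scaled by c."""
    z = score / c
    if z >= 0:
        # Algebraically equal to e^z / (1 + e^z), but avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The same score is pushed closer to 1 when c is smaller.
p_small_c = inclusion_probability(100.0, 20.0)
p_large_c = inclusion_probability(100.0, 75.0)
```

A smaller c makes the sigmoid steeper, so a given score difference between markers translates into a larger difference in inclusion probability.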

Sample a new value for x_j

Set x_j to 1 with probability given by the marker’s score transformation

First allow the chain a “burn-in” period, then continue for many iterations.

The frequency, or “posterior probability”, of m_j is:

\[
\mathrm{p.p.}(m_j) = \frac{\#\text{ of samples with } x_j = 1}{\#\text{ of samples}}
\]
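The sampling loop (random start, resample one marker at a time, discard a burn-in, then count how often each marker is 1) might be sketched as follows. Here `inclusion_prob` is a hypothetical callable standing in for the score-to-probability transform; choosing the marker uniformly at random and recording the whole configuration after every update are assumptions, since the slides do not specify either.

```python
import random


def sample_inclusion_frequencies(n_markers, inclusion_prob, n_iter, burn_in, seed=0):
    """Estimate each marker's inclusion frequency with a pseudo-Gibbs style chain.

    inclusion_prob(j, x) must return P(x_j = 1) given the current configuration x.
    """
    rng = random.Random(seed)
    # Random initial configuration.
    x = [rng.randint(0, 1) for _ in range(n_markers)]
    ones = [0] * n_markers
    kept = 0
    for it in range(burn_in + n_iter):
        # Select a marker and resample its value given the current configuration.
        j = rng.randrange(n_markers)
        x[j] = 1 if rng.random() < inclusion_prob(j, x) else 0
        if it >= burn_in:
            # After burn-in, tally the current configuration as one sample.
            kept += 1
            for v in range(n_markers):
                ones[v] += x[v]
    return [count / kept for count in ones]


# Toy check with fixed probabilities: frequencies should approach 0.9 and 0.1.
freqs = sample_inclusion_frequencies(
    2, lambda j, x: (0.9, 0.1)[j], n_iter=20000, burn_in=1000, seed=1
)
```

With a real score function the probabilities depend on the neighbouring markers' current values, which is why the chain needs a burn-in before samples are counted.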

Our method assigns each marker a p.p.

Threshold γ separates signal from noise

Our method assigns each marker a p.p.

Using γ = .5, the X pattern appears

Application to 4 divergent Streptococcus

Markers are reciprocal best blastp hits of ORFs among:

S. agalactiae

S. pyogenes

S. pneumoniae

S. mutans

S. pneumoniae
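For readers unfamiliar with reciprocal best hits, the sketch below shows the pairwise idea: an ORF pair is kept only if each is the other's top-scoring hit. The input format (query, subject, score) and the function names are hypothetical, and how hits were combined across all four genomes is not described in the slides, so this covers two genomes only.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits -- iterable of (query, subject, score) tuples
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hit tables with made-up ORF names and scores.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 50)]
b_vs_a = [("b1", "a1", 210), ("b2", "a1", 160), ("b2", "a2", 80)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In the toy data only a1 and b1 choose each other, so a2 and a3 yield no marker.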

What is the distribution of segment sizes in Streptococci?

As resolution increases, large segments are broken up by smaller segments

[Figure: four histograms of Number of LCBs vs. Segment sizes (Markers per segment), one per parameter setting]

“Low resolution”: c = 75, γ = .45 (26 total segments)

“Medium resolution”: c = 30, γ = .45 (32 total segments)

“High-1 resolution”: c = 20, γ = .50 (57 total segments)

“High-2 resolution”: c = 20, γ = .30 (72 total segments)

What was the ancestral genome organization?

Try building inversion phylogeny by applying GRIMM and MGR to the 57 high resolution segments



Failed: The suggested rearrangements do not maintain replichore balance




Try using the 26 larger, low resolution segments

Surprise! A success:

Transforming S. agalactiae into S. pyogenes

Conclusions

- The pseudo-Gibbs sampler method detects collinear segments at a variety of scales

- It would be nice to have an inversion phylogeny inference tool that accounts for replichore balance!

- Large segments in Streptococci appear to rearrange by symmetric inversions

- Small segments? An open problem.

Future directions

Can a biologically relevant full joint probability distribution be expressed over configurations?

- If so, then a true Gibbs sampler could be employed

Problems:

- Some rearrangements occur with different frequency (e.g. symmetric inversions about the terminus vs. IS-mediated translocation)

- Distinguish rearrangement by horizontal transfer (H.T.), gene duplication and subsequent loss, symmetric inversion, etc.

Acknowledgements

Bob Mau, who did most of this work

My Ph.D. advisers: Nicole T. Perna and Mark Craven

Others who have contributed insight: Jeremy Glasner, Fred Blattner, Eric Cabot

GEL@UW-Madison

Grant money: NIH Grant GM62994-02. NLM Training Grant 5T15M007359-03 to A.E.D.
