spectrum-based de novo repeat detection in genomic sequences

SPECTRUM-BASED DE NOVO REPEAT DETECTION IN GENOMIC SEQUENCES Do Huy Hoang


Post on 23-Feb-2016

Page 1: Spectrum-based  de novo  repeat detection in genomic sequences

SPECTRUM-BASED DE NOVO REPEAT DETECTION IN GENOMIC SEQUENCES

Do Huy Hoang

Page 2: Spectrum-based  de novo  repeat detection in genomic sequences

OUTLINE

Introduction
  What is a repeat?
  Why study repeats?
  Related work
SAGRI
  Algorithm
  Analysis
  Evaluation

Page 3: Spectrum-based  de novo  repeat detection in genomic sequences

INTRODUCTION

Page 4: Spectrum-based  de novo  repeat detection in genomic sequences

WHAT IS A REPEAT? (DEFINITION)

[General]: Nucleotide sequences occurring multiple times within a genome.

[CompBio]: Given a genome sequence S, find a string P that occurs at least twice in S (allowing some errors).

Page 5: Spectrum-based  de novo  repeat detection in genomic sequences

WHAT IS A REPEAT? (FUNCTION)

Motifs
  Very short repeats (10-20 bp)
  Transcription factor binding sites
Long and short interspersed elements (LINE, SINE)
  Jumping genes
Genes and pseudogenes
Tandem repeats
  Simple short sequence repeats, e.g. (A)n, (CGG)n

Page 6: Spectrum-based  de novo  repeat detection in genomic sequences

WHY STUDY REPEATS? (1)

Eukaryotic genomes contain a lot of repeats.
  E.g. the human genome contains 50% repeats.

Repeats are believed to play an important role in evolution and disease.
  E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

Repeats are important to chromatin structure.
  Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
  Heterochromatin is known to contain many SINE and LINE repeats.

Page 7: Spectrum-based  de novo  repeat detection in genomic sequences

WHY STUDY REPEATS? (2)

Repeats complicate sequence assembly and genome comparison.
  Many people remove repeats before they analyze the genome.

Repeats set hurdles for microarray probe signal analysis.
  The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.

Repeats may contribute to human diversity more than genes.

Repeats can be used as a DNA fingerprint.

Page 8: Spectrum-based  de novo  repeat detection in genomic sequences

STEPS IN REPEAT FINDING

Repeat library (RepeatMasker)
De novo repeat discovery (two steps):
  Identification of repeats
  Classification of repeats

Page 9: Spectrum-based  de novo  repeat detection in genomic sequences

SAGRI ALGORITHM

Page 10: Spectrum-based  de novo  repeat detection in genomic sequences

ALGORITHM OUTLINE

Input: a text G

FindHit phase: finds all candidate second occurrences of repeat regions
ACGACGCGATTAACCCTCGACGTGATCCTC

Validation phase: uses the hits from phase 1 to find all pairs of repeats
ACGACGCGATTAACCCTCGACGTGATCCTC

Page 11: Spectrum-based  de novo  repeat detection in genomic sequences

SPECTRUM-BASED REPEAT FINDER

What is a spectrum?

Given a string G, its spectrum is the set of all k-mers occurring in G.

E.g. k = 3, G = ACGACGCTCACCCT

The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA.

CTC is a k-mer occurring at position 7. ACG is a k-mer occurring at positions 1 and 4.
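The spectrum example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `kmer_positions` is made up here and is not part of SAGRI. Positions are reported 1-based to match the slide.

```python
from collections import defaultdict

def kmer_positions(g, k):
    """Map each k-mer of g to the 1-based positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(g) - k + 1):
        positions[g[i:i + k]].append(i + 1)  # the slides count from 1
    return dict(positions)

index = kmer_positions("ACGACGCTCACCCT", 3)
print(sorted(index))                # the spectrum, in alphabetical order
print(index["CTC"], index["ACG"])   # [7] [1, 4]
```

The keys of the dictionary are the spectrum itself; keeping the positions as values is a convenience for the later phases, not something the definition requires.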

Page 12: Spectrum-based  de novo  repeat detection in genomic sequences

OBSERVATION 1: HOW TO FIND CANDIDATE REGIONS CONTAINING REPEATS?

Two repeat regions should share some k-mers.

E.g. the following repeats share CGA:

ACGACGCGATTAACCCTCGACGTGATCCTC

Page 13: Spectrum-based  de novo  repeat detection in genomic sequences

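FEASIBLE EXTENSION (BUD)

G = ACGACGTGATTAACCCTCGACGTGATCCTC, where i marks the second occurrence of CGA (position 18).

Given the spectrum S for G[1..i-1], the k-mer CGA at position i can be extended by one nucleotide as follows:

  A  X (GAA is not in S)
  C  feasible extension (GAC is in S)
  G  X (GAG is not in S)
  T  feasible extension (GAT is in S)

Note: T is called a fooling probe! GAT is in S, but the text at position i continues with C, not T.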
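A minimal Python sketch of the feasible-extension test on this slide's example, assuming k = 3 and i = 18 (the second occurrence of CGA). The helper names `spectrum_of_prefix` and `feasible_extensions` are hypothetical and chosen only for illustration.

```python
def spectrum_of_prefix(g, i, k):
    """Set of k-mers lying entirely inside g[1..i-1] (i is 1-based)."""
    prefix = g[:i - 1]
    return {prefix[j:j + k] for j in range(len(prefix) - k + 1)}

def feasible_extensions(spectrum, kmer):
    """Nucleotides c for which the next k-mer, kmer[1:] + c, is in the spectrum."""
    return [c for c in "ACGT" if kmer[1:] + c in spectrum]

g = "ACGACGTGATTAACCCTCGACGTGATCCTC"
i, k = 18, 3
s = spectrum_of_prefix(g, i, k)
current = g[i - 1:i - 1 + k]
print(current, feasible_extensions(s, current))  # CGA ['C', 'T']
```

The test only consults the spectrum, which is why T passes even though CGAT never occurs in the prefix: the spectrum records which k-mers were seen, not where.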

Page 14: Spectrum-based  de novo  repeat detection in genomic sequences

OBSERVATION 2

A path of feasible extensions may be a repeat.

Example: G = ACGACGCTATCGATGCCCTC

The spectrum S for G[1..10] is ACG, CGA, CGC, CTA, GAC, GCT, TAT.

Starting from position 11, there exists a path of feasible extensions: CGA-C-G-C.

This path corresponds to the length-6 substring at position 2 (CGACGC). It has one mismatch compared with the length-6 substring at position 11 (CGATGC).
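The example can be checked mechanically. The short Python sketch below (helper name `hamming` is an illustrative choice, k = 3 as in the slide) confirms that every k-mer on the path is in the spectrum of G[1..10], that the path occurs at position 2, and that it differs from the text at position 11 in exactly one place.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

g = "ACGACGCTATCGATGCCCTC"
k = 3
prefix = g[:10]
spectrum = {prefix[j:j + k] for j in range(len(prefix) - k + 1)}

path = "CGACGC"  # the path CGA-C-G-C from the slide
# Every k-mer along the path must already be in the spectrum.
assert all(path[j:j + k] in spectrum for j in range(len(path) - k + 1))

ref_pos = g.find(path) + 1       # 1-based position of the earlier copy
current = g[10:10 + len(path)]   # substring starting at position 11
print(ref_pos, current, hamming(path, current))  # 2 CGATGC 1
```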

Page 15: Spectrum-based  de novo  repeat detection in genomic sequences

PHASE 1: FINDHIT()

Algorithm:

Input: a text G of length n
Initialize the empty spectrum S
For i = 1 to n-k+1
    /* we maintain the invariant that S is a spectrum for G[1..i-1] */
    Let x be the k-mer at position i
    If x exists in S, run DetectRepSeq(S, i)
    Insert x into S

Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.
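The FindHit loop translates almost line for line into Python. The sketch below is not the authors' implementation: `find_hits` and its optional `detect` callback are hypothetical names, and the spectrum is kept as a plain set that, at step i, holds the k-mers starting at positions before i.

```python
def find_hits(g, k, detect=None):
    """Scan g left to right; return the 1-based positions whose k-mer was seen before."""
    spectrum = set()
    hits = []
    for i in range(1, len(g) - k + 2):
        x = g[i - 1:i - 1 + k]        # the k-mer at position i
        if x in spectrum:
            hits.append(i)
            if detect is not None:
                detect(spectrum, i)   # stand-in for DetectRepSeq(S, i)
        spectrum.add(x)
    return hits

print(find_hits("ACGACGCTCACCCT", 3))  # [4]
```

On the spectrum example from the earlier slide, the only k-mer seen twice is ACG, so the single hit is its second occurrence at position 4.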

Page 16: Spectrum-based  de novo  repeat detection in genomic sequences

DetectRepSeq(S(18), 18), with k = 3

G = ACGAAGTGATTAACCCTCGACGCGATCC

Ref: positions 1, 2, … of G. Curr: positions 18 to 28.

Curr:
CGA C G C G A T C T

S(18), listed as sorted 3-mers:
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA

Search tree so far: CGA

Page 17: Spectrum-based  de novo  repeat detection in genomic sequences

DetectRepSeq(S(18), 18), next step (same G, Curr and S(18) as on the previous slide)

Search tree so far: CGA-T1

The number after each nucleotide is the count of mismatches against Curr so far.

Page 18: Spectrum-based  de novo  repeat detection in genomic sequences

DetectRepSeq(S(18), 18), complete search tree (same G, Curr and S(18) as on the previous slides)

CGA
  T1-T2-A3*
  A1
    G1-T2-G2-A2-T2-T3*
    C2
      C2-C3*
      G3*

Each node is a feasible extension. The number is the count of mismatches against Curr so far, and * marks a path that is cut off when it reaches 3 mismatches.
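The search in this example can be sketched as a depth-first walk over feasible extensions. The code below is a simplified illustration under stated assumptions, not the SAGRI implementation: the name `detect_rep_seq`, the `max_mismatch=3` cutoff (read off the `3*` nodes), and the choice to return every maximal surviving path are all assumptions made here, and backward extension and the backtracking limit h are left out.

```python
def detect_rep_seq(g, spectrum, i, k, max_mismatch=3):
    """Depth-first search over feasible extensions from 1-based position i.

    Returns (path, mismatches) for every path that cannot be extended
    further without reaching max_mismatch mismatches against the text.
    """
    results = []

    def walk(path, pos, mismatches):
        # pos is the 0-based index in g of the next character to compare
        extended = False
        if pos < len(g):
            for c in "ACGT":
                if path[-(k - 1):] + c not in spectrum:
                    continue              # not a feasible extension
                m = mismatches + (c != g[pos])
                if m >= max_mismatch:
                    continue              # pruned, marked * on the slide
                extended = True
                walk(path + c, pos + 1, m)
        if not extended:
            results.append((path, mismatches))

    walk(g[i - 1:i - 1 + k], i - 1 + k, 0)
    return results

g = "ACGAAGTGATTAACCCTCGACGCGATCC"
k, i = 3, 18
s18 = {g[j:j + k] for j in range(i - 1)}   # k-mers starting at positions 1..17
paths = detect_rep_seq(g, s18, i, k)
print(max(paths, key=lambda r: len(r[0])))  # ('CGAAGTGAT', 2)
```

The longest surviving path, CGAAGTGAT, is the substring at position 2 of G, and it matches the text from position 18 with two mismatches, which agrees with the A1-G1-T2-G2-A2-T2 branch of the tree.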

Page 24: Spectrum-based  de novo  repeat detection in genomic sequences

OTHER DETAILS

Extend backward
Stop backtracking after h steps

Page 25: Spectrum-based  de novo  repeat detection in genomic sequences

VALIDATION PHASE

Decompose the hits into a set of k-mers and index all the locations of these k-mers.
For each pair of locations of a k-mer w in the hits, do a BLAST extension.
  Use an auxiliary data structure to avoid double checking.
Report the pairs whose length exceeds our threshold.
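The validation steps above can be sketched as a seed-and-extend loop. Everything specific in this Python sketch is an assumption made for illustration: the ungapped X-drop extension with +1/-1 scoring stands in for the BLAST extension, the per-diagonal `covered` table is one possible "auxiliary data structure", and the names `extend`, `validate`, `x_drop` and `min_len` are hypothetical. Coordinates are 0-based and intervals are half-open.

```python
from collections import defaultdict
from itertools import combinations

def extend(g, a, b, step, x_drop):
    """Ungapped extension from aligned indices a and b in direction step.

    Scores +1 per match and -1 per mismatch, stops once the score falls more
    than x_drop below the best score seen, and returns the best-scoring length.
    """
    score = best = length = best_len = 0
    while 0 <= a < len(g) and 0 <= b < len(g) and best - score <= x_drop:
        score += 1 if g[a] == g[b] else -1
        length += 1
        if score > best:
            best, best_len = score, length
        a += step
        b += step
    return best_len

def validate(g, hits, k, min_len, x_drop=3):
    """Return (start_a, start_b, length) for repeat pairs of length >= min_len."""
    wanted = {g[p:p + k] for s, e in hits for p in range(s, e - k + 1)}
    index = defaultdict(list)
    for p in range(len(g) - k + 1):
        if g[p:p + k] in wanted:
            index[g[p:p + k]].append(p)

    covered = defaultdict(list)  # diagonal (b - a) -> intervals already extended
    pairs = []
    for locs in index.values():
        for a, b in combinations(locs, 2):
            d = b - a
            if any(s <= a < e for s, e in covered[d]):
                continue  # this seed lies inside a pair that was already extended
            left = extend(g, a - 1, b - 1, -1, x_drop)
            right = extend(g, a + k, b + k, +1, x_drop)
            start, length = a - left, left + k + right
            covered[d].append((start, start + length))
            if length >= min_len:
                pairs.append((start, start + d, length))
    return pairs

unit = "CAGTCGATGCTA"
g = "TT" + unit + "GGAGG" + unit + "AA"
print(validate(g, hits=[(19, 31)], k=4, min_len=10))  # [(2, 19, 12)]
```

In the toy input, the nine seeds inside the repeated unit all lie on the same diagonal, so only the first one is extended and the other eight are skipped by the `covered` check.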

Page 26: Spectrum-based  de novo  repeat detection in genomic sequences

ANALYSIS

Page 27: Spectrum-based  de novo  repeat detection in genomic sequences

ANALYSIS

How to find most repeats?
  Avoid false negatives
How to get better speed?
  Avoid false positives

Page 28: Spectrum-based  de novo  repeat detection in genomic sequences

HOW DO WE CHOOSE K? (1)

If k is too big, the k-mer is too specific and we may miss some repeats.

If k is too small, the k-mer cannot help us differentiate repeats from non-repeats.

For repeats of length 50 and similarity > 0.9, we found that $k \ge \log_4 n + 2$ is good enough.

Page 29: Spectrum-based  de novo  repeat detection in genomic sequences

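HOW DO WE CHOOSE K? (2)

Consider the event that a random k-mer matches one of n chosen k-mers.

Pr(a k-mer re-occurs by chance in a sequence of length n) (analogous to throwing n balls into $4^k$ bins):

$$1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$$

We require $1 - \exp(-n/4^k) \le \epsilon_1$. Since $1 - \exp(-x) \approx x$ for small x, this gives $k \ge \log_4 n + \log_4(1/\epsilon_1)$. If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.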
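The bound on k is easy to evaluate numerically. The following Python sketch uses hypothetical helper names and the approximate criterion n / 4^k <= eps from the derivation above; the example value n = 4^10 is chosen here only because it makes log_4 n an integer.

```python
import math

def reoccurrence_probability(k, n):
    """1 - (1 - 4**-k)**n, computed in a numerically stable way."""
    return -math.expm1(n * math.log1p(-(4.0 ** -k)))

def smallest_k(n, eps):
    """Smallest k with n / 4**k <= eps, i.e. k >= log4(n) + log4(1/eps)."""
    k = 1
    while n / 4 ** k > eps:
        k += 1
    return k

n = 4 ** 10                      # about one million positions
k = smallest_k(n, 1 / 16)
print(k)                         # 12, i.e. log4(n) + 2
print(reoccurrence_probability(k, n) < 1 / 16)  # True
```

The loop avoids floating-point logarithms on purpose, so that exact powers of 4 do not get rounded up to the next integer.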

Page 30: Spectrum-based  de novo  repeat detection in genomic sequences

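THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (1)

Consider a pair of repeats of length L with m mismatches.

The probability of a preserved k-mer in the repeat is

$$1 - M \Big/ \binom{L}{m}$$

where M is the number of nonnegative integer solutions to

$$x_1 + x_2 + \cdots + x_{m+1} = L - m$$

subject to

$$0 \le x_1, x_2, \ldots, x_{m+1} \le k - 1.$$

[Figure: a repeat of length L with the mismatches marked X; x_1, x_2, …, x_{m+1} are the lengths of the matching stretches between consecutive mismatches.]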

Page 31: Spectrum-based  de novo  repeat detection in genomic sequences

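THE OCCURRENCE OF FALSE NEGATIVE (MISSED REPEAT) (2)

It is easy to see that M is the coefficient of $x^{L-m}$ in

$$(1 + x + x^2 + \cdots + x^{k-1})^{m+1} = (1 - x^k)^{m+1} (1 - x)^{-(m+1)}.$$

Hence

$$M = \sum_{j=0}^{\lfloor (L-m)/k \rfloor} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}.$$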
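The closed form for M can be checked against a direct enumeration for small parameters. This Python sketch assumes the inclusion-exclusion sum M = sum_j (-1)^j C(m+1, j) C(L - jk, m) over j = 0..floor((L-m)/k), which is what expanding the generating function gives; the function names are illustrative only.

```python
from itertools import combinations
from math import comb

def count_no_preserved(L, m, k):
    """M: placements of m mismatches in L positions with every match run shorter than k."""
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range((L - m) // k + 1))

def preserved_probability(L, m, k):
    """Probability that at least one k-mer survives: 1 - M / C(L, m)."""
    return 1 - count_no_preserved(L, m, k) / comb(L, m)

def brute_force(L, m, k):
    """Same count as count_no_preserved, by enumerating all mismatch placements."""
    count = 0
    for mism in combinations(range(L), m):
        bounds = (-1,) + mism + (L,)
        runs = [b - a - 1 for a, b in zip(bounds, bounds[1:])]
        if all(r < k for r in runs):
            count += 1
    return count

print(count_no_preserved(12, 3, 4), brute_force(12, 3, 4))  # 20 20
print(preserved_probability(50, 5, 12))
```

The brute-force version is exponential in m and is only there as a cross-check; the closed form is what one would use for realistic L.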

Page 32: Spectrum-based  de novo  repeat detection in genomic sequences

CRITERION FOR PATH TERMINATION (1)

Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.

Then, the pruning strategy is length dependent. If the length of the string is r, we allow a number of mismatches that depends on r.

Page 33: Spectrum-based  de novo  repeat detection in genomic sequences

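CRITERION FOR PATH TERMINATION (2)

Let q be the mismatch probability and r be the length of the string.

Prob. that a string has s mismatches = $P_q(s)$ [equation garbled in the transcript: a sum over j, starting at j = s, of binomial terms in q, 1 − q, r − 2 and j].

For a threshold (say, 0.01), we set the number of mismatches allowed for length r to max{ s : 2 ≤ s ≤ r − 2, $P_q(s)$ > threshold } + 2.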

Page 34: Spectrum-based  de novo  repeat detection in genomic sequences

CONTROL OF FALSE POSITIVES (1)

Two typical cases [figure not preserved in the transcript].

The probability of (case 1)/(case 2) is 2*4^- [exponent not preserved in the transcript].

P(case 1 or case 2) is small.

For example: 4 errors, q = 0.1, k = 12, P(case 1) = 1.77 × 10^-8.

Page 35: Spectrum-based  de novo  repeat detection in genomic sequences

EVALUATION

Comparison with other programs

Page 36: Spectrum-based  de novo  repeat detection in genomic sequences

PROGRAMS

EulerAlign by Zhang and Waterman
PALS by Edgar and Myers
REPuter by Kurtz et al.
SAGRI

Page 37: Spectrum-based  de novo  repeat detection in genomic sequences

MEASUREMENT

Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.

Shared Repeat Region (SRR): the ratio of the found region to the reference region.
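The two measures can be made concrete with a small Python sketch. The slide does not pin down the details, so the following are assumptions made here: a repeat pair is a tuple of two half-open intervals, "shares more than 50%" means that each copy of the found pair overlaps more than half of the corresponding reference copy, CR counts the reference pairs recovered in this sense, and SRR is the fraction of reference bases covered by any found region. All function names are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def shares_half(found, ref):
    """True if each found copy covers more than 50% of the matching reference copy."""
    return all(2 * overlap(f, r) > r[1] - r[0] for f, r in zip(found, ref))

def count_ratio(found_pairs, ref_pairs):
    """CR: fraction of reference pairs recovered by at least one found pair."""
    hit = sum(any(shares_half(f, r) for f in found_pairs) for r in ref_pairs)
    return hit / len(ref_pairs)

def shared_repeat_region(found_pairs, ref_pairs):
    """SRR: fraction of reference bases that lie inside some found region."""
    ref = {p for pair in ref_pairs for s, e in pair for p in range(s, e)}
    found = {p for pair in found_pairs for s, e in pair for p in range(s, e)}
    return len(ref & found) / len(ref)

reference = [((0, 100), (500, 600)), ((1000, 1050), (2000, 2050))]
found = [((10, 90), (510, 590))]
print(count_ratio(found, reference))           # 0.5
print(shared_repeat_region(found, reference))  # 160 of 300 reference bases
```

Using position sets keeps the sketch short; for genome-scale data an interval-based computation would be the more practical choice.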

Page 38: Spectrum-based  de novo  repeat detection in genomic sequences

SIMULATED DATA

Conclusion from simulated data: the result is consistent with the analysis.

Page 39: Spectrum-based  de novo  repeat detection in genomic sequences

GENOME DATA

M.gen (0.6 Mbp)
  Organism with the smallest genome
  Lives in the primate genital and respiratory tracts
C.tra (1 Mbp)
  Lives inside human cells
A.ful (2.1 Mbp)
  Found in high-temperature oil fields
E.coli (4 Mbp)
  An important bacterium that lives in the lower intestines of mammals
Human chr22, p20M to p21M (1 Mbp)

Page 40: Spectrum-based  de novo  repeat detection in genomic sequences

Use the CR and SRR ratios to measure.

Cross validation
  G/H = 1, H/G < 1: G “outperforms” H
  G/H < 1, H/G = 1: H “outperforms” G
  G/H < 1, H/G < 1: G and H are complementary
  G/H = 1, H/G = 1: G and H are similar

Page 41: Spectrum-based  de novo  repeat detection in genomic sequences
Page 42: Spectrum-based  de novo  repeat detection in genomic sequences


Page 43: Spectrum-based  de novo  repeat detection in genomic sequences

QUESTIONS AND ANSWERS

Page 44: Spectrum-based  de novo  repeat detection in genomic sequences
Page 45: Spectrum-based  de novo  repeat detection in genomic sequences

H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008