a robust framework for detecting structural variations
A Robust Framework for Detecting Structural Variations
February 6, 2008
Seunghak Lee1, Elango Cheran1, and Michael Brudno1
1University of Toronto, Canada
What are structural variations? (1)
10^3 – 10^6 basepair variations in the genome
Insertion: a large contiguous fragment of DNA is inserted
Deletion: a large contiguous fragment of DNA is deleted
Inversion: a large contiguous fragment of DNA is inverted
Translocation: a large contiguous fragment of DNA is moved from one chromosome to another
Copy number variations
What are structural variations? (2)
Various examples of structural variations
Outline
Introduction
  Type of Structural Variations
  Sequencing Approaches to Detect Structural Variations
  Motivation & Research Objectives
Probabilistic Framework for Detecting Structural Variations
  Probabilistic Framework
  Flow of our Framework
  Hierarchical Clustering of Matepairs (2nd phase)
  Choosing a Unique Mapped Location for Each Matepair (3rd phase)
Experiments
  Comparison with Three Previous Studies
  DMBT1 Gene for Deletion
  Centromere and Translocations
Conclusions
Type of Structural Variations (1)
Insertion
[Figure: insertion shown in A relative to the reference (REF)]
Type of Structural Variations (2)
Deletion
[Figure: deletion shown in A relative to the reference (REF)]
Type of Structural Variations (3)
Inversion
[Figure: inversion shown in A relative to the reference (REF), with the 5’ and 3’ ends labeled]
Type of Structural Variations (4)
Translocation
[Figure: translocation between chr1 and chr2]
Sequencing Approaches
1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]
• Mapping matepairs onto the reference genome
• Insertion and deletion: mapped distance inconsistent with the insert size
• Inversion: both reads map in the same orientation
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]
• Proposed a high-throughput, massive paired-end mapping technique
• Described the types of structural variations in more detail
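The mapping signatures listed for the first approach can be sketched as a small classifier. This is an illustration only, not the procedure of Tuzun et al. or Korbel et al.: the function name, the `mean` and `sd` library parameters, and the 3-standard-deviation cutoff are assumptions made for the example.

```python
def classify_matepair(mapped_distance, same_orientation, mean, sd, n_sd=3.0):
    """Label a matepair by its mapping signature on the reference.

    mean, sd and n_sd are hypothetical library parameters, not values
    taken from the slides.
    """
    if same_orientation:
        # Both reads map in the same orientation: inversion signature.
        return "inversion"
    if mapped_distance < mean - n_sd * sd:
        # Reads map closer together than the insert size allows, so the
        # sequenced genome carries extra sequence: insertion signature.
        return "insertion"
    if mapped_distance > mean + n_sd * sd:
        # Reads map farther apart than the insert size allows, so sequence
        # is missing from the sequenced genome: deletion signature.
        return "deletion"
    return "concordant"


print(classify_matepair(52_000, False, mean=40_000, sd=2_000))
```

The direction of the test follows the later slides, where an insertion of size r satisfies μY = s + r (mapped distance shorter than expected) and a deletion satisfies μY = s − r (mapped distance longer than expected).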
Motivation & Research Objectives (1)
Tuzun et al. used scores that combine several factors (e.g. length, identity, and quality of the sequences).
How can we map reads onto the reference genome?
Motivation & Research Objectives (2)
Sequencing-based methods are effective for detecting structural variants, as shown by Tuzun et al. and Korbel et al.
However, each read can have multiple mappings, and previous research relied on a priori mapped locations.
Why not develop a probabilistic model without such assumptions? Ideally, it could also be applied to short reads from NGS machines.
Probabilistic Framework (1)
p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes
We use p(Y) as the basis of our probabilistic framework
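One way to make p(Y) concrete is to estimate it empirically from the mapped distances of uniquely mapped matepairs. The sketch below is a hypothetical helper (the class and method names are not from the slides) that stores the sorted distances and answers interval-probability queries.

```python
import bisect
import statistics


class EmpiricalDistribution:
    """Empirical stand-in for p(Y), built from observed mapped distances."""

    def __init__(self, distances):
        self.values = sorted(distances)
        self.mean = statistics.fmean(self.values)

    def prob_between(self, lo, hi):
        """Fraction of observed distances y with lo <= y <= hi."""
        left = bisect.bisect_left(self.values, lo)
        right = bisect.bisect_right(self.values, hi)
        return (right - left) / len(self.values)


# Toy distances, for illustration only.
p_y = EmpiricalDistribution([9, 10, 10, 11, 10])
print(p_y.mean, p_y.prob_between(10, 10))
```

Keeping the values sorted lets each interval query run with two binary searches, which matters when the library holds hundreds of thousands of matepairs.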
Probabilistic Framework (2)
Insertion
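$$\mu_Y = s + r$$
$$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\,P(X_j \mid \mathrm{ins}=r)$$
$$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$
where $\delta = |\mu_Y - (s + r)|$, $s$ = mapped distance
$X_1, X_2$ = matepairs 1 and 2
$Y$ = random variable for mapped distances of “uniquely mapped” matepairs
[Figure: p(Y) with $\mu_Y - \delta$ marked]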
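The insertion probability can be evaluated numerically once a form for p(Y) is fixed. The sketch below assumes, purely for illustration, that p(Y) is approximately normal with mean `mu` and standard deviation `sigma`; the slides do not state the shape of p(Y), and the function name is hypothetical.

```python
from statistics import NormalDist


def prob_given_insertion(s, r, mu, sigma):
    """P(X | ins = r) = 1 - P(mu - delta <= Y <= mu + delta).

    s is the mapped distance and r the hypothesized insertion size.
    The normal shape of p(Y) is an assumption of this sketch.
    """
    delta = abs(mu - (s + r))
    p_y = NormalDist(mu, sigma)
    inside = p_y.cdf(mu + delta) - p_y.cdf(mu - delta)
    return 1.0 - inside


# A matepair mapped 2,000 bp short of the mean, tested against r = 2,000.
print(prob_given_insertion(s=38_000, r=2_000, mu=40_000, sigma=2_000))
```

The probability is 1 when s + r matches the mean exactly (δ = 0) and shrinks toward 0 as the hypothesized size explains the mapped distance less well.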
Probabilistic Framework (3)
Deletion
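$$\mu_Y = s - r$$
$$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\,P(X_j \mid \mathrm{del}=r)$$
$$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$
where $\delta = |\mu_Y - (s - r)|$, $s$ = mapped distance
[Figure: p(Y) with $\mu_Y - \delta$ marked]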
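Because the joint probability factorizes over matepairs, two matepairs can be scored against a shared deletion size r. The sketch below scans candidate sizes and keeps the one with the highest joint probability. The scan itself, the candidate grid, the function names, and the normal form of p(Y) are assumptions for illustration, not steps stated in the slides.

```python
from statistics import NormalDist


def prob_given_deletion(s, r, p_y):
    """P(X | del = r) = 1 - P(mu - delta <= Y <= mu + delta)."""
    delta = abs(p_y.mean - (s - r))
    return 1.0 - (p_y.cdf(p_y.mean + delta) - p_y.cdf(p_y.mean - delta))


def best_deletion_size(s_i, s_j, p_y, candidate_sizes):
    """Candidate r maximizing P(Xi, Xj | del = r) = P(Xi | r) * P(Xj | r)."""
    return max(
        candidate_sizes,
        key=lambda r: prob_given_deletion(s_i, r, p_y)
        * prob_given_deletion(s_j, r, p_y),
    )


p_y = NormalDist(mu=40_000, sigma=2_000)  # assumed shape of p(Y)
print(best_deletion_size(45_000, 46_000, p_y, range(0, 10_001, 500)))
```

With mapped distances of 45,000 and 46,000 against a mean of 40,000, the best shared size falls between the two individual excesses, which is the behavior a cluster of matepairs supporting one deletion should show.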
Probabilistic Framework (4)
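Inversion
$$c - d = s(X_1) - s(X_2)$$
$$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$$
where $\delta = |\mu_{|Y_1 - Y_2|} - (c - d)|$, $s(X_i)$ = insert size of $X_i$
[Figure: p(|Y1-Y2|) with $\mu_{|Y_1 - Y_2|} - \delta$ marked]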
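For inversions the reference distribution is over |Y1 − Y2| rather than Y. A minimal sketch, assuming that p(|Y1 − Y2|) is estimated from all pairwise differences of observed mapped distances (the slides do not say how it is obtained), and using hypothetical function names:

```python
from itertools import combinations
from statistics import fmean


def abs_difference_samples(distances):
    """All |y1 - y2| over pairs of observed mapped distances."""
    return sorted(abs(a - b) for a, b in combinations(distances, 2))


def prob_given_inversion(c_minus_d, diffs):
    """1 - P(mu - delta <= |Y1 - Y2| <= mu + delta), estimated from diffs."""
    mu = fmean(diffs)
    delta = abs(mu - c_minus_d)
    inside = sum(1 for x in diffs if mu - delta <= x <= mu + delta)
    return 1.0 - inside / len(diffs)


# Toy mapped distances, for illustration only.
diffs = abs_difference_samples([10, 12, 14, 20])
print(diffs, prob_given_inversion(7, diffs))
```

Enumerating every pair is quadratic in the number of matepairs, so a real data set would need subsampling or a fitted distribution; the exhaustive version is kept here because it is the simplest correct estimate.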
Probabilistic Framework (5)
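Translocation
$$(c - a) - (d - b) = s(X_1) - s(X_2)$$
$$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta)$$
where $\delta = |\mu_{|Y_1 - Y_2|} - ((c - a) - (d - b))|$, $s(X_i)$ = insert size of $X_i$
[Figure: p(|Y1-Y2|) with $\mu_{|Y_1 - Y_2|} - \delta$ marked]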
Flow of our Framework (1)
1. Preprocessing step
• Get top K mappings
• Remove short mappings
• Make all possible combinations of mappings
• Discard matepairs consistent with insert size
• Remove invalid strands (-,+)
• Remove very similar mappings
• Mask repeats
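Several of the preprocessing steps above can be sketched as a filter over candidate mappings. This is a hypothetical illustration: the `Hit` record, its fields, the ordering of the steps, and the concordance test (same chromosome, (+,-) strands, distance within `[lo, hi]`) are assumptions, and repeat masking and near-duplicate removal are left out.

```python
from itertools import product
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str
    pos: int
    strand: str    # "+" or "-"
    length: int    # aligned length in bases
    score: float


def top_hits(hits, k, min_length):
    """Keep the K best-scoring mappings, then drop the short ones."""
    best = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    return [h for h in best if h.length >= min_length]


def discordant_pairs(left_hits, right_hits, k, min_length, lo, hi):
    """All combinations of mappings that are not explained by the insert size."""
    pairs = []
    for a, b in product(top_hits(left_hits, k, min_length),
                        top_hits(right_hits, k, min_length)):
        if (a.strand, b.strand) == ("-", "+"):
            continue  # strand combination treated as invalid
        concordant = (a.chrom == b.chrom
                      and (a.strand, b.strand) == ("+", "-")
                      and lo <= abs(b.pos - a.pos) <= hi)
        if not concordant:
            pairs.append((a, b))
    return pairs


left = [Hit("chr1", 1_000, "+", 500, 0.9), Hit("chr1", 5_000, "-", 500, 0.8)]
right = [Hit("chr1", 41_000, "-", 500, 0.9), Hit("chr3", 700, "+", 500, 0.5)]
print(len(discordant_pairs(left, right, k=2, min_length=100, lo=36_000, hi=44_000)))
```

Only the combinations that survive this filter need to be scored by the probabilistic model, which keeps the later clustering step small.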
Flow of our Framework (2)
2. Clustering
• Do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation)
3. Finding structural variations
• Find initial configuration in greedy manner
• Parameter learning for the objective function
• Find a local optimum configuration
Hierarchical Clustering (1)
(ex) Insertion
[Figure: matepairs X1 and X2 from A mapped onto REF, forming the cluster C={X1, X2}]
• Cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance = D(X1, X2) = - ln P(X1, X2|C)
Hierarchical Clustering (2)
Generally, linkage distance is given by,
We do hierarchical clustering for each structural variation.
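The clustering phase can be sketched as plain agglomerative clustering with D = −ln P as the pairwise distance. The general linkage formula is not recoverable from the slide, so the sketch below assumes complete linkage (the largest pairwise distance between two clusters) and a hypothetical stopping threshold `max_distance`; `pair_prob` stands in for P(Xi, Xj | C) and is supplied by the caller.

```python
import math
from itertools import combinations


def linkage_distance(p):
    """D = -ln P; a probability of 0 gives an infinite distance."""
    return math.inf if p <= 0.0 else -math.log(p)


def cluster_matepairs(items, pair_prob, max_distance):
    """Repeatedly merge the two closest clusters until none is close enough."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: worst pairwise distance across the two clusters.
            d = max(linkage_distance(pair_prob(a, b))
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy model: probability decays with the difference in mapped distance.
toy_prob = lambda a, b: math.exp(-abs(a - b) / 10)
print(cluster_matepairs([100, 110, 500, 505], toy_prob, max_distance=2.0))
```

Running one such clustering per variation type (insertion, deletion, inversion, translocation) only requires swapping in the matching `pair_prob`.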
Choosing a Unique Mapped Location (1)
Each matepair must be assigned to a unique pair of BLAT hits and a unique cluster.
[Figure: reads R1 and R2 with BLAT hits 1 to 5, candidate mappings M1,4, M2,4 and M3,5, and clusters C1 and C2]
Choosing a Unique Mapped Location (2)
We define an objective function J(ω)
ƒ1 corresponds to BLAT hit scores
ƒ2 corresponds to the probability
ƒ3 corresponds to the size of clusters
Choosing a Unique Mapped Location (3)
Find the initial configuration greedily
Learn parameters for the objective function J(ω). We used hill-climbing search to maximize the log likelihood of P(ω|λi)
Finally, find a configuration that locally maximizes J(ω) using hill-climbing search
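The greedy start and the final hill-climbing search can be sketched as follows. The exact form of J(ω) is not recoverable from the slides, so this example assumes a weighted sum of the three features (BLAT score, probability, cluster size) with fixed, hypothetical weights. Parameter learning is omitted, and all names and the data layout are illustrative.

```python
from collections import Counter


def objective(config, candidates, weights):
    """Assumed form of J: weighted sum of three per-matepair features."""
    w1, w2, w3 = weights
    sizes = Counter(candidates[m][c]["cluster"] for m, c in config.items())
    total = 0.0
    for m, c in config.items():
        cand = candidates[m][c]
        total += (w1 * cand["blat"] + w2 * cand["prob"]
                  + w3 * sizes[cand["cluster"]])
    return total


def greedy_start(candidates):
    """Pick, for each matepair, the candidate with the best BLAT score."""
    return {m: max(range(len(cs)), key=lambda c: cs[c]["blat"])
            for m, cs in candidates.items()}


def hill_climb(config, candidates, weights):
    """Change one matepair's choice at a time while J keeps improving."""
    config = dict(config)
    best = objective(config, candidates, weights)
    improved = True
    while improved:
        improved = False
        for m in list(config):
            for c in range(len(candidates[m])):
                trial = {**config, m: c}
                score = objective(trial, candidates, weights)
                if score > best:
                    config, best, improved = trial, score, True
    return config, best


candidates = {
    "m1": [{"blat": 0.9, "prob": 0.2, "cluster": "A"},
           {"blat": 0.8, "prob": 0.9, "cluster": "B"}],
    "m2": [{"blat": 0.7, "prob": 0.8, "cluster": "B"}],
}
start = greedy_start(candidates)
print(hill_climb(start, candidates, weights=(1.0, 1.0, 1.0)))
```

In the toy data the greedy start places m1 in cluster A on BLAT score alone, and the search then moves it to cluster B, where the higher probability and larger cluster raise J. The search stops at a local optimum, as the slides note.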
P-values
We assign p-values to give confidence to our clusters.
The p-value is the probability that the cluster is generated by the reference genome rather than by structural variants:
$$\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$$
where E = expected number of matepairs mapped to the location of the cluster
P-values depend on the length of the cluster, the number of matepairs involved, and their probabilities.
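The p-value formula translates directly into a few lines of code. In this sketch the function name is hypothetical, and E is rounded to an integer because the binomial coefficient needs one; how the authors handle a non-integer E is not stated.

```python
import math


def cluster_p_value(expected_matepairs, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null).

    null_probs holds P(X_i | C_null) for each matepair in the cluster.
    """
    e = round(expected_matepairs)  # integer E is an assumption of this sketch
    n = len(null_probs)
    return math.comb(e, n) * math.prod(null_probs)


# Three matepairs, ten expected at this location (toy numbers).
print(cluster_p_value(10, [0.01, 0.02, 0.01]))
```

The binomial coefficient grows with the expected coverage E, which is how the length of the cluster enters: a longer region attracts more matepairs by chance, so the same evidence is less surprising there.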
Clustering Results
We started with ~360,000 matepairs
• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at ± 2)
Through the clustering procedure above (FDR 0.2) we found
• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)
Example Deletion
Agreement with Previous Results
We have compared our clusters with the Tuzun, Levy, Korbel, and DGV-All datasets.

| Type | Total | Tuzun | Levy | Korbel | DGV-All |
|---|---|---|---|---|---|
| Insertion | 82(53) | 12(7)/139 | 6(5)/319 | 0(0)/34 | 24(13)/2216 |
| Deletion | 175(135) | 21(17)/102 | 25(23)/344 | 45(36)/742 | 82(63)/4697 |
| Inversion | 103(24) | 34(12)/56 | N/A | 42(8)/105 | 60(15)/164 |

All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations.
The DMBT1 deletion was also found in the Tuzun et al. dataset (but not the Levy dataset).
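The slides do not describe how the Monte Carlo simulations were set up. One common design, shown here only as an assumed sketch, keeps the lengths of the predicted intervals, places them at random positions, and counts how often the random overlap with a known set is at least as large as the observed one. The single-chromosome coordinate model, the function names, and the seed are all hypothetical.

```python
import random


def count_overlaps(calls, known):
    """Number of calls (start, end) overlapping at least one known interval."""
    return sum(any(s < ke and ks < e for ks, ke in known) for s, e in calls)


def monte_carlo_p_value(calls, known, genome_length, n_sim=1000, seed=0):
    """Empirical p-value for the observed overlap under random placement."""
    rng = random.Random(seed)
    observed = count_overlaps(calls, known)
    hits = 0
    for _ in range(n_sim):
        shuffled = []
        for s, e in calls:
            length = e - s
            start = rng.randrange(0, genome_length - length)
            shuffled.append((start, start + length))
        if count_overlaps(shuffled, known) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_sim + 1)


calls = [(100, 200), (1_000, 1_100), (5_000, 5_100)]
known = [(150, 160), (1_050, 1_060), (5_050, 5_060)]
print(monte_carlo_p_value(calls, known, genome_length=1_000_000))
```

With 1,000 simulations the smallest reportable value is 1/1001, so a claim of p < 0.001 needs more than 1,000 rounds under this design.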
Translocations
A large fraction (69%) of the translocations were close to the centromeres
She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart.
These could also be mis-assemblies.
Distance to centromere

| | <10^6 | (10^6, 4.5*10^6] | >4.5*10^6 |
|---|---|---|---|
| <10^6 | 22 | 6 | 10 |
| (10^6, 4.5*10^6] | | 0 | 3 |
| >4.5*10^6 | | | 14 |
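The 69% figure can be checked against the table, assuming that its rows and columns give the distance bins of the two ends of each translocation and that "close to the centromeres" means at least one end lies within 10^6 of a centromere. Both readings are inferences from the layout, not statements from the slides.

```python
near, mid, far = "<1e6", "1e6-4.5e6", ">4.5e6"

# Upper-triangular counts from the table: (bin of one end, bin of the other).
counts = {
    (near, near): 22, (near, mid): 6, (near, far): 10,
    (mid, mid): 0, (mid, far): 3,
    (far, far): 14,
}

total = sum(counts.values())
close = sum(n for bins, n in counts.items() if near in bins)
fraction = close / total

# Rate predicted by She et al. times the stated divergence of the two donors.
expected_events = 200 * 0.2

print(total, close, round(100 * fraction), expected_events)
```

Under these assumptions the table sums to the 55 translocation clusters reported earlier, 38 of them have an end near a centromere, and the predicted rate gives roughly 40 events over 0.2 million years, the same order of magnitude as the observed count.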
Conclusions
Introduced a novel framework for finding structural variants that does not rely on a priori mapping of matepairs to genomic positions.
Introduced a probabilistic model for structural variants
Isolated 82 insertions, 175 deletions, and 103 inversions between the reference public human genome and the JCVI donor.
These results show statistically significant correlation with previous variation studies
Isolated 194 novel structural variants that do not overlap any event from the database of genomic variants (of these 121 have support from a uniquely mapped matepair)
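The counts of novel variants are consistent with the DGV-All column of the "Agreement with Previous Results" table, assuming each entry there reads as overlapping clusters (overlapping clusters with unique support) / events in the dataset. That reading is an inference; the check below simply subtracts the overlapping clusters from the totals.

```python
# (clusters, clusters supported by a uniquely mapped matepair)
ours = {"insertion": (82, 53), "deletion": (175, 135), "inversion": (103, 24)}
in_dgv = {"insertion": (24, 13), "deletion": (82, 63), "inversion": (60, 15)}

novel = sum(ours[t][0] - in_dgv[t][0] for t in ours)
novel_unique = sum(ours[t][1] - in_dgv[t][1] for t in ours)

print(novel, novel_unique)
```

The subtraction reproduces both figures in the conclusion: 194 clusters with no DGV overlap, of which 121 have support from a uniquely mapped matepair.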