genome assembly, a universal theoretical framework ...theoretical framework: unifying and...

57
Genome assembly, a universal theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt , Alexandru I. Tomescu and Elia C. Zirondelli Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 0 / 16

Upload: others

Post on 27-Jul-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Genome assembly, a universaltheoretical framework: unifying andgeneralizing the safe and complete

algorithmsMassimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt,

Alexandru I. Tomescu and Elia C. Zirondelli

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 0 / 16

Page 2: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Outline

1 Genome Assembly & Safety

2 Current Safe and Complete Algorithms

3 Unifying the Theory: the Hydrostructure (Cairo et al., 2020a)

4 Hydrostructure Algorithms

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 1 / 16

Page 3: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Genome Assembly & Safety

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021

Page 4: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Models of genome assembly

Genome graph:• Used by almost all modern

assemblers• Each edge is part of the genome• The genome is a walk

Additionalinformation:• There is a single circular genome

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 2 / 16

Page 5: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Models of genome assembly

Genome graph:• Used by almost all modern

assemblers• Each edge is part of the genome• The genome is a walk

Additional information:• There is a single circular genome

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 2 / 16

Page 6: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Models of genome assembly

Genome graph:• Used by almost all modern

assemblers• Each edge is part of the genome• The genome is a walk

Additional information:• There are multiple circular

genomes

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 2 / 16

Page 7: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Finding the true genome

Challenge:• Infinitely many possible solutions

What is the correct one?

Solution:• Find only safe subwalks

of the true genome• But be as complete as possible

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 3 / 16

Page 8: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Finding the true genome

Challenge:• Infinitely many possible solutions

What is the correct one?

Solution:• Find only safe subwalks

of the true genome• But be as complete as possible

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 3 / 16

Page 9: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 10: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AAT ATG TGC GCA CAG AGT

GTATAT

GTCTCACAT

GTT

GTG

TTA

TGA

TAC

GACCGA

CGT

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

ACG

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 11: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AAT ATG TGC GCA CAG AGT

GTATAT

GTCTCACAT

GTT

GTG

TTA

TGA

TAC

GACCGA

CGT

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

ACG

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 12: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AAT ATG TGC GCA CAG AGT

GTATAT

GTCTCACAT

GTT

GTG

TTA

TGA

TAC

GACCGA

CGT

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

ACG

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 13: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AAT ATG TGC AGT

TAT

TCACATGTG TGA GAC

CGA

CGT

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

ACG

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 14: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Accounting for practical issues

AAT ATG TGC GCA CAG AGT

GTATAT

GTCTCACAT

GTT

GTG

TTA

TGA

TAC

GACCGA

CGT

AATGCAGTATGCAGTCATGCAGTTACGACGTAATGCAGTATGCAGTCATGCAGTGACGACGT

Father:Mother:

ACG

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 4 / 16

Page 15: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Current Safe and Complete Algorithms

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021

Page 16: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Properties of currentsafe and complete algorithms

Single circular, Omnitigs:characterisation (Tomescu and Medvedev, 2017)optimal O(mn) (Cairo et al., 2019)output optimal O(m + n + o) (Cairo et al., 2020b)

Multi circular:O(m2n) (Acosta, Mäkinen, and Tomescu, 2018)

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 5 / 16

Page 17: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Properties of currentsafe and complete algorithms

Single circular: O(mn), O(m + n + o)Multi circular: O(m2n)

But many other models are relevant:• k solution walks• linear models• partial coverage• partial visibility

Unified theory for all combinations of these?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 5 / 16

Page 18: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unifying the Theory: the Hydrostructure(Cairo et al., 2020a)

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021

Page 19: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Defined on a strongly connected graph G := V ∪ E and a walk W ⊆ G

• In this talk: only non-trivial hearts of paths (without repetitions of nodes/edges)

Non-Trivial HeartLeft Wing Right Wing

Left Wing Right WingTrivial Heart

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 6 / 16

Page 20: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Defined on a strongly connected graph G := V ∪ E and a walk W ⊆ G• In this talk: only non-trivial hearts of paths (without repetitions of nodes/edges)

Non-Trivial HeartLeft Wing Right Wing

Left Wing Right WingTrivial Heart

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 6 / 16

Page 21: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Decomposes the graphinto four regions

• The connectivity betweenregions is restricted• If Vapor is a path, then W

is a Sea-Cloud bottleneck• For any model: if each

solution walk goes fromSea to Cloud, then W issafe

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 7 / 16

Page 22: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Decomposes the graphinto four regions• The connectivity between

regions is restricted

• If Vapor is a path, then Wis a Sea-Cloud bottleneck• For any model: if each

solution walk goes fromSea to Cloud, then W issafe

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 7 / 16

Page 23: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Decomposes the graphinto four regions• The connectivity between

regions is restricted• If Vapor is a path, then W

is a Sea-Cloud bottleneck

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 7 / 16

Page 24: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Unification: the hydrostructureof a walk W

• Decomposes the graphinto four regions• The connectivity between

regions is restricted• If Vapor is a path, then W

is a Sea-Cloud bottleneck• For any model: if each

solution walk goes fromSea to Cloud, then W issafe

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 7 / 16

Page 25: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Definition of the hydrostructure

• W is a walkfrom edge f to edge g• R+(W ) is everything

reachable from f in G \ g

• R−(W ) is everythingbackwards-reachable fromg in G \ f• W is a Sea-Cloud

bottleneck if and only ifVapor is a path

f

g

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 8 / 16

Page 26: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Definition of the hydrostructure

• W is a walkfrom edge f to edge g• R+(W ) is everything

reachable from f in G \ g• R−(W ) is everything

backwards-reachable fromg in G \ f

• W is a Sea-Cloudbottleneck if and only ifVapor is a path

f

g

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 8 / 16

Page 27: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Definition of the hydrostructure

• W is a walkfrom edge f to edge g• R+(W ) is everything

reachable from f in G \ g• R−(W ) is everything

backwards-reachable fromg in G \ f

• W is a Sea-Cloudbottleneck if and only ifVapor is a path

f

g

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 8 / 16

Page 28: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Definition of the hydrostructure

• W is a walkfrom edge f to edge g• R+(W ) is everything

reachable from f in G \ g• R−(W ) is everything

backwards-reachable fromg in G \ f• W is a Sea-Cloud

bottleneck if and only ifVapor is a path

f

g

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 8 / 16

Page 29: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)

• g is the only way to exit R+(W ) (by Def. 1)• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X• A minimal one is in Vapor⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 30: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)• g is the only way to exit R+(W ) (by Def. 1)

• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X• A minimal one is in Vapor⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 31: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)• g is the only way to exit R+(W ) (by Def. 1)• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X

• A minimal one is in Vapor⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 32: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)• g is the only way to exit R+(W ) (by Def. 1)• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X• A minimal one is in Vapor

⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 33: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)• g is the only way to exit R+(W ) (by Def. 1)• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X• A minimal one is in Vapor⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path

⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 34: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

The bottleneck property

Definition 1: R+(W ) := {x ∈ G | f → x ∈ G \ g}Definition 2: R−(W ) := {x ∈ G | x → g ∈ G \ f}Definition 3: W is a Sea-Cloud bottleneck if each walk from Sea toCloud has W as subwalkLemma: W is a Sea-Cloud bottleneck ⇐⇒ Vapor is a path• f is the only way to enter R−(W ) (by Def. 2)• g is the only way to exit R+(W ) (by Def. 1)• ⇒ Sea-Cloud walks have a head(f )-tail(g) subwalk X• A minimal one is in Vapor⇒ If W Sea-Cloud bottleneck, then fXg = W , so Vapor is a path⇐ If Vapor is a path, then fXg = W , so W is Sea-Cloud bottleneck

f

g

?

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 9 / 16

Page 35: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe

• Simplifies single circular• Simplifies multi circular• Simplifies single/multi

linear• Simplifies subset covering• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 36: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe• Simplifies single circular

• Simplifies multi circular• Simplifies single/multi

linear• Simplifies subset covering• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 37: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe• Simplifies single circular• Simplifies multi circular

• Simplifies single/multilinear• Simplifies subset covering• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 38: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe• Simplifies single circular• Simplifies multi circular• Simplifies single/multi

linear

• Simplifies subset covering• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 39: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe• Simplifies single circular• Simplifies multi circular• Simplifies single/multi

linear• Simplifies subset covering

• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 40: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety problems asSea-Cloud connectivity problems

• For any model: if eachsolution walk goes fromSea to Cloud, then W issafe• Simplifies single circular• Simplifies multi circular• Simplifies single/multi

linear• Simplifies subset covering• Simplifies . . .

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 10 / 16

Page 41: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Hydrostructure Algorithms

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021

Page 42: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:verification

O(m) construction:• O(m) verification algorithms

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 11 / 16

Page 43: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 44: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 45: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 46: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 47: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)

Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 48: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic properties:enumeration

O(m) incremental construction:• O(m + o) enumeration of safe subwalks of a given bottleneck walk

(O(mn) for linear 2 ≤ k ≤ O(n), subset visibility)Safe walks are subwalks of omnitigs:• Omnitig enumeration takes O(mn) time and only O(n) of them are bottlenecks• Optimal O(mn + o) enumeration of all maximal safe walks

(O(m2n) for linear 2 ≤ k ≤ O(n))Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 12 / 16

Page 49: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Recap/Conclusion

• The hydrostructure unifies the theory of safe and complete genome assembly• The hydrostructure yields optimal algorithms for most model combinations

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 13 / 16

Page 50: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

References I

Acosta, Nidia Obscura, Veli Mäkinen, and Alexandru I. Tomescu (2018). “A safe andcomplete algorithm for metagenomic assembly”. In: Algorithms for MolecularBiology 13.1, 3:1–3:12. DOI: 10.1186/s13015-018-0122-7. URL:https://doi.org/10.1186/s13015-018-0122-7.

Cairo, Massimo et al. (2019). “An Optimal O(nm) Algorithm for Enumerating AllWalks Common to All Closed Edge-covering Walks of a Graph”. In: ACM Trans.Algorithms 15.4, 48:1–48:17. DOI: 10.1145/3341731. URL:https://doi.org/10.1145/3341731.

Cairo, Massimo et al. (2020a). “Genome assembly, a universal theoreticalframework: unifying and generalizing the safe and complete algorithms”. In: arXivpreprint arXiv:2011.12635.

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 14 / 16

Page 51: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

References II

Cairo, Massimo et al. (2020b). “Genome assembly, from practice to theory: safe,complete and linear-time”. In: arXiv preprint arXiv:2002.10498.

Tomescu, Alexandru I. and Paul Medvedev (2017). “Safe and complete contigassembly through omnitigs”. In: Journal of Computational Biology 24.6.Preliminary version appeared in RECOMB 2016., pp. 590–602.

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 15 / 16

Page 52: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Thank you for attending!Questions?

f

g

• The hydrostructure unifies the theory of safe and complete genome assembly• The hydrostructure yields optimal algorithms for most model combinations

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 16 / 16

Page 53: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety in the single circular model

Unitigs

Y-to-V Unitigs(excerpt)

Omnitigs (excerpt)(Tomescu and

Medvedev, 2017)

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 17 / 16

Page 54: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety in the single circular model

Unitigs Y-to-V Unitigs(excerpt)

Omnitigs (excerpt)(Tomescu and

Medvedev, 2017)

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 17 / 16

Page 55: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Safety in the single circular model

Unitigs Y-to-V Unitigs(excerpt)

Omnitigs (excerpt)(Tomescu and

Medvedev, 2017)Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 17 / 16

Page 56: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic Properties

Verification:• O(m) verification algorithms

(O(mn) for linear2 ≤ k ≤ O(n) subsetvisibility)

Enumeration:• Optimal O(mn + o)

enumeration ofall maximal safe walks(O(m2n) for linear2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 18 / 16

Page 57: Genome assembly, a universal theoretical framework ...theoretical framework: unifying and generalizing the safe and complete algorithms Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian

Algorithmic Properties

Verification:• O(m) verification algorithms

(O(mn) for linear2 ≤ k ≤ O(n) subsetvisibility)

Enumeration:• Optimal O(mn + o)

enumeration ofall maximal safe walks(O(m2n) for linear2 ≤ k ≤ O(n))

Genome assembly, a universal theoretical framework / Sebastian Schmidt February 11, 2021 18 / 16