optimization problems for polymorphisms of single nucleotides
Post on 27-Dec-2015
Optimization Problems for Polymorphisms of Single Nucleotides
Polymorphisms
A polymorphism is a feature
- common to everybody
- not identical in everybody
- the possible variants (alleles) are just a few
E.g. think of eye-color
Or blood-type for a feature not visible from outside
At DNA level, a polymorphism is a sequence of nucleotides varying in a population.
The shortest possible sequence has only 1 nucleotide, hence
Single Nucleotide Polymorphism (SNP)
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
- SNPs are the predominant form of human variation
- Used for drug design, disease studies, forensics, evolutionary studies...
- On average one every 1,000 bases
- Multimillion dollar SNP consortium project
- Goal: associate SNPs (or groups of SNPs) with genetic diseases
- 1st step: build maps of several thousand SNPs
HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites
ag at
ct ag
ct cg
at at
ag cg
ag cg
ag ag
GENOTYPE: “union” of 2 haplotypes
OcE
EE
OaOg
OaE OaOt
EOg
OgE
CHANGE OF SYMBOLS: each SNP has only two values in a population (bio).
Call them 1 and O. Also, call * the fact that a site is heterozygous
HAPLOTYPE: string over {1, O}
GENOTYPE: string over {1, O, *}
1o 11
o1 1o
o1 oo
11 11
1o oo
1o oo
1o 1o
o*
**
*o
1* 11
*o
*o
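The change of symbols can be sketched in a few lines of Python. This is only an illustration: the function name `genotype` is hypothetical, and the lowercase `o` follows the way the haplotype pairs are written on the slide.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes over {1, o} into a genotype over {1, o, *}.

    A site keeps its symbol when both chromosomes agree (homozygous)
    and becomes '*' when they differ (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


# The seven haplotype pairs listed above, one pair per individual.
pairs = [("1o", "11"), ("o1", "1o"), ("o1", "oo"), ("11", "11"),
         ("1o", "oo"), ("1o", "oo"), ("1o", "1o")]
genotypes = [genotype(a, b) for a, b in pairs]
print(genotypes)
```

The mapping is not invertible: a genotype with two or more `*` sites is compatible with several haplotype pairs, which is what makes the population problem ambiguous.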
THE HAPLOTYPING PROBLEM
Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)
Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome/indiv.)
For the individual problem, the input is erroneous haplotype data, from sequencing
For the population problem, the input is ambiguous genotype data, from screening
OBJ is led by Occam’s razor: find a minimum explanation of the observed data under a given hypothesis (a.k.a. parsimony principle)
Theory and Results
Single individual
- Polynomial Algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)
- Polynomial Algorithms for bounded-length gapped haplotyping (BLIR 02)
- NP-hardness for general gapped haplotyping (LBILS 01)
Population
- APX-hardness (Gusfield 00)
- Reduction to Graph-Theoretic model and I.P. approach (Gusfield 01)
- New formulations and Disease Detection (L, Ravi, Rizzi, 02)
- Exact algorithms for min-size solution (L, Serafini 2011)
- Heuristics (Tininini, L, Bertolazzi 2010)
The Single-Individual Haplotyping Problem
Shotgun Assembly of a Chromosome [Weber and Myers, 1997]
[Figure: the chromosome ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT goes through fragmentation, sequencing and assembly]
MAIN ERROR SOURCES
- Sequencing errors: ACTGCCTGGCCAATGGAACGGACAAG read as CTGGCCAAT, CATTGGAAC, AATGGAACGGA
- Contaminants
Given errors, the data may be inconsistent with exactly 2 haplotypes
Hence, the assembler is unable to build 2 chromosomes
PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes
[Figure: aligned fragments (ACTGAAAGCGA, ACTAGAGACAGCATGACTGATAGC, GTAGAGTCAACTG, TCGACTAGA, CATGACTGA, CGATCCATCG, TCAGCACTGAAA, ATCGATC, AGCATG) with SNP values 1 1 O O O 1 1 1 1 1 O]
The data: a SNP matrix
Snips 1,..,n (columns), Fragments 1,..,m (rows)
     1 2 3 4 5 6 7 8 9
  1  - - - O X X O O -
  2  - O - O X - - - X
  3  X X O X X - - - -
  4  O O X - - - - O -
  5  - - - - - - - X O
  6  - - - - O O O X -
Fragment conflict: can’t be on same haplotype
Fragment Conflict Graph GF(M): one node per fragment (1,..,6), one edge per fragment conflict
We have 2 haplotypes iff GF is BIPARTITE
PROBLEM (Fragment Removal): make GF Bipartite
Example: with fragment 6 removed, the remaining fragments split into two haplotypes
     1 2 3 4 5 6 7 8 9
  1  - - - O X X O O -
  2  - O - O X - - - X
  4  O O X - - - - O -
     O O X O X X O O X

  3  X X O X X - - - -
  5  - - - - - - - X O
     X X O X X - - X O
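The fragment conflict graph and the fragment removal problem can be sketched on the 6 x 9 matrix above. This is a brute-force illustration, not the algorithm from the talk. All function names are hypothetical, and the exhaustive search is exponential, so it is usable only on toy instances.

```python
from collections import deque
from itertools import combinations

# The SNP matrix from the slide: one string per fragment, '-' = no call.
M = {
    1: "---OXXOO-",
    2: "-O-OX---X",
    3: "XXOXX----",
    4: "OOX----O-",
    5: "-------XO",
    6: "----OOOX-",
}


def in_conflict(r, s):
    """Two fragments conflict if they disagree at a SNP where both have a call."""
    return any(a != b and a != "-" and b != "-" for a, b in zip(r, s))


def conflict_graph(rows):
    """Edges of GF(M): one edge per pair of conflicting fragments."""
    return {(i, j) for i, j in combinations(sorted(rows), 2)
            if in_conflict(rows[i], rows[j])}


def two_colouring(nodes, edges):
    """Return a node -> 0/1 colouring if the graph is bipartite, else None."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    colour = {}
    for start in nodes:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # odd cycle found
    return colour


def min_fragment_removal(rows):
    """Exhaustive search for a smallest fragment set whose removal makes GF bipartite."""
    ids = sorted(rows)
    for size in range(len(ids) + 1):
        for removed in combinations(ids, size):
            kept = {i: rows[i] for i in ids if i not in removed}
            if two_colouring(list(kept), conflict_graph(kept)) is not None:
                return set(removed)


edges = conflict_graph(M)
removed = min_fragment_removal(M)
without_6 = {i: r for i, r in M.items() if i != 6}
colouring = two_colouring(list(without_6), conflict_graph(without_6))
print(sorted(edges), removed, colouring)
```

On this instance one removal is enough. The search may return a different optimal fragment than the one removed on the slide, since more than one single-fragment removal works here.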
Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph
- NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]
- O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]
- not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
Are there cases of M for which GF(M) is easier?
YES: the gapless M
---OXXOO---OXOOX--- gap
---OXXOOXOXOXOOX--- gapless
---OXX--XO----OX--- 2 gaps
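The three example rows can be checked with a small helper that counts gaps. It is a sketch, and `count_gaps` is a hypothetical name. A gap is taken to be a maximal run of `-` strictly between the first and the last call of a fragment, so leading and trailing `-` do not count.

```python
import re


def count_gaps(fragment: str) -> int:
    """Number of maximal runs of '-' between the first and last call."""
    core = fragment.strip("-")          # drop the uncovered ends
    return len(re.findall(r"-+", core))  # each remaining run is one gap


for row in ("---OXXOO---OXOOX---", "---OXXOOXOXOXOOX---", "---OXX--XO----OX---"):
    print(row, count_gaps(row))
```

A matrix is gapless when every row has zero gaps under this count.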
Why gaps?
Sequencing errors (don’t call with low confidence)
---OOXX?XX--- ===> ---OOXX-XX---
Celera’s mate pairs
attcgttgtagtggtagcctaaatgtcggtagaccttga
attcgttgtagtggtagcctaaatgtcggtagaccttga
THEOREM
For a gapless M, the Min Fragment Removal Problem is Polynomial
NOTE: Does not need to be gapless. Enough if it can be sorted to become such (Consecutive Ones Property, Booth and Lueker, 1976)
An O(nm + n^3) D.P. algo
  1  - O O X X O O - -
  2  - - X O X X O - -
  3  - - - X X O - - -
  4  - - - - O O X O -
  5  - - - - - X O X O
LFT(i), RGT(i): first and last column with a call in row i
sort according to LFT
D(i; h, k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)
D(i; h, k) = D(i-1; h, k)      if i, k compatible and RGT(i) <= RGT(k), or i, h compatible and RGT(i) <= RGT(h)
D(i; h, k) = 1 + D(i-1; h, k)  otherwise
OPT is min_{h,k} D(n; h, k) and can be found in time O(nm + n^3)
Th: NP-Hard if 2 gaps per fragment
proof: (simple) use the fact that for every G there is M s.t. G = GF(M), and reduce from Max Bip. Induced Subgraph on 3-regular graphs
Th: NP-Hard even if 1 gap per fragment
proof: technical. Reduction from MAX2SAT
WITH GAPS…
But gaps must be long for the problem to be difficult.
We have an O(2^{2L} mn + 2^{3L} n^3) D.P. for MFR on a matrix with total gap length L
What for MFR with gaps? Why not ILP...
min \sum_f x_f
\sum_{f \in C} x_f >= 1 for all odd cycles C
x \in {0,1}^n
[Figure: graph on nodes 1,..,5 with fractional values 1/2, 1/3, 1/4, 1/2, 0 and two paths of weight 5/12, 5/12]
Randomized rounding heuristic: round and repeat. Worked well at Celera
Fragment removal is good for getting rid of contaminants.
However, we may want to keep all fragments and correct errors otherwise
A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes
All fragments get assigned to one of the two haplotypes. We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.
SNP conflicts
     1 2 3 4 5 6 7 8 9
     - - - O X X O O -
     - O X O X - - - X
     X X O X X - - - -
     O O X - - - O O -
     - - - - - - X X O
     - - - - O O O X -
[Figure: pairs of columns checked one at a time and marked OK or CONFLICT!]
SNP conflict graph GS(M): 1 node for each SNP (column), edge between conflicting SNPs
[Figure: GS(M) on nodes 1,..,9]
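The SNP conflict graph can be sketched as follows. The slides show the conflict test only as a picture, so the rule used here is an assumption taken from the haplotyping literature: two columns conflict when one fragment has equal calls in both columns and another fragment has different calls. The function names are hypothetical.

```python
from itertools import combinations

# The matrix shown with the SNP-conflict slides, one string per fragment.
M = [
    "---OXXOO-",
    "-OXOX---X",
    "XXOXX----",
    "OOX---OO-",
    "------XXO",
    "----OOOX-",
]


def snp_conflict(rows, i, j):
    """Assumed rule: columns i and j conflict if, among the fragments that
    call both SNPs, some have equal symbols and some have different ones."""
    relations = {r[i] == r[j] for r in rows if r[i] != "-" and r[j] != "-"}
    return len(relations) == 2


def snp_conflict_graph(rows):
    """Edges of GS(M), with SNPs numbered from 1 as on the slide."""
    n = len(rows[0])
    return {(i + 1, j + 1) for i, j in combinations(range(n), 2)
            if snp_conflict(rows, i, j)}


gs_edges = snp_conflict_graph(M)
print(sorted(gs_edges))
```

Under this rule, removing a set of SNPs that touches every edge of GS(M) leaves an independent set.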
THEOREM 1
For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set
THEOREM 2
For a gapless M, GS(M) is a perfect graph
COROLLARY
For a gapless M, the min SNP removal problem is polynomial
PROOF of THEOREM 1 (sketch): by minimal counterexample
Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.
Take an odd cycle in GF
There is a generic structure of hor-vert cycle
[Figure: the matrix M with the odd cycle and its “vertical lines”; annotations: Must be X, Merge the rightmost lines, Still a counterexample!]
There cannot be only one vertical line in odd cycle
We merge rightmost and next to reduce them by 1
Hence, there cannot be a minimal (in n. of vertical lines) counterexample
Note: Theorem not true if there are gaps
M
     1 2 3
  1  O - O
  2  - O X
  3  X X -
GF(M) is a triangle on fragments 1, 2, 3 (not bipartite), while GS(M) has nodes 1, 2, 3 and no edges
THEOREM 2: For a gapless M, GS(M) is a perfect graph
PROOF: GS(M) is the complement of a comparability graph A
Comparability graphs are perfect
Comparability Graphs: undirected graphs that can be oriented to become a partial order
LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,j) or (j,k) is also a SNP conflict
[Figure: columns i, j, k with the two rows of the (i,k) conflict; if the two entries in column j are equal (OO), j conflicts with i; if they are different (OX), j conflicts with k]
I.e. if (i,j) is not a conflict and (j,k) is not a conflict, also (i,k) is not a conflict
So the graph A with an edge (u,v) for each u < v such that u is not in conflict with v is a comparability graph, and GS is the complement of A
NOTE: max independent set on a perfect graph is in P (Grötschel, Lovász, Schrijver, 84)
Hence gapless MSR is polynomial (max stable set on perfect graph).
There are better, D.P., algorithms, O(mn + m^2)
What if gaps?
THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)
Again, gaps must be long for the problem to be difficult.
We have an O(mn^{2L+1} + n^{2L+2}) D.P. for MSR on a matrix with total gap length L
The Population Haplotyping Problem
The input is GENOTYPE data
INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }
Each genotype is explained by two haplotypes
  xx??x = xxoxx + xxxox
  ????x = oooxx + xxxox
  xxoxx = xxoxx + xxoxx
  ?x??x = xxoxx + oxxox
  oooxx = oooxx + oooxx
We will define some objectives for H
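A small check, in the x/o/? notation of this section, that the haplotype set H explains every genotype of G. The helper names `combine` and `explanations` are hypothetical, and the search over pairs is plain enumeration.

```python
from itertools import combinations_with_replacement


def combine(h1, h2):
    """Genotype produced by two haplotypes: '?' marks a heterozygous site."""
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


def explanations(genotype, haplotypes):
    """All unordered pairs of haplotypes (repeats allowed) that give the genotype."""
    return [(a, b)
            for a, b in combinations_with_replacement(sorted(haplotypes), 2)
            if combine(a, b) == genotype]


G = ["xx??x", "????x", "xxoxx", "?x??x", "oooxx"]
H = {"xxoxx", "xxxox", "oooxx", "oxxox"}
explained = {g: explanations(g, H) for g in G}
for g, pairs in explained.items():
    print(g, pairs)
```

Every genotype gets at least one explaining pair, so H is a feasible solution for this input. Whether it is a smallest one is the question the first objective asks.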
1st Objective (parsimony) (open research problem):
minimize |H|
An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence we need at least LB = SQRT(n) haplotypes.
BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. within a factor O(SQRT(n)) of LB
It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)
2nd Objective based on inference rule:
Inference Rule
  xoxxooxoxx   known haplotype h
+ xxoxooxxxo   new (derived) haplotype h’
= x??xoox?x?   known (ambiguous) genotype g
We write h + h’ = g
g and h must be compatible to derive h’
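The inference rule can be sketched as two small functions. The names `compatible` and `derive` are hypothetical. Compatibility is taken to mean that h agrees with g at every non-ambiguous site.

```python
def compatible(h, g):
    """h can be one of the two haplotypes of g: it matches g wherever g is not '?'."""
    return len(h) == len(g) and all(b == "?" or a == b for a, b in zip(h, g))


def derive(h, g):
    """Return the haplotype h' with h + h' = g, or None if h and g are incompatible."""
    if not compatible(h, g):
        return None
    flip = {"x": "o", "o": "x"}
    # At a heterozygous site h' carries the other allele; elsewhere it equals h.
    return "".join(flip[a] if b == "?" else a for a, b in zip(h, g))


h_new = derive("xoxxooxoxx", "x??xoox?x?")
print(h_new)
```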
2nd Objective (Clark, 1990)
1. Start with H = nonambiguous genotypes
2. while exists ambiguous genotype g in G
3.   take h in H compatible with g and let h + h’ = g
4.   set H = H + {h’} and G = G - {g}
5. end while
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
Example: G = { oooo, xooo, ??oo, xx?? }
- resolving ??oo with oooo derives xxoo; then xx?? with xxoo derives xxxx: SUCCESS
- resolving ??oo with xooo derives oxoo: FAILURE (can’t resolve xx??)
OBJ: find order of application rule that leaves the fewest elements in G
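The non-determinism of step 3 can be made visible with a sketch of the procedure in which the choice is delegated to a `pick` callback. The callback, the function names and the bookkeeping are assumptions of this illustration, not part of Clark's description.

```python
def compatible(h, g):
    """h matches g at every site where g is not ambiguous."""
    return all(b == "?" or a == b for a, b in zip(h, g))


def derive(h, g):
    """Haplotype h' such that h + h' = g (h must be compatible with g)."""
    flip = {"x": "o", "o": "x"}
    return "".join(flip[a] if b == "?" else a for a, b in zip(h, g))


def clark(genotypes, pick):
    """Apply the inference rule until no ambiguous genotype can be resolved.

    `pick` receives the list of applicable (g, h) pairs and returns one of them.
    Returns the haplotype list H and the genotypes left unresolved.
    """
    H = [g for g in genotypes if "?" not in g]
    unresolved = [g for g in genotypes if "?" in g]
    while unresolved:
        candidates = [(g, h) for g in unresolved for h in H if compatible(h, g)]
        if not candidates:
            break  # FAILURE: nothing in H is compatible with what is left
        g, h = pick(candidates)
        h_new = derive(h, g)
        if h_new not in H:
            H.append(h_new)
        unresolved.remove(g)
    return H, unresolved


G = ["oooo", "xooo", "??oo", "xx??"]
H1, left1 = clark(G, pick=lambda c: c[0])    # resolves ??oo with oooo first
H2, left2 = clark(G, pick=lambda c: c[-1])   # resolves ??oo with xooo first
print(H1, left1)
print(H2, left2)
```

The two runs differ only in which compatible haplotype is used for ??oo, and one ends with every genotype resolved while the other gets stuck on xx??.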
- Problem is APX-hard (Gusfield, 00)
- Graph-Model + Integer Programming for practical solution (G., 01)
1. expand genotypes
   x??o? expands to xxxox, xxxoo, xxoox, xxooo, xoxox, xooox, xoxoo, xoooo
2. create (h, h’) if exists g s.t. h’ can be derived from g and h
3. Largest number of nodes in forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved
Hence, find largest number of nodes in forest rooted at unambiguous genotypes. Use I.P. model with vars x(ij).
This reduction is exponential. Is there a better practical approach?
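Step 1, the expansion of a genotype into all compatible haplotypes, can be sketched as below. The function name `expand` is hypothetical. The output size is 2^a for a genotype with a ambiguous sites, which is where the exponential blow-up comes from.

```python
from itertools import product


def expand(genotype):
    """All haplotypes compatible with a genotype: each '?' becomes x or o."""
    choices = [("x", "o") if c == "?" else (c,) for c in genotype]
    return ["".join(p) for p in product(*choices)]


nodes = expand("x??o?")
print(len(nodes), nodes)
```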
3rd Objective (open research problem): Disease Detection
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }
  xx??x = xxoxx + xxxox
  ????x = oooxx + xxxox
  ??oxx = xxoxx + oooxx
  ?x??x = xxoxx + oxxox
  oooxx = oooxx + oooxx
H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy one has none
minimize |H|
Genome Rearrangements and Evolutionary Distances
Each species has a genome (organized in pairs of chromosomes)
tcgtgatggat………………ttgatggattga
tcgattatggat………………ttttgatatcca
Genomes evolve by means of
• Insertions
• Deletions
• Inversions
• Transpositions
• Translocations
of DNA regions
Combinatorial problem: given 2 permutations P, Q and operators in a set F, find a shortest sequence f1, ..., fk of operators such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)
Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)
We focus on inversions, which are the most important in Nature
Example:
5 6 4 8 3 2 1 9 7
1 2 3 8 4 6 5 9 7
1 2 3 8 4 5 6 9 7
1 2 3 6 5 4 8 9 7
1 2 3 6 5 4 8 7 9
1 2 3 4 5 6 8 7 9
1 2 3 4 5 6 7 8 9
There is also a SIGNED VERSION of the problem!
Example:
+5 +6 -4 -8 -3 -2 -1 -9 +7
+1 +2 +3 +8 +4 -6 -5 -9 +7
+1 +2 +3 +8 +4 +5 +6 -9 +7
+1 +2 +3 -6 -5 -4 -8 -9 +7
+1 +2 +3 -6 -5 -4 -8 -7 +9
+1 +2 +3 +4 +5 +6 -8 -7 +9
+1 +2 +3 +4 +5 +6 +7 +8 +9
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)
Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
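The signed example can be replayed with a short sketch. The function name `invert` is hypothetical. The interval list `steps` is not given on the slides: it was read off by comparing consecutive lines of the example, with 0-based inclusive positions.

```python
def invert(perm, i, j):
    """Signed inversion: reverse perm[i..j] (inclusive) and flip each sign."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


start = [+5, +6, -4, -8, -3, -2, -1, -9, +7]
steps = [(0, 6), (5, 6), (3, 6), (7, 8), (3, 5), (6, 7)]  # inferred intervals

p = start
history = [p]
for i, j in steps:
    p = invert(p, i, j)
    history.append(p)

for line in history:
    print(" ".join(f"{x:+d}" for x in line))
```

Dropping the signs from each printed line gives the unsigned example, which uses the same six intervals.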
The concept of breakpoint
π = 5 7 8 2 1 4 3 6 9, extended with π(0) = 0 and π(n+1) = 10
Breakpoint at position i if |π(i) - π(i+1)| > 1
d(π) = inversion distance
b(π) = # breakpoints
TRIVIAL BOUND: d(π) >= b(π) / 2
Example: d(π) >= 6 / 2 = 3
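The breakpoint count of the example can be reproduced with a few lines. The function name is hypothetical. The permutation is framed by 0 and n+1 as on the slide. The bound holds because one inversion changes only two adjacencies, so it removes at most two breakpoints.

```python
def breakpoints(perm):
    """Count positions i with |p(i) - p(i+1)| > 1 after framing with 0 and n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) > 1)


pi = [5, 7, 8, 2, 1, 4, 3, 6, 9]
b = breakpoints(pi)
lower_bound = -(-b // 2)  # ceil(b / 2)
print(b, lower_bound)
```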
The Breakpoint Graph
[Figure: breakpoint graph of 0 5 7 8 2 1 4 3 6 9 10]
Each node has degree... 0, 2 or 4 … hence the graph can be decomposed in cycles!
Alternating cycle decomposition
c(π) = max # cycles in alternating decomposition
VERY STRONG BOUND: d(π) >= b(π) - c(π)
Example: c(π) = 2 and d(π) >= 6 - 2 = 4
The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition
A variable x_C for each cycle (exponential # of vars…)
A constraint \sum_{C \ni e} x_C = 1 for each edge e
Objective: maximize \sum_C x_C
PRIMAL
max \sum_C x_C
\sum_{C \ni e} x_C = 1 for all edges e
x_C \in {0,1} for all alt. cycles C
DUAL
min \sum_e y_e
\sum_{e \in C} y_e >= 1 for all alt. cycles C
y_e \in R for all edges e
Pricing out the cycles for which y*(C) < 1
Split the graph in two copies
Connect twins
A perfect matching corresponds to (a set of) alternating cycles
The weight of the matching is the y*-weight of the cycles
[Figure: two copies of the breakpoint graph of 0 5 7 8 2 1 4 3 6 9 10, with weights .2, .4, .5, 1, .6, 0]
Forcing a cycle to use a certain node
[Figure: the same two copies, with weights .2, .4, .5, 1, .6, 100000]
- These cycles would not use the same node twice, but with a simple trick it is possible to model this (OMISSIS)
BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)
BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)
KS can solve at most n = 40. It takes days for n = 50
CLN can solve for n = 200. It takes a few seconds (say 5) for n = 100
NP-hard problem practically solved to optimality!
Statistical view of evolutionStatistical view of evolution
• Genome evolve by random inversions
• It’s like a random walk on a huge graph with an edge for
each permutation an edge for each inversion
• It is not clear why the shortest solution should be the
one followed by Nature (in fact, often it isn’t)
• We want to find the most likely number of inversions
that lead from (1 2 … n ) to
• We use the expected number of breakpoints after k
inversions as a way to guess the # of inversions
Let B(k) be the (r.v.) number of breakpoints after k random inversions from (1 … n)
Given a π obtained by h random inversions from (1 … n), we want to estimate h
The inversion distance is only a lower bound: h >= d(π), but the gap could be big
We estimate E[B(k)]. Then, faced with some π, we pick h such that E[B(h)] is as close as possible to b(π) (maximum likelihood). CL, 2000, have shown:
$$E[B(k)] = (n-1)\left(1 - \left(\frac{n-3}{n-1}\right)^{k}\right)$$
Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions
Example: n = 200, k (u.a.r. in 1…n) inversions
k     k'    d(π)  b
8     8     8     16
19    19    19    34
68    67    67    98
69    73    68    104
73    79    73    109
85    91    83    120
86    85    83    115
87    90    84    119
118   117   109   138
184   184   135   168
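The estimation step can be sketched in a few lines. This is a minimal illustration, assuming the closed form E[B(k)] = (n-1)(1 - ((n-3)/(n-1))^k) and the rule "pick h whose E[B(h)] is closest to the observed b"; the function names are hypothetical and are not taken from the original work.

```python
import math


def expected_breakpoints(n, k):
    """E[B(k)] = (n-1) * (1 - ((n-3)/(n-1))**k) for k random inversions."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)


def estimate_inversions(n, b):
    """Return the integer h whose E[B(h)] is closest to the observed b.

    Assumes n > 3 and 0 <= b < n - 1 (hypothetical helper, for illustration).
    """
    if b <= 0:
        return 0
    if b >= n - 1:
        raise ValueError("b must be smaller than n - 1")
    ratio = (n - 3) / (n - 1)
    # Invert the closed form over the reals, then compare the two
    # neighbouring integers.
    h_real = math.log(1 - b / (n - 1)) / math.log(ratio)
    candidates = (math.floor(h_real), math.ceil(h_real))
    return min(candidates, key=lambda h: abs(expected_breakpoints(n, h) - b))
```

For n = 200, an observed b = 16 gives an estimate of 8 inversions, and b = 168 gives 184, while d(π) alone would only report a lower bound.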
Protein Structure Alignments: the Maximum Contact Map Overlap Problem
A Protein is a complex molecule with a primary, linear structure (a sequence of amino acids) and a 3-dimensional structure (the protein fold).
Protein STRUCTURE determines its FUNCTION
For instance, the Drug Design problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock onto it.
Motivation: Structure Alignment is important for:
- Discovery of Protein Function (shape determines function)
- Search in 3D databases
- Protein Classification and Evolutionary Studies
- ...
Problem: Align two 3D protein structures
Contact Maps
Unfolded protein
Folded protein = contacts
Contact map = graph
OBJECTIVE: align 3D folds of proteins = align contact maps
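As an illustration of "contact map = graph", here is a minimal sketch that builds the contact edges from residue coordinates. The slides do not say how a contact is defined; the distance cutoff (7.0) and the minimum sequence separation (2) used below are assumptions made only for this example, and `contact_map` is a hypothetical name.

```python
import math


def contact_map(coords, threshold=7.0, min_gap=2):
    """Return the contacts as a list of residue index pairs (i, j), i < j.

    coords: one (x, y, z) point per residue, in sequence order.
    threshold, min_gap: assumed parameters, not taken from the slides.
    """
    contacts = []
    for i in range(len(coords)):
        # Skip residues that are adjacent along the chain.
        for j in range(i + min_gap, len(coords)):
            if math.dist(coords[i], coords[j]) <= threshold:
                contacts.append((i, j))
    return contacts
```

The residues are the nodes of the graph and the returned pairs are its edges.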
Contact Map Alignments
Non-crossing Alignments
[Figure: Protein 1 and Protein 2, with a non-crossing map of residues in protein 1 and protein 2]
The value of an alignment
Value = 3
We want to maximize the value
NP-Hard
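A minimal sketch of how the value can be evaluated for a given alignment, assuming the value is the number of contacts of protein 1 whose two endpoints are mapped onto a contact of protein 2 (the "shared contacts"). The data layout (contacts as index pairs, the alignment as a dict) and the function names are hypothetical.

```python
def is_noncrossing(alignment):
    """True if the residue map is strictly increasing (no two lines cross)."""
    pairs = sorted(alignment.items())
    return all(p[1] < q[1] for p, q in zip(pairs, pairs[1:]))


def alignment_value(contacts1, contacts2, alignment):
    """Count the contacts of protein 1 that land on a contact of protein 2.

    contacts1, contacts2: iterables of residue index pairs (i, j).
    alignment: dict mapping residues of protein 1 to residues of protein 2.
    """
    targets = {frozenset(c) for c in contacts2}
    value = 0
    for i, j in contacts1:
        if i in alignment and j in alignment:
            if frozenset((alignment[i], alignment[j])) in targets:
                value += 1
    return value
```

Evaluating one alignment is cheap; the hard part is searching over all non-crossing alignments for the one of maximum value.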
Integer Programming Formulation
0-1 VARIABLES
yef for e and f contacts (e in protein 1, f in protein 2)
CONSTRAINTS
yef + ye'f' <= 1 (for each pair of sharings (e,f), (e',f') that are incompatible)
OBJECTIVE
$\max \sum_{e,f} y_{ef}$
Independent Set Problem
It's just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
[Figure: contacts e, e', e'' and f, f', f''; nodes ef, e'f', e''f'' of Gy]
|Gy| = |E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)
The best exact algorithms for independent set can solve instances with at most a few hundred nodes
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES
xij for i and j residues
NEW CONSTRAINTS
xij + xi'j' <= 1 (for crossing lines [i,j] and [i',j'])
y(ip)(jq) <= xij and y(ip)(jq) <= xpq
[Figure: contacts e = (i,p) and f = (j,q), with sharing yef and lines xij, xpq]
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
• Gx is much smaller than Gy
• Gx has nice properties (it's a perfect graph)
• It's easier to find large independent sets in Gx
Non-crossing constraints can be extended to
CLIQUE CONSTRAINTS
$\sum_{[i,j] \in M} x_{ij} \le 1$
for all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and Gx perfect) imply a strong bound!
Structure of Maximal cliques in Gx
1. Pick two subsets of same size
2. Connect them in a zig-zag fashion
3. Throw in all lines included in a zig or a zag
The result is a maximal clique in Gx
Separation of Clique Inequalities
PROBLEM
There exist exponentially many such cliques ($O(2^{2n})$ inequalities).
We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*:
$\sum_{[i,j] \in M} x^*_{ij} > 1$
THEOREM
We can find the most violated clique inequality in time $O(n^2)$
PROOF (sketch)
1) Clique = zigzag path
2) Flip one graph: zigzag → left-to-right
[Figure: residues 1 2 3 4 5 6 7 8 of one graph and, flipped, 8 7 6 5 4 3 2 1 of the other]
3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use dynamic programming to find the longest path in the grid, in time $O(n^2)$
Separation of cliques
Create n1 x n2 grid
Orient all edges and give weights x*iu
[Figure: n1 x n2 grid with a generic node (i, u)]
There is a violated clique iff the longest A,B path has length > 1
A=(1,n2)
B=(n1,1)
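The longest-path computation can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the authors' code: two lines are taken to conflict when they cross or share an endpoint, the weight x*iu is placed on grid node (i, u) rather than on arcs, and indices are 0-based, so A = (0, n2-1) and B = (n1-1, 0). The name `most_violated_clique` is hypothetical.

```python
def most_violated_clique(x_star):
    """Longest A-B staircase path in the n1 x n2 grid of LP values.

    x_star[i][u] is the LP value of line [i, u]. A path moves from
    (0, n2-1) to (n1-1, 0) by increasing i or decreasing u, so every two
    cells on it are conflicting lines. Returns (path weight, lines on the
    path with positive value). Runs in O(n1 * n2) time.
    """
    n1, n2 = len(x_star), len(x_star[0])
    best = [[0.0] * n2 for _ in range(n1)]
    prev = [[None] * n2 for _ in range(n1)]
    for i in range(n1):
        for u in range(n2 - 1, -1, -1):
            options = []
            if i > 0:
                options.append((best[i - 1][u], (i - 1, u)))
            if u < n2 - 1:
                options.append((best[i][u + 1], (i, u + 1)))
            if options:
                value, prev[i][u] = max(options)
            else:
                value = 0.0  # start cell A
            best[i][u] = value + x_star[i][u]
    # Walk back from B to A, collecting the lines that carry weight.
    clique, cell = [], (n1 - 1, 0)
    while cell is not None:
        if x_star[cell[0]][cell[1]] > 0:
            clique.append(cell)
        cell = prev[cell[0]][cell[1]]
    clique.reverse()
    return best[n1 - 1][0], clique
```

If the returned weight exceeds 1, the returned lines give a violated clique inequality to add to the LP; otherwise no clique inequality is violated by x*.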
Gx is a Perfect Graph
We show why polynomial separation is possible:
Gx is weakly triangulated (no chordless cycles of length >= 5 in Gx or in its complement)
=> Gx is perfect (Hayward, 1985)
PROOF (Sketch, for Gx)
[Figure: lines L1, L2, …, L7 forming a cycle in Gx]
Suppose lines L1, …, L7 form a chordless cycle in Gx: consecutive lines cross, non-consecutive lines don't.
L1 and L3 don't cross. Wlog RIGHT(L3, L1)
For i=4,5,… Li crosses Li-1 but not L1 => RIGHT(Li, L1)
We get LEFT(L1, {L3, L4, L5, L6})
A symmetric argument started at L6, with LEFT(L1, L6), implies LEFT(Li, L6) for i=2,3,4,5
Then {L3, L4, L5} are between L1 and L6
But L7 crosses L1 and L6, and so should cross them all!
The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001). It can be applied to small or moderate proteins (up to 80 residues/150 contacts)
In 2002, a new approach by Caprara and Lancia, based on LAGRANGIAN RELAXATION. The approach is borrowed from Quadratic Assignment. With the new approach we can solve important proteins (up to 150 residues/300 contacts)
What about Heuristics?
E.g., genetic algorithms…
Genetic Algorithm Overview
• A Population of candidate solutions that evolve (improve) over time
• Recombination creates new candidate solutions via crossover and mutation
[Figure: Population at time t → Recombination operators → Evaluation function → Population at time t+1]
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent (edges that conflict with existing edges are not copied)
– Add random edges to fill any remaining space
[Figure: Blue Parent, Red Parent, Offspring]
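The three crossover steps can be sketched as follows. This is a hypothetical illustration, not the implementation behind the slides: a solution is assumed to be a list of alignment edges (i, u), two edges are assumed to conflict when they cross or share an endpoint, and `keep_prob` and `fill_tries` are made-up parameters. The sketch builds one child; swapping the parents gives the second offspring.

```python
import random


def conflicts(a, b):
    """True if edges a=(i,u) and b=(p,q) cross, share an endpoint, or are equal."""
    return (a[0] - b[0]) * (a[1] - b[1]) <= 0


def crossover(parent_a, parent_b, n1, n2, rng, keep_prob=0.5, fill_tries=20):
    """Build one child alignment from two parent alignments."""
    # Step 1: copy a random set of edges from the first parent.
    child = [e for e in parent_a if rng.random() < keep_prob]
    # Step 2: copy every edge of the other parent that fits.
    for e in parent_b:
        if not any(conflicts(e, f) for f in child):
            child.append(e)
    # Step 3: try a few random edges to fill the remaining space.
    for _ in range(fill_tries):
        e = (rng.randrange(n1), rng.randrange(n2))
        if not any(conflicts(e, f) for f in child):
            child.append(e)
    return sorted(child)
```

Passing an explicit `random.Random(seed)` as `rng` keeps runs reproducible.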
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
– Select a set of endpoints to shift (an edge that "falls off" the end of the contact map is removed)
– Randomly add new edges
Computational Results
• 269 proteins
– 70 to 100 residues
– 80 to 140 contacts
• Picked 10,000 pairs of proteins out of 36046 possible
• Took a weekend on a PC
• 500 were solved to optimality
• 2500 had a gap <= 10 contacts
Skolnick Clustering Test
Skolnick Results
• Four Families
1 Flavodoxin-like fold, CheY-related: alpha-beta, 8 structures, up to 124 residues, 15-30% sequence similarity, < 3Å RMSD
2 Plastocyanin: beta, 8 structures, up to 99 residues, 35-90% sequence similarity, < 2Å RMSD
3 TIM Barrel: alpha-beta, 11 structures, up to 250 residues, 30-90% sequence similarity, < 2Å RMSD
4 Ferritin: alpha, 6 structures, up to 170 residues, 7-70% sequence similarity, < 4Å RMSD

Family  Style       Residues  Seq. Sim.  RMSD   Proteins
1       alpha-beta  124       15-30%     < 3Å   1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy
2       beta        99        35-90%     < 2Å   1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt
3       alpha-beta  250       30-90%     < 2Å   1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim
4       alpha       170       7-70%      < 4Å   1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd
Clustering
Define score(P1, P2) = (# shared contacts) / (min # of contacts of P1, P2), so that 0 <= score(P1, P2) <= 1
Put P1, P2 in same family if score(P1, P2) >= threshold
If P1, P2 too big, use G.A. and local search to compute score
L.P. then gives bounds:
HEUR score <= OPT score <= LP bound
and we know how far off OPT we are
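A minimal sketch of the score and of the family test, assuming the number of shared contacts has already been computed (exactly, or heuristically as a lower bound). The function names are hypothetical, and the threshold is left as a parameter because no value is fixed here.

```python
def cmo_score(shared_contacts, num_contacts_1, num_contacts_2):
    """score(P1, P2) = shared contacts / min(#contacts of P1, #contacts of P2)."""
    return shared_contacts / min(num_contacts_1, num_contacts_2)


def same_family(shared_contacts, num_contacts_1, num_contacts_2, threshold):
    """Put P1 and P2 in the same family when the score reaches the threshold."""
    return cmo_score(shared_contacts, num_contacts_1, num_contacts_2) >= threshold


def score_interval(heuristic_shared, lp_bound, num_contacts_1, num_contacts_2):
    """Bracket the optimal score: HEUR score <= OPT score <= LP bound score."""
    m = min(num_contacts_1, num_contacts_2)
    return heuristic_shared / m, min(1.0, lp_bound / m)
```

When both ends of the interval fall on the same side of the threshold, the family decision is certain even without an optimal alignment.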
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each other and against the remaining.
TYPICAL BEHAVIOUR
score   proteins were…
0.05    MISMATCH
0.1     MISMATCH
0.15    MISMATCH
0.2     MISMATCH
0.25    MISMATCH
0.3     MISMATCH
0.35    MATCH
…       …
1.0     MATCH
Skolnick Results
• Performance
– 528 alignments
– 1.3% false negative
– 0.0% false positive
Clustering
Computed, for the first time, provably optimal alignments for 150 pairs (inter-family)
Used the CMO value to cluster: it retrieves the clusters.
Set S(i,j) = 1 if the CMO score of (i,j) is at least a threshold, S(i,j) = 0 otherwise
Use TSP to find a block diagonal structure for S
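A minimal sketch of the thresholding step. Note that it does not use TSP: as a simpler stand-in, it groups proteins by the connected components of S, which yields the same blocks whenever S is exactly block diagonal after reordering. The function names and the threshold parameter `t` are hypothetical.

```python
def threshold_matrix(scores, t):
    """S(i,j) = 1 if scores[i][j] >= t, else 0."""
    n = len(scores)
    return [[1 if scores[i][j] >= t else 0 for j in range(n)] for i in range(n)]


def blocks(S):
    """Group indices by connected components of the 0-1 matrix S.

    This replaces the TSP-based reordering with a plain graph search.
    """
    n = len(S)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, group = [start], []
        while stack:
            v = stack.pop()
            group.append(v)
            for w in range(n):
                if S[v][w] and not seen[w]:
                    seen[w] = True
                    stack.append(w)
        groups.append(sorted(group))
    return groups
```

Listing the proteins group by group gives an ordering under which S shows its diagonal blocks.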
Last Open Problem
? ?