Optimization Problems for Polymorphisms of Single Nucleotides

Polymorphisms

A polymorphism is a feature

- common to everybody

- not identical in everybody

- the possible variants (alleles) are just a few

E.g. think of eye-color

Or blood-type for a feature not visible from outside

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

- SNPs are the predominant form of human variation

- Used for drug design, disease studies, forensics, evolutionary studies...

- On average one every 1,000 bases

- Multimillion dollar SNP consortium project

- Goal: associate SNPs (or groups of SNPs) to genetic diseases

- 1st step: build maps of several thousand SNPs

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

atcggattagttagggcacaggacgt

GENOTYPE: “union” of 2 haplotypes

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact).

Call them 1 and O. Also, call * the fact that a site is heterozygous

HAPLOTYPE: string over {1, O}

GENOTYPE: string over {1, O, *}
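The change of symbols can be illustrated with a minimal Python sketch. The function name `genotype` and the sample strings are illustrative assumptions, not part of the slides; the only rule used is the one stated above (equal alleles are kept, different alleles become `*`).

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes over {1, O} into a genotype over {1, O, *}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # A site is homozygous when both chromosomes agree, heterozygous (*) otherwise.
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


print(genotype("1O1O", "1OO1"))  # 1O**
```

Going the other way is the hard part: a genotype with several `*` sites is compatible with many haplotype pairs, which is the ambiguity the population problem has to resolve.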

THE HAPLOTYPING PROBLEM

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome per individual)

For the individual problem, the input is erroneous haplotype data, from sequencing

For the population problem, the input is ambiguous genotype data, from screening

The objective is led by Occam’s razor: find a minimum explanation of the observed data under the given hypothesis (a.k.a. parsimony principle)

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)

- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)

- NP-hardness for general gapped haplotyping (LBILS 01)

Population

- APX-hardness (Gusfield 00)

- Reduction to graph-theoretic model and I.P. approach (Gusfield 01)

- New formulations and Disease Detection (L, Ravi, Rizzi 02)

- Exact algorithms for min-size solution (L, Serafini 2011)

- Heuristics (Tininini, L, Bertolazzi 2010)

The Single-Individual Haplotyping problem

[Figure: Shotgun Assembly of a Chromosome [Weber and Myers, 1997]. Steps shown: fragmentation, sequencing, assembly.]

MAIN ERROR SOURCES

- Sequencing errors

- Contaminants

Given errors, the data may be inconsistent with exactly 2 haplotypes

Hence, the assembler is unable to build 2 chromosomes

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes

[Figure: fragments aligned against the assembled sequence, with the values read at the SNP sites (1 1 O O O 1 1 1 1 1 O).]

The data: a SNP matrix

SNPs 1,..,n (columns), Fragments 1,..,m (rows)

     1 2 3 4 5 6 7 8 9
  1  - - - O X X O O -
  2  - O - O X - - - X
  3  X X O X X - - - -
  4  O O X - - - - O -
  5  - - - - - - - X O
  6  - - - - O O O X -

Fragment conflict: two fragments that disagree at a SNP they both cover can’t be on the same haplotype

Fragment Conflict Graph GF(M): a node for each fragment, an edge between each pair of conflicting fragments

We have 2 haplotypes iff GF is BIPARTITE

PROBLEM (Fragment Removal): make GF Bipartite

Example: removing fragment 6 leaves two groups of fragments, each merging into one haplotype

     1 2 3 4 5 6 7 8 9
  1  - - - O X X O O -
  2  - O - O X - - - X
  4  O O X - - - - O -
     O O X O X X O O X

  3  X X O X X - - - -
  5  - - - - - - - X O
     X X O X X - - X O
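The bipartiteness test on the example matrix can be sketched in a few lines of Python. This is only an illustration under the conflict rule stated on the slides (two fragments conflict when they disagree at a SNP both cover); the names `conflict` and `two_color` are assumptions made for this sketch.

```python
from collections import deque

# The 6 x 9 SNP matrix of the example, one string per fragment.
M = [
    "---OXXOO-",
    "-O-OX---X",
    "XXOXX----",
    "OOX----O-",
    "-------XO",
    "----OOOX-",
]


def conflict(r, s):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


def two_color(rows):
    """Return a 0/1 side for every fragment, or None if GF is not bipartite."""
    n = len(rows)
    adj = [[j for j in range(n) if j != i and conflict(rows[i], rows[j])]
           for i in range(n)]
    side = [None] * n
    for start in range(n):
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if side[v] is None:
                    side[v] = 1 - side[u]
                    queue.append(v)
                elif side[v] == side[u]:
                    return None  # two conflicting fragments on the same side
    return side


print(two_color(M))      # None: fragments 1, 3 and 6 conflict pairwise
print(two_color(M[:5]))  # [0, 0, 1, 0, 1]: fragments {1, 2, 4} versus {3, 5}
```

The second call drops fragment 6 and recovers the same split into two haplotypes as the example.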

Removing the fewest fragments is equivalent to finding a maximum induced bipartite subgraph

- NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]

- O(|V| (log log |V| / log |V|)^2)-approximable [Halldórsson, 1999]

- not O(|V|^ε)-approximable for some ε > 0 [Lund and Yannakakis, 1993]
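For tiny instances the minimum fragment removal can be found by plain enumeration. The sketch below is not one of the algorithms cited in the slides: it is an exponential brute force, written only to make the objective concrete, and the function names are assumptions.

```python
from itertools import combinations

M = ["---OXXOO-", "-O-OX---X", "XXOXX----",
     "OOX----O-", "-------XO", "----OOOX-"]


def conflict(r, s):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


def is_bipartite(rows):
    """2-colour the fragment conflict graph with an explicit stack."""
    color = {}
    for start in range(len(rows)):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(rows)):
                if v != u and conflict(rows[u], rows[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return False
    return True


def min_fragment_removal(rows):
    """Try removal sets of growing size; exponential, for tiny instances only."""
    for k in range(len(rows) + 1):
        for removed in combinations(range(len(rows)), k):
            kept = [r for i, r in enumerate(rows) if i not in removed]
            if is_bipartite(kept):
                return removed


print(len(min_fragment_removal(M)))  # 1
```

On the 6-fragment example a single removal suffices. The enumeration returns the first such fragment it meets, which need not be fragment 6.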

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---OXXOO---OXOOX---   1 gap

---OXXOOXOXOXOOX---   gapless

---OXX--XO----OX---   2 gaps

Why gaps?

- Sequencing errors (don’t call with low confidence)

  ---OOXX?XX--- ===> ---OOXX-XX---

- Celera’s mate pairs

  attcgttgtagtggtagcctaaatgtcggtagaccttga

THEOREM

For a gapless M, the Min Fragment Removal Problem is Polynomial

NOTE: M does not need to be gapless. It is enough if it can be sorted to become such (Consecutive Ones Property, Booth and Lueker, 1976)
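Counting the gaps of a fragment is a one-liner once the leading and trailing `-` are stripped. The Python sketch below uses the three example fragments from the slides; `gap_count` and `is_gapless` are hypothetical helper names, and the Consecutive Ones reordering mentioned in the note is not attempted here.

```python
def gap_count(row: str) -> int:
    """Number of maximal runs of '-' strictly inside the covered region."""
    core = row.strip("-")
    # A run starts at every '-' whose left neighbour is a called SNP.
    return sum(1 for i, c in enumerate(core) if c == "-" and core[i - 1] != "-")


def is_gapless(matrix) -> bool:
    """A matrix is gapless when none of its fragments has an internal gap."""
    return all(gap_count(row) == 0 for row in matrix)


print(gap_count("---OXXOO---OXOOX---"))  # 1
print(gap_count("---OXXOOXOXOXOOX---"))  # 0
print(gap_count("---OXX--XO----OX---"))  # 2
```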

An O(nm + n^3) D.P. algo

  1  - O O X X O O - -
  2  - - X O X X O - -
  3  - - - X X O - - -
  4  - - - - O O X O -
  5  - - - - - X O X O

LFT(i), RGT(i): leftmost and rightmost SNP covered by fragment i

Sort the fragments according to LFT

D(i; h,k) := min cost to solve up to row i, with k, h not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

D(i; h,k) = D(i-1; h,k)        if (i, k compatible and RGT(i) <= RGT(k)) or (i, h compatible and RGT(i) <= RGT(h))

D(i; h,k) = 1 + D(i-1; h,k)    otherwise

OPT is min_{h,k} D(n; h,k) and can be found in time O(nm + n^3)

WITH GAPS…

Th: NP-Hard if 2 gaps per fragment

proof: (simple) use the fact that for every G there is M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs

Th: NP-Hard even if 1 gap per fragment

proof: technical. Reduction from MAX2SAT

But, gaps must be long for the problem to be difficult.

We have an O(2^{2L} mn + 2^{3L} n^3) D.P. for MFR on a matrix with total gaps length L

What for MFR with gaps? Why not ILP...

min \sum_f x_f

\sum_{f \in C} x_f >= 1 for all odd cycles C

x \in {0,1}^n

[Figure: fractional LP solution on the conflict graph, with values such as 1/4, 1/2 and 5/12 on the fragments]

Randomized rounding heuristic: round and repeat. Worked well at Celera

The fragment removal is good to get rid of contaminants.

However, we may want to keep all fragments and correct errors otherwise

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes

All fragments get assigned to one of the two haplotypes. We describe the min SNP removal problem: remove the fewest number of columns from M so that the fragment graph becomes bipartite.

SNP conflicts

  1 2 3 4 5 6 7 8 9
  - - - O X X O O -
  - O X O X - - - X
  X X O X X - - - -
  O O X - - - O O -
  - - - - - - X X O
  - - - - O O O X -

SNP conflict graph GS(M)

- 1 node for each SNP (column)

- edge between conflicting SNPs

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal problem is polynomial

PROOF of THEOREM 1 (sketch): by minimal counterexample

Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.

Take an odd cycle in GF

[Figure: a gapless matrix with the entries involved in the odd cycle highlighted]

There is a generic structure of hor-vert cycle, with “vertical lines”

There cannot be only one vertical line in an odd cycle

We merge the rightmost line and the next one to reduce them by 1

[Figure: two entries of the matrix are marked “Must be X”; after merging the rightmost lines we still have a counterexample]

Hence, there cannot be a minimal (in n. of vertical lines) counterexample

Note: Theorem 1 is not true if there are gaps

     1 2 3
  1  O - O
  2  - O X
  3  X X -

[Figure: GF(M) and GS(M) for this matrix]
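The small matrix with gaps can be checked mechanically. The slides do not spell out when two SNPs conflict, so the Python sketch below assumes the usual definition from the haplotyping literature: two columns conflict if two fragments cover both of them and the resulting 2 x 2 submatrix has three equal symbols and one different. Function names are illustrative.

```python
from itertools import combinations


def snp_conflict(M, i, j):
    """Assumed rule: some 2 x 2 submatrix on columns i, j has 3 equal symbols and 1 different."""
    rows = [(r[i], r[j]) for r in M if r[i] != "-" and r[j] != "-"]
    for (a, b), (c, d) in combinations(rows, 2):
        if (a + b + c + d).count("X") in (1, 3):
            return True
    return False


def fragment_conflict(r, s):
    """Two fragments conflict if they disagree at a SNP that both cover."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))


M = ["O-O", "-OX", "XX-"]
gs = [(i, j) for i, j in combinations(range(3), 2) if snp_conflict(M, i, j)]
gf = [(u, v) for u, v in combinations(range(3), 2) if fragment_conflict(M[u], M[v])]
print(gs)  # []
print(gf)  # [(0, 1), (0, 2), (1, 2)]
```

Under that assumed definition GS(M) has no edges, so it is an independent set, while GF(M) is a triangle and therefore not bipartite.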

THEOREM 2

For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A

Comparability graphs are perfect, and so are their complements

Comparability Graphs: undirected graphs that can be oriented to become a partial order

LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,j) or (j,k) is also a SNP conflict

    i     j     k
  - X O O ? X O X -
  - O X O ? X X X -

The two entries of column j are marked ?

- Equal: j conflicts with i

- Different: j conflicts with k

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, also (i,k) is not a conflict

So the graph A with an edge (u,v) for each u < v with u not in conflict with v is a comparability graph, and GS is the complement of A

NOTE: max independent set on perfect graphs is in P (Lovász, Schrijver, Grötschel, 84)

Hence gapless MSR is polynomial (max stable set on perfect graph).

There are better, D.P., algorithms, O(mn + m^2)

What if gaps?

THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)

Again, gaps must be long for the problem to be difficult.

We have an O(mn^{2L+1} + n^{2L+2}) D.P. for MSR on a matrix with total gaps length L

The Population Haplotyping problem

The input is GENOTYPE data

INPUT: G = { xx??x, ????x, xxoxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

Each genotype is explained by two haplotypes

  genotype   haplotypes
  xx??x      xxoxx + xxxox
  ????x      oooxx + xxxox
  ?x??x      xxoxx + oxxox
  xxoxx      xxoxx + xxoxx
  oooxx      oooxx + oooxx
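Whether a candidate set H explains every genotype in G can be verified by enumerating pairs of haplotypes. The following Python sketch does this for the INPUT and OUTPUT sets shown on the slide; `combine` and `explanation` are names chosen for the sketch, and it only checks a given H, it does not search for a smallest one.

```python
from itertools import combinations_with_replacement


def combine(h1, h2):
    """Genotype of two haplotypes over {x, o}: '?' marks a heterozygous site."""
    return "".join(a if a == b else "?" for a, b in zip(h1, h2))


def explanation(G, H):
    """Map each genotype to one pair of haplotypes from H producing it (or None)."""
    pairs = {}
    for h1, h2 in combinations_with_replacement(H, 2):
        pairs.setdefault(combine(h1, h2), (h1, h2))
    return {g: pairs.get(g) for g in G}


G = ["xx??x", "????x", "xxoxx", "?x??x", "oooxx"]
H = ["xxoxx", "xxxox", "oooxx", "oxxox"]
for g, pair in explanation(G, H).items():
    print(g, pair)
```

Every genotype of the example gets a pair, so these 4 haplotypes explain all 5 genotypes.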

We will define some objectives for H

1st Objective (parsimony) (open research problem):

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= 2 SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)

2nd Objective based on inference rule:

Inference Rule

    xoxxooxoxx    known haplotype h
  + **********    unknown haplotype
  = x??xoox?x?    known (ambiguous) genotype g

    xoxxooxoxx    known haplotype h
  + xxoxooxxxo    new (derived) haplotype h’
  = x??xoox?x?    known (ambiguous) genotype g

We write h + h’ = g

g and h must be compatible to derive h’

2nd Objective (Clark, 1990)

1. Start with H = nonambiguous genotypes
2. while exists ambiguous genotype g in G
3.    take h in H compatible with g and let h + h’ = g
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

G = { oooo, xooo, ??oo, xx?? }

- derive xxoo (from ??oo and oooo), then xxxx (from xx?? and xxoo): SUCCESS

- derive oxoo (from ??oo and xooo): FAILURE (can’t resolve xx??)
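The effect of the non-deterministic step can be reproduced with a short Python sketch of the inference rule. It is a minimal illustration, not Clark's original implementation: the `choose` callback, which picks one applicable (genotype, haplotype) pair, is an assumption introduced here to make the choice explicit, and the loop stops as soon as no rule application is possible.

```python
def derive(h, g):
    """Return h' with h + h' = g, or None if h is not compatible with g."""
    out = []
    for a, b in zip(h, g):
        if b == "?":
            out.append("x" if a == "o" else "o")  # heterozygous: take the other allele
        elif a == b:
            out.append(a)                          # homozygous: h must agree with g
        else:
            return None
    return "".join(out)


def clark(genotypes, choose):
    """Apply the inference rule until no ambiguous genotype can be resolved."""
    H = [g for g in genotypes if "?" not in g]
    todo = [g for g in genotypes if "?" in g]
    while todo:
        options = [(g, h) for g in todo for h in H if derive(h, g) is not None]
        if not options:
            break
        g, h = choose(options)
        H.append(derive(h, g))
        todo.remove(g)
    return H, todo


G = ["oooo", "xooo", "??oo", "xx??"]
print(clark(G, lambda opts: opts[0]))   # (['oooo', 'xooo', 'xxoo', 'xxxx'], [])
print(clark(G, lambda opts: opts[-1]))  # (['oooo', 'xooo', 'oxoo'], ['xx??'])
```

The first choice resolves everything, the second leaves xx?? unresolved, matching the SUCCESS and FAILURE outcomes of the example.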

OBJ: find the order of application of the rule that leaves the fewest elements in G

- Problem is APX-hard (Gusfield, 00)

- Graph-Model + Integer Programming for practical solution (Gusfield, 01)

1. expand genotypes

2. create (h, h’) if there exists g s.t. h’ can be derived from g and h

3. Largest number of nodes in a forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved

Hence, find the largest number of nodes in a forest rooted at unambiguous genotypes. Use I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

3rd Objective (open research problem) Disease Detection:

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }

  genotype   haplotypes
  xx??x      xxoxx + xxxox
  ????x      oooxx + xxxox
  ?x??x      xxoxx + oxxox
  ??oxx      xxoxx + oooxx
  oooxx      oooxx + oooxx

H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy individual none

minimize |H|

Genome Rearrangements and Evolutionary Distances

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………………ttgatggattga

tcgattatggat………………ttttgatatcca

Genomes evolve by means of

• Insertions
• Deletions
• Inversions
• Transpositions
• Translocations

of DNA regions

Combinatorial problem: given 2 permutations P, Q and operators in a set F, find a shortest sequence f1, .., fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…)

Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)

We focus on inversions, that are the most important in Nature

Example:

  5 6 4 8 3 2 1 9 7
  1 2 3 8 4 6 5 9 7
  1 2 3 8 4 5 6 9 7
  1 2 3 6 5 4 8 9 7
  1 2 3 6 5 4 8 7 9
  1 2 3 4 5 6 8 7 9
  1 2 3 4 5 6 7 8 9

There is also a SIGNED VERSION of the problem!

Example:

  +5 +6 -4 -8 -3 -2 -1 -9 +7
  +1 +2 +3 +8 +4 -6 -5 -9 +7
  +1 +2 +3 +8 +4 +5 +6 -9 +7
  +1 +2 +3 -6 -5 -4 -8 -9 +7
  +1 +2 +3 -6 -5 -4 -8 -7 +9
  +1 +2 +3 +4 +5 +6 -8 -7 +9
  +1 +2 +3 +4 +5 +6 +7 +8 +9
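Both example sequences use the same six inversions, which a few lines of Python can replay. The function `invert` and the list `steps` (0-based, inclusive positions read off the unsigned example) are illustrative names for this sketch; in the signed version an inversion also flips the sign of every element it moves.

```python
def invert(perm, i, j, signed=False):
    """Reverse perm[i..j] (0-based, inclusive); flip signs in the signed version."""
    block = perm[i:j + 1][::-1]
    if signed:
        block = [-x for x in block]
    return perm[:i] + block + perm[j + 1:]


# Positions of the six inversions used in the examples.
steps = [(0, 6), (5, 6), (3, 6), (7, 8), (3, 5), (6, 7)]

p = [5, 6, 4, 8, 3, 2, 1, 9, 7]
s = [5, 6, -4, -8, -3, -2, -1, -9, 7]
for i, j in steps:
    p = invert(p, i, j)
    s = invert(s, i, j, signed=True)

print(p)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(s)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This only replays the example: six inversions sort both permutations, but nothing here shows that six is the minimum.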

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

π = 5 7 8 2 1 4 3 6 9

Breakpoint at position i if |π(i) - π(i+1)| > 1 (positions i = 0,..,n, with π(0) = 0 and π(n+1) = n+1)

d(π) = inversion distance

b(π) = # breakpoints

TRIVIAL BOUND: d(π) >= b(π) / 2

Example: d(π) >= 6 / 2 = 3
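The breakpoint count of the example can be reproduced with a short Python sketch. One assumption is made explicit here: the permutation is framed by 0 on the left and n+1 on the right before counting, which is what yields the 6 breakpoints of the example. The bound holds because a single inversion changes only two adjacencies, so it can remove at most 2 breakpoints.

```python
def breakpoints(perm):
    """Count positions i with |pi(i) - pi(i+1)| > 1 in the framed permutation."""
    framed = [0] + list(perm) + [len(perm) + 1]  # assumed framing: 0 and n+1
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) > 1)


pi = [5, 7, 8, 2, 1, 4, 3, 6, 9]
b = breakpoints(pi)
print(b, b / 2)  # 6 3.0
```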

The Breakpoint Graph

[Figure: breakpoint graph on the nodes 5 7 8 2 1 4 3 6 9 0]

Each node has degree 0, 2 or 4, hence the graph can be decomposed in cycles!

Alternating cycle decomposition

c(π) = max # cycles in alternating decomposition

VERY STRONG BOUND: d(π) >= b(π) - c(π)

Example: c(π) = 2 and d(π) >= 6 - 2 = 4

The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition

A variable x_C for each cycle (exponential # of vars…)

A constraint \sum_{C \ni e} x_C = 1 for each edge e

Objective: maximize \sum_C x_C

PRIMAL

max \sum_C x_C

\sum_{C \ni e} x_C = 1 for all edges e

x_C \in {0,1} for all alt. cycles C

DUAL (of the LP relaxation)

min \sum_e y_e

\sum_{e \in C} y_e >= 1 for all alt. cycles C

y_e \in R for all edges e

Pricing out the cycles for which y*(C) < 1

- Split the graph in two copies

- Connect twins

- A perfect matching corresponds to (a set of) alternating cycles

- The weight of the matching is the y*-weight of the cycles

- Forcing a cycle to use a certain node

- These cycles would not use the same node twice, but with a simple trick this is possible to model (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999, 2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Takes days for n=50

CLN can solve for n=200. Takes a few seconds (say 5) for n=100

NP-hard problem practically solved to optimality!

Statistical view of evolution

• Genomes evolve by random inversions

• It’s like a random walk on a huge graph with a node for each permutation and an edge for each inversion

• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)

• We want to find the most likely number of inversions that lead from (1 2 … n) to π

• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions

Let B(k) be the (r.v.) number of breakpoints after k random inversions from (1 … n)

Given a π obtained by h random inversions from (1 … n) we want to estimate h

The inversion distance is only a lower bound: h >= d(π), but the gap could be big

We estimate E[B(k)]. Then, faced with some π, we pick h such that E[B(h)] is as close as possible to b(π) (maximum likelihood). CL, 2000, have shown:

E[B(k)] = (n - 1) (1 - ((n - 3)/(n - 1))^k)

Question: estimate E[D(k)], where D(k) is the (r.v.) inversion distance after k random inversions

Example: n = 200, k (u.a.r. in 1…n) inversions; k’ is the estimate of k

    k    k’  d(π)  b(π)
    8     8     8    16
   19    19    19    34
   68    67    67    98
   69    73    68   104
   73    79    73   109
   85    91    83   120
   86    85    83   115
   87    90    84   119
  118   117   109   138
  184   184   135   168
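The estimate can be sketched in Python as a direct search over k. The closed form used below, E[B(k)] = (n - 1)(1 - ((n - 3)/(n - 1))^k), is the reading of the formula that agrees with the table rows for n = 200; the function names and the search limit `k_max` are assumptions of this sketch.

```python
def expected_breakpoints(n, k):
    """E[B(k)] = (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)


def estimate_inversions(n, b, k_max=10_000):
    """Pick the k whose expected breakpoint count is closest to the observed b."""
    return min(range(k_max + 1), key=lambda k: abs(expected_breakpoints(n, k) - b))


print(estimate_inversions(200, 16))  # 8
print(estimate_inversions(200, 34))  # 19
```

Both values match the first two rows of the table (b = 16 gives k’ = 8, b = 34 gives k’ = 19). For b close to n - 1 the curve flattens, so the estimate becomes very sensitive to small changes in b.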

Protein Structure Alignments: the Maximum Contact Map Overlap Problem

A Protein is a complex molecule with a primary, linear structure (a sequence of aminoacids) and a 3-Dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock onto it.

Motivation: Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function)

- Search in 3D data bases

- Protein Classification and Evolutionary Studies

Problem: Align two 3D protein structures

Contact Maps

[Figure: an unfolded protein; the folded protein defines contacts; the contact map is a graph]

OBJECTIVE: align 3d folds of proteins = align contact maps

Contact Map Alignments

Non-crossing Alignments

[Figure: Protein 1 and Protein 2 with a non-crossing map of residues in protein 1 and protein 2]

The value of an alignment

[Figure: an alignment with Value = 3]

We want to maximize the value

NP-Hard
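The value of a given alignment is easy to compute even though maximizing it is NP-hard. In the Python sketch below the two contact maps and the alignment are made-up toy data (the figure's proteins are not recoverable from the text), residues are plain integers, and an alignment is a dictionary from residues of protein 1 to residues of protein 2.

```python
def overlap_value(contacts1, contacts2, alignment):
    """Count contacts (i, p) of protein 1 whose image (j, q) is a contact of protein 2."""
    # Non-crossing: aligned residues keep their relative order in both proteins.
    pairs = sorted(alignment.items())
    if any(a[1] >= b[1] for a, b in zip(pairs, pairs[1:])):
        raise ValueError("alignment has crossing lines")
    edges2 = {frozenset(e) for e in contacts2}
    return sum(
        1
        for i, p in contacts1
        if i in alignment and p in alignment
        and frozenset((alignment[i], alignment[p])) in edges2
    )


# Hypothetical toy instance.
contacts1 = [(1, 3), (2, 4), (4, 6)]
contacts2 = [(1, 4), (2, 5), (5, 7)]
alignment = {1: 1, 2: 2, 3: 4, 4: 5, 6: 7}
print(overlap_value(contacts1, contacts2, alignment))  # 3
```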

Integer Programming Formulation

0-1 VARIABLES

y_ef for e and f contacts

CONSTRAINTS

y_ef + y_e’f’ <= 1 for each pair of incompatible sharings

OBJECTIVE

max \sum_{ef} y_ef

Independent Set Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing

• an edge for each pair of incompatible sharings

|Gy| = |E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)

The best exact algorithm for independent set can solve for at most a few hundred nodes

Node to Node Variables

New variables x provide an easy check for the non-crossing conditions

NEW VARIABLES

x_ij for i and j residues

NEW CONSTRAINTS

x_ij + x_i’j’ <= 1 for each pair of crossing lines

y_(ip)(jq) <= x_ij and y_(ip)(jq) <= x_pq

Clique Constraints

Variables x define a graph Gx:

• A node for each line

• An edge between each pair of crossing lines

• Gx is much smaller than Gy

• Gx has nice properties (it’s a perfect graph)

• It’s easier to find large independent sets in Gx

Non-crossing constraints can be extended to

CLIQUE CONSTRAINTS

\sum_{[i,j] \in M} x_ij <= 1

For all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and Gx perfect) imply a strong bound!

Structure of Maximal cliques in Gx

1. Pick two subsets of same size

2. Connect them in a zig-zag fashion

3. Throw in all lines included in a zig or a zag

The result is a maximal clique in Gx

Separation of Clique Inequalities

PROBLEM

There exist exponentially many such cliques (O(2^{2n}) inequalities).

We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*

\sum_{[i,j] \in M} x*_ij > 1

THEOREM

We can find the most violated clique inequality in time O(n^2)

PROOF (sketch)

1) Clique = zigzag path

2) Flip one graph: zigzag becomes left-right

   1 2 3 4 5 6 7 8
   8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr. to find longest path in grid, time O(n^2)

Separation of cliques

- Create n1 x n2 grid

- Orient all edges and give weights

- There is a violated clique iff the longest A,B path has length > 1, with A=(1,n2) and B=(n1,1)

Gx is a Perfect Graph

We show why polynomial separation is possible:

Gx is weakly triangulated (no chordless cycles >= 5 in Gx or in its complement)

=> Gx is perfect (Hayward, 1985)

PROOF (Sketch, for Gx)

[Figure: relative positions of the lines of the cycle]

- L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

- For i=4,5,… Li crosses Li-1 but not L1 => RIGHT(Li, L1)

- We get LEFT(L1, {L3, L4, L5, L6})

- A symmetric argument started at L6, with LEFT(L1, L6), implies LEFT(Li, L6) for i=2,3,4,5

- Then {L3, L4, L5} are between L1 and L6

- But L7 crosses L1 and L6, and so should cross them all!

The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001). It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach, by Caprara and Lancia, based on LAGRANGIAN RELAXATION. Approach borrowed from Quadratic Assignment. With the new approach we can solve important proteins (up to 150 residues/300 contacts)

What about Heuristics? E.g., genetic algorithms…

Genetic Algorithm Overview

• A Population of candidate solutions that evolve (improve) over time

• Recombination creates new candidate solutions via crossover and mutation

[Figure: population at time t, recombination operators, evaluation function, population at time t+1]

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions

  – Select a set of edges in one parent to copy to the child

  – Copy as many edges as possible from the other parent (edges that conflict with existing edges are not copied)

  – Add random edges to fill any remaining space

[Figure: Blue Parent, Red Parent, Offspring]

Mutation

• Mutation introduces small changes to existing solutions by shifting edge endpoints

  – Select a set of endpoints to shift (an edge that “falls off” the end of the contact map is removed)

  – Randomly add new edges

Computational Results

• 269 proteins

  – 70 to 100 residues

  – 80 to 140 contacts

• Picked 10,000 pairs of proteins out of 36046 possible

• Took a weekend on PC

• 500 were solved to optimality

• 2500 had a gap <= 10 contacts

Skolnick Clustering Test

Skolnick Results

• Four Families

1 Flavodoxin-like fold Che-Y related
  • alpha-beta
  • 8 structures
  • up to 124 residues
  • 15-30% sequence similarity
  • < 3Å RMSD

2 Plastocyanin
  • beta
  • 8 structures
  • < 2Å RMSD

3 TIM Barrel
  • alpha-beta
  • 11 structures
  • < 2Å RMSD

4 Ferritin
  • alpha
  • 6 structures
  • < 4Å RMSD

  Family  Style       Residues  Seq. Sim.  RMSD   Proteins
  1       alpha-beta  124       15-30%     < 3Å   1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy
  2       beta        99        35-90%     < 2Å   1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt
  3       alpha-beta  250       30-90%     < 2Å   1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim
  4       alpha       170       7-70%      < 4Å   1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd

Clustering

Define score(P1, P2) as

0 <= (# shared contacts) / (min # of contacts of P1, P2) <= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

If P1, P2 too big, use G.A. and local search to compute score

L.P. then gives bounds:

HEUR score <= OPT score <= LP bound

and we know how far off OPT we are
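The way the two bounds interact with the threshold can be sketched in Python. Everything concrete below is an assumption for illustration: the contact counts are made up, the threshold 0.35 is just a sample value, and `classify` is a hypothetical helper that reports a decision only when the heuristic and LP bounds agree on it.

```python
def score(shared, contacts1, contacts2):
    """Shared contacts normalised by the smaller contact map, a value in [0, 1]."""
    return shared / min(contacts1, contacts2)


def classify(heur_shared, lp_bound, contacts1, contacts2, threshold):
    """Decide family membership from HEUR score <= OPT score <= LP bound."""
    lo = score(heur_shared, contacts1, contacts2)  # lower bound on the OPT score
    hi = score(lp_bound, contacts1, contacts2)     # upper bound on the OPT score
    if lo >= threshold:
        return "MATCH"
    if hi < threshold:
        return "MISMATCH"
    return "UNDECIDED"


print(score(45, 90, 120))                 # 0.5
print(classify(45, 60, 90, 120, 0.35))    # MATCH
print(classify(10, 20, 90, 120, 0.35))    # MISMATCH
print(classify(20, 40, 90, 120, 0.35))    # UNDECIDED
```

Only the undecided pairs need an exact solve, since for the others the bounds already settle which side of the threshold the optimal score falls on.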

Clustering validation

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each other and against the remaining.

TYPICAL BEHAVIOUR

  score   proteins were…
  0.05    MISMATCH
  0.1     MISMATCH
  0.15    MISMATCH
  0.2     MISMATCH
  0.25    MISMATCH
  0.3     MISMATCH
  0.35    MATCH
  …       …
  1.0     MATCH

Skolnick Results

• Performance

  – 528 alignments

  – 1.3% false negative

  – 0.0% false positive

Computed, for 1st time, provably optimal alignments for 150 pairs (inter-family)

Used the CMO value to cluster: retrieves the clusters.

Set S(i,j) = 1 if CMO >= threshold, S(i,j) = 0 otherwise

Use TSP to find a block diagonal structure for S

Last Open Problem
