Fingerprint Clustering with Bounded Number of Missing Values

Page 1: Fingerprint Clustering with Bounded Number of Missing Values

Fingerprint Clustering - CPM 2006

Fingerprint Clustering with Bounded Number of Missing Values

Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri
Università di Milano-Bicocca, Italy

Riccardo Dondi
Università di Bergamo, Italy

Page 2: Fingerprint Clustering with Bounded Number of Missing Values

Talk Outline

• Biological problem and combinatorial problem
• Three versions of the problem:
– Clustering with Missing Values (CMV)
– Inside Edge Clustering (IEC)
– Outside Edge Clustering (OEC)
• Approximation algorithm for IEC and OEC
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV
• APX-hardness of IEC and OEC
• Future work

Page 3: Fingerprint Clustering with Bounded Number of Missing Values

Biological Motivations

Classification of microorganisms:
• A library of rDNA clones (ribosomal RNA gene clones) is created
• A short DNA sequence (a probe) is applied to hybridize with all clones of the library
• After hybridization, unbound probes are removed; the library is analyzed to see how strongly the probe has hybridized to each spot
• The experiment is repeated for a set of probes

Page 4: Fingerprint Clustering with Bounded Number of Missing Values

Biological Motivations

• Fingerprint of a clone: the vector of hybridization intensity values between the clone and each probe
• To classify microorganisms:
– Fingerprints are transformed into binary vectors
– Fingerprints are clustered to infer different properties with respect to the probes

Page 5: Fingerprint Clustering with Bounded Number of Missing Values

Biological Motivations

• Goal: translate hybridization intensity values into binary values 0, 1
• Because some intensity values are ambiguous, it is not always possible to obtain binary vectors
• For each clone we are given a fingerprint over the alphabet {0,1,N}:

• 0 → no hybridization

• 1 → hybridization

• N → unable to determine if a hybridization has happened

Page 6: Fingerprint Clustering with Bounded Number of Missing Values

Clustering of fingerprints – Combinatorial problem

Two fingerprints are compatible iff they agree in every position where both are different from N

Example: two compatible fingerprints:
0 1 0 N N 0 1 0
0 1 N N 1 0 1 0

Two incompatible fingerprints (they disagree in the seventh position):
0 1 0 N N 0 1 0
0 1 N N 1 0 0 0
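The compatibility test can be written in a few lines. This is a minimal sketch, assuming fingerprints are represented as equal-length Python strings over the symbols 0, 1 and N; the function name `compatible` is illustrative and does not come from the talk.

```python
def compatible(a: str, b: str) -> bool:
    """Return True iff a and b agree in every position where neither is N."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # A position is fine if the symbols match or at least one of them is N.
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))

# The two pairs from the slide, written without spaces.
print(compatible("010NN010", "01NN1010"))  # compatible pair
print(compatible("010NN010", "01NN1000"))  # incompatible pair
```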

Page 7: Fingerprint Clustering with Bounded Number of Missing Values

Clustering of fingerprints – Combinatorial problem

Clustering of fingerprints: general formulation

• Input: a set F of fingerprints
• Output: a clustering (partition) C of F such that each cluster of C contains only compatible fingerprints

Page 8: Fingerprint Clustering with Bounded Number of Missing Values

Clustering of fingerprints – Combinatorial problem

An example F:

f1 = 0 1 0 N
f2 = 0 N 0 1
f3 = N 1 0 0
f4 = 1 N N 1

Compatibility: f1 and f2; f1 and f3

Some possible solutions:
– (f1 = 010N, f2 = 0N01), (f3 = N100), (f4 = 1NN1)
– (f1 = 010N, f3 = N100), (f2 = 0N01), (f4 = 1NN1)
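Whether a proposed clustering is a feasible solution can be checked mechanically. The sketch below assumes fingerprints are distinct strings over 0, 1, N and that a clustering is given as a list of lists; `is_valid_clustering` is a hypothetical helper name, not part of the talk.

```python
from itertools import combinations

def compatible(a, b):
    """True iff a and b agree wherever neither symbol is N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))

def is_valid_clustering(fingerprints, clusters):
    """Check that clusters partition the input and are pairwise compatible."""
    flat = [f for cluster in clusters for f in cluster]
    if sorted(flat) != sorted(fingerprints):
        return False  # some fingerprint is missing or repeated
    return all(compatible(a, b)
               for cluster in clusters
               for a, b in combinations(cluster, 2))

F = ["010N", "0N01", "N100", "1NN1"]
print(is_valid_clustering(F, [["010N", "0N01"], ["N100"], ["1NN1"]]))
print(is_valid_clustering(F, [["010N", "N100"], ["0N01"], ["1NN1"]]))
# f2 and f3 disagree in the last position, so they cannot share a cluster.
print(is_valid_clustering(F, [["010N", "0N01", "N100"], ["1NN1"]]))
```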

Page 9: Fingerprint Clustering with Bounded Number of Missing Values

Clustering of fingerprints – Three versions of the problem

Three combinatorial versions of the problem with different objective functions

CMV (Clustering with Missing Values): minimize the number of clusters

IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints

OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters

Page 10: Fingerprint Clustering with Bounded Number of Missing Values

CMV – An example

CMV: minimize the number of clusters
F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4

– A solution: (f1= 01NN, f2= 0NN1), (f3= 0N00), (f4= 00N1) → size 3

– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 2

Page 11: Fingerprint Clustering with Bounded Number of Missing Values

IEC – An example

IEC: maximize the number of co-clustered pairs
F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4

– A solution: (f1= 01NN, f2= 0NN1), (f3= 0N00), (f4= 00N1) → size 1: pair (f1 ,f2) co-clustered

– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 2: pairs (f1 ,f3) and (f2 ,f4) co-clustered

Page 12: Fingerprint Clustering with Bounded Number of Missing Values

OEC – An example

OEC: minimize the number of compatible pairs that are not co-clustered

F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4

– A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 2; pairs (f1, f3) and (f2, f4) not co-clustered

– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 1; pair (f1 ,f2) not co-clustered
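The three objective values for the example instance can be recomputed directly. This is a sketch under the assumption that the clustering passed in is already feasible (every co-clustered pair is compatible); the function name `objectives` and the tuple layout are illustrative choices.

```python
from itertools import combinations

def compatible(a, b):
    """True iff a and b agree wherever neither symbol is N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))

def objectives(fingerprints, clusters):
    """Return (CMV, IEC, OEC) values of a feasible clustering."""
    compatible_pairs = sum(compatible(a, b)
                           for a, b in combinations(fingerprints, 2))
    # In a feasible clustering every pair inside a cluster is compatible.
    inside = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    return len(clusters), inside, compatible_pairs - inside

F = ["01NN", "0NN1", "0N00", "00N1"]
solution = [["01NN", "0NN1"], ["0N00"], ["00N1"]]
optimum = [["01NN", "0N00"], ["0NN1", "00N1"]]
print(objectives(F, solution))  # (3, 1, 2)
print(objectives(F, optimum))   # (2, 2, 1)
```

On this instance one clustering happens to be optimal for all three objectives; the slides do not claim this holds in general.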

Page 13: Fingerprint Clustering with Bounded Number of Missing Values

Parameterized versions

We consider parameterized versions of the problem: number of N’s is our parameter p

CMV(p), IEC(p), OEC(p): the three problems restricted to instances in which each fingerprint has at most p positions with value N.

Page 14: Fingerprint Clustering with Bounded Number of Missing Values

Parameterized versions

Resolution of a fingerprint f: a vector over {0,1} that is compatible with f

Example: f = 01NN10
Possible resolutions:
01 00 10
01 01 10
01 10 10
01 11 10
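Enumerating the resolutions of a fingerprint amounts to filling each N with 0 and with 1. A minimal sketch, assuming string fingerprints; `resolutions` is an illustrative name.

```python
from itertools import product

def resolutions(f):
    """All 0/1 vectors obtained by replacing every N in f with 0 or 1."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]

print(resolutions("01NN10"))
# ['010010', '010110', '011010', '011110']
```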

Page 15: Fingerprint Clustering with Bounded Number of Missing Values

Parameterized versions

For each fingerprint with p N's: $2^p$ possible resolutions

Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S in order to optimize the objective function
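For the CMV objective, the reformulation reads as a covering problem: pick as few resolved vectors as possible so that every fingerprint is resolved by a chosen one. The brute-force sketch below only illustrates that view on tiny inputs (it takes exponential time and is not an algorithm from the talk); it assumes distinct string fingerprints, and `min_clusters` is a hypothetical name.

```python
from itertools import combinations, product

def resolutions(f):
    """All 0/1 vectors obtained by replacing every N in f with 0 or 1."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]

def min_clusters(fingerprints):
    """Smallest tuple of resolved vectors that together resolve every fingerprint."""
    if not fingerprints:
        return ()
    cover = {}  # resolved vector -> set of fingerprints it resolves
    for f in fingerprints:
        for r in resolutions(f):
            cover.setdefault(r, set()).add(f)
    target = set(fingerprints)
    # Try all candidate sets of size 1, 2, ...; size len(target) always works.
    for k in range(1, len(target) + 1):
        for chosen in combinations(sorted(cover), k):
            if set().union(*(cover[r] for r in chosen)) == target:
                return chosen

F = ["01NN", "0NN1", "0N00", "00N1"]
print(min_clusters(F))  # ('0001', '0100'), i.e. two clusters
```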

Page 16: Fingerprint Clustering with Bounded Number of Missing Values

Previous results

CMV(p):
• NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]
• Poly-time for p = 1 [Figueroa et al., J. of Comp. Biology 2004]
• Approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]

IEC(p):
• Approximation algorithm with factor $2^{2p-1}$ for any p = O(log n) [Figueroa et al., CATS 2005]

OEC(p):
• Approximation algorithm with factor $2(1 - 1/2^p)$ for restricted instances [Figueroa et al., CATS 2005]

Page 17: Fingerprint Clustering with Bounded Number of Missing Values

Approximation algorithm for OEC(p) and IEC(p)

Greedy Algorithm:

WHILE (there exists an unassigned fingerprint)
1. Select a resolved vector that resolves the maximum number of unassigned fingerprints, and assign those fingerprints to it
2. Delete the assigned fingerprints
ENDWHILE

• 2-factor approximation ratio for OEC
• ½-factor approximation ratio for IEC
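The greedy loop can be sketched as follows. The assumptions here are not from the talk: fingerprints are distinct strings, candidate resolved vectors are regenerated in every round for simplicity, and ties are broken by picking the lexicographically smallest vector.

```python
from itertools import product

def resolutions(f):
    """All 0/1 vectors obtained by replacing every N in f with 0 or 1."""
    slots = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*slots)]

def greedy(fingerprints):
    """Map each chosen resolved vector to the fingerprints assigned to it."""
    remaining = set(fingerprints)
    clusters = {}
    while remaining:
        cover = {}  # resolved vector -> unassigned fingerprints it resolves
        for f in remaining:
            for r in resolutions(f):
                cover.setdefault(r, set()).add(f)
        # max() keeps the first maximum, so sorting makes ties deterministic.
        best = max(sorted(cover), key=lambda r: len(cover[r]))
        clusters[best] = cover[best]
        remaining -= cover[best]
    return clusters

F = ["01NN", "0NN1", "0N00", "00N1"]
print(greedy(F))
```

With this tie-breaking rule the example instance ends up with two clusters; a different tie-breaking rule may give a worse result, which is what the approximation ratios account for.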

Page 18: Fingerprint Clustering with Bounded Number of Missing Values

A tight example for IEC

f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N

f1 compatible with f2, f2 compatible with f3, f3 compatible with f4

Resolved vectors associated with each compatibility:
r12 = 0001; r23 = 0101; r34 = 0111
Each of these resolved vectors resolves two fingerprints

Page 19: Fingerprint Clustering with Bounded Number of Missing Values

A tight example for IEC

The algorithm chooses one resolved vector, for example r23:
• f2 and f3 are assigned to r23 and deleted
• r12 is chosen, f1 is assigned to it and deleted
• r34 is chosen, f4 is assigned to it and deleted

Number of compatible co-clustered pairs: 1

The optimal solution consists of:
• r12; f1 and f2 are assigned to r12
• r34; f3 and f4 are assigned to r34

Number of compatible co-clustered pairs in the optimal solution: 2
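The two runs can be replayed by fixing the order in which resolved vectors are chosen. The sketch takes r23 = 0101, the only vector that resolves both f2 = 0N01 and f3 = 01N1; the helper name `co_clustered_pairs` and the explicit `chosen` order are assumptions made for illustration.

```python
from math import comb

def compatible(a, b):
    """True iff a and b agree wherever neither symbol is N."""
    return all(x == y or "N" in (x, y) for x, y in zip(a, b))

def co_clustered_pairs(fingerprints, chosen):
    """Assign fingerprints to resolved vectors in the given order; count pairs."""
    remaining = list(fingerprints)
    pairs = 0
    for r in chosen:
        cluster = [f for f in remaining if compatible(f, r)]
        remaining = [f for f in remaining if f not in cluster]
        pairs += comb(len(cluster), 2)
    if remaining:
        raise ValueError("some fingerprints were left unassigned")
    return pairs

F = ["N001", "0N01", "01N1", "011N"]
r12, r23, r34 = "0001", "0101", "0111"
print(co_clustered_pairs(F, [r23, r12, r34]))  # unlucky greedy order: 1
print(co_clustered_pairs(F, [r12, r34]))       # optimal choice: 2
```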

Page 20: Fingerprint Clustering with Bounded Number of Missing Values

A Polynomial Time Algorithm for Restricted CMV

Restricted CMV: for each position j there is at most one fingerprint having value N in the j-th position

An instance of restricted CMV:
f1 = NN 01 01 01; f2 = 01 NN 01 01; f3 = 01 11 NN 01; f4 = 01 11 11 NN
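The restriction is a per-column condition and is easy to test. A minimal sketch, assuming a non-empty list of equal-length string fingerprints; `is_restricted` is an illustrative name.

```python
def is_restricted(fingerprints):
    """True iff every position holds N in at most one fingerprint."""
    length = len(fingerprints[0])
    return all(sum(f[j] == "N" for f in fingerprints) <= 1
               for j in range(length))

instance = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
print(is_restricted(instance))                          # True
print(is_restricted(["01NN", "0NN1", "0N00", "00N1"]))  # False
```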

Page 21: Fingerprint Clustering with Bounded Number of Missing Values

A Polynomial Time Algorithm for Restricted CMV

Two interesting properties of restricted CMV:
1. the interesting resolved vectors are at most $n^2$ (interesting resolved vectors: those that resolve more than one fingerprint);
2. there is a fingerprint (private fingerprint) which is resolved by exactly one interesting resolved vector;

The algorithm at each step selects the interesting resolved vector that resolves a private fingerprint
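One way to see the first property: in a restricted instance two compatible fingerprints never share an N position, so they have a single common resolution, and every interesting resolved vector is the merge of some pair. The sketch below relies on that observation (it is an interpretation, not the talk's stated proof) and lists the interesting vectors and private fingerprints of an instance; `merge`, `interesting_vectors` and `private_fingerprints` are hypothetical names.

```python
from itertools import combinations

def merge(a, b):
    """Unique common resolution of a and b, or None if there is none."""
    out = []
    for x, y in zip(a, b):
        if x == "N" and y == "N":
            return None  # excluded by the restriction
        if "N" not in (x, y) and x != y:
            return None  # incompatible fingerprints
        out.append(y if x == "N" else x)
    return "".join(out)

def interesting_vectors(fingerprints):
    """Map each interesting resolved vector to the fingerprints it resolves."""
    table = {}
    for a, b in combinations(fingerprints, 2):
        r = merge(a, b)
        if r is not None:
            table.setdefault(r, set()).update((a, b))
    return table

def private_fingerprints(fingerprints, table):
    """Fingerprints resolved by exactly one interesting resolved vector."""
    return [f for f in fingerprints
            if sum(f in members for members in table.values()) == 1]

instance = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
table = interesting_vectors(instance)
print(table)
print(private_fingerprints(instance, table))
```

On the instance from the previous slide this yields three interesting vectors, with f1 and f4 as private fingerprints.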

Page 22: Fingerprint Clustering with Bounded Number of Missing Values

APX-hardness of CMV(2)

• L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti et al., TCS 2000])
• G = (V, E) cubic graph → gadget graph GA = (VA, EA)
• For each vi in V define a vertex gadget GVi
• Two possible vertex covers of the gadget:
– type 1: suboptimal
– type 2: optimal

[Figure: vertex gadget GVi]

Page 23: Fingerprint Clustering with Bounded Number of Missing Values

APX-hardness of CMV(2)

• G = (V, E) cubic graph → gadget graph GA = (VA, EA)
• For each edge (vi, vj) in E define the edge gadget EGij

1. Four vertices covered in EGij → GVi and GVj both optimal
2. Two vertices covered in EGij → GVi or GVj suboptimal

Case 2 is always better than case 1

[Figure: vertex gadgets GVi and GVj with edge gadget EGij]

Page 24: Fingerprint Clustering with Bounded Number of Missing Values

APX-hardness of CMV(2)

An instance of CMV(2) is built as follows:
• a resolved vector is built for each vertex of the gadgets
• a fingerprint is built for each edge of the gadgets
• two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex

Page 25: Fingerprint Clustering with Bounded Number of Missing Values

APX-hardness of IEC(2) and OEC(2)

• L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti et al., TCS 2000])
• Similar to the reduction for CMV(2)
• G = (V, E) a cubic graph:
– for each vertex vi in V, a set Fi of 9 fingerprints
– for each edge (vi, vj), a fingerprint fij

Page 26: Fingerprint Clustering with Bounded Number of Missing Values

Open Problems

• Approximation of CMV(p):
– constant factor not dependent on p?
– improve the min(1 + log n, 2 + p log l) approximation factor
• Approximation of IEC(p) and OEC(p):
– improve approximation factors ½ and 2
• Are restricted versions of IEC and OEC in P?

Page 27: Fingerprint Clustering with Bounded Number of Missing Values

Conclusions

• Biological problem and combinatorial problem
• Three versions:
– Clustering with Missing Values (CMV)
– Inside Edge Clustering (IEC)
– Outside Edge Clustering (OEC)
• Approximation algorithms for IEC(p) and OEC(p)
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV(2)
• APX-hardness of IEC(2) and OEC(2)
• Future work