Fingerprint Clustering - CPM 2006
Fingerprint Clustering with Bounded Number of Missing Values
Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri
Università di Milano-Bicocca, Italy
Riccardo Dondi
Università di Bergamo, Italy
Talk Outline
• Biological problem and combinatorial problem
• Three versions of the problem:
  – Clustering with Missing Values (CMV)
  – Inside Edge Clustering (IEC)
  – Outside Edge Clustering (OEC)
• Approximation algorithm for IEC and OEC
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV
• APX-hardness of IEC and OEC
• Future work
Biological Motivations
Classification of microorganisms:
• A library of rDNA clones (ribosomal RNA gene clones) is created
• A short DNA sequence (a probe) is applied to hybridize with all clones of the library
• After hybridization, unbound probes are removed; the library is analyzed to see how strongly the probe has hybridized to each spot
• The experiment is repeated for a set of probes
Biological Motivations
• Fingerprint of a clone: vector consisting of the hybridization intensity values between the clone and each probe
• To classify microorganisms:
  – Fingerprints are transformed into binary vectors
  – Fingerprints are clustered to infer different properties with respect to the probes
Biological Motivations
• Goal: translate hybridization intensity values into binary values 0, 1
• Because some intensity values are ambiguous, it is not always possible to get binary vectors
• For each clone we are given a fingerprint over alphabet {0,1,N}:
  – 0 → no hybridization
  – 1 → hybridization
  – N → unable to determine if a hybridization has happened
Clustering of fingerprints – Combinatorial problem
Two fingerprints are compatible iff they agree in each position where both are different from N
Example:
Two compatible fingerprints:
0 1 0 N N 0 1 0
0 1 N N 1 0 1 0
Two incompatible fingerprints:
0 1 0 N N 0 1 0
0 1 N N 1 0 0 0
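The compatibility test can be sketched in a few lines of Python. This is only an illustration: fingerprints are assumed to be equal-length strings over {0,1,N}, and the function name `compatible` is ours, not from the talk.

```python
def compatible(f: str, g: str) -> bool:
    """Return True if f and g agree wherever neither of them is 'N'."""
    if len(f) != len(g):
        raise ValueError("fingerprints must have the same length")
    # A position is fine if the symbols match or at least one is unknown.
    return all(a == b or a == "N" or b == "N" for a, b in zip(f, g))


# The two pairs from the example above.
print(compatible("010NN010", "01NN1010"))  # True
print(compatible("010NN010", "01NN1000"))  # False (they differ in position 7)
```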
Clustering of fingerprints – Combinatorial problem
Clustering of fingerprints: general formulation
Input: a set F of fingerprints
Output: clustering (partition) C of fingerprints such that each cluster of C contains only compatible fingerprints
Clustering of fingerprints – Combinatorial problem
An example F:
f1 = 010N
f2 = 0N01
f3 = N100
f4 = 1NN1
Compatibility: f1 and f2; f1 and f3
Some possible solutions:
– (f1= 010N, f2= 0N01), (f3= N100), (f4= 1NN1)
– (f1= 010N, f3= N100), (f2= 0N01), (f4= 1NN1)
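As a rough check of the example, the sketch below tests whether a proposed partition is a feasible clustering, that is, whether every cluster contains only pairwise compatible fingerprints. The helper names and the list-of-lists representation of a clustering are assumptions made for illustration.

```python
from itertools import combinations


def compatible(f, g):
    """True if f and g agree wherever neither of them is 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def is_valid_clustering(clusters):
    """True if every cluster holds only pairwise compatible fingerprints."""
    return all(
        compatible(f, g)
        for cluster in clusters
        for f, g in combinations(cluster, 2)
    )


solution_a = [["010N", "0N01"], ["N100"], ["1NN1"]]
solution_b = [["010N", "N100"], ["0N01"], ["1NN1"]]
# f2 = 0N01 and f3 = N100 disagree in the last position, so this one fails.
not_a_solution = [["010N", "0N01", "N100"], ["1NN1"]]

print(is_valid_clustering(solution_a))      # True
print(is_valid_clustering(solution_b))      # True
print(is_valid_clustering(not_a_solution))  # False
```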
Clustering of fingerprints – Three versions of the problem
Three combinatorial versions of the problem with different objective functions
CMV (Clustering with Missing Values): minimize the number of clusters
IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints
OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters
CMV – An example
CMV: minimize number of clusters
F = {f1= 01NN, f2= 0NN1, f3= 0N00, f4= 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
– A solution: (f1= 01NN, f2= 0NN1), (f3= 0N00), (f4= 00N1) → size 3
– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 2
IEC – An example
IEC: maximize the number of co-clustered pairs
F = {f1= 01NN, f2= 0NN1, f3= 0N00, f4= 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
– A solution: (f1= 01NN, f2= 0NN1), (f3= 0N00), (f4= 00N1) → size 1: pair (f1 ,f2) co-clustered
– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 2: pairs (f1 ,f3) and (f2 ,f4) co-clustered
OEC – An example
OEC: minimize the number of compatible pairs that are not co-clustered
F = {f1= 01NN, f2= 0NN1, f3= 0N00, f4= 00N1}
Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
– A solution: (f1= 01NN, f2= 0NN1), (f3= 0N00), (f4= 00N1) → size 2; pairs (f1 ,f3) and (f2 ,f4) not co-clustered
– Optimum: (f1= 01NN, f3= 0N00), (f2= 0NN1, f4= 00N1) → size 1; pair (f1 ,f2) not co-clustered
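The three objective values for the running example can be recomputed with a short sketch. The function `objectives` and the list-of-lists clustering format are illustrative assumptions, and the fingerprints are assumed to be distinct strings.

```python
from itertools import combinations


def compatible(f, g):
    """True if f and g agree wherever neither of them is 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def objectives(fingerprints, clusters):
    """Return the (CMV, IEC, OEC) values of a feasible clustering."""
    cluster_of = {f: i for i, cluster in enumerate(clusters) for f in cluster}
    pairs = [(f, g) for f, g in combinations(fingerprints, 2) if compatible(f, g)]
    # IEC counts compatible pairs kept together, OEC those split apart.
    inside = sum(1 for f, g in pairs if cluster_of[f] == cluster_of[g])
    return len(clusters), inside, len(pairs) - inside


F = ["01NN", "0NN1", "0N00", "00N1"]
a_solution = [["01NN", "0NN1"], ["0N00"], ["00N1"]]
optimum = [["01NN", "0N00"], ["0NN1", "00N1"]]

print(objectives(F, a_solution))  # (3, 1, 2)
print(objectives(F, optimum))     # (2, 2, 1)
```

For any feasible clustering the IEC and OEC values add up to the number of compatible pairs (3 in this instance), so maximizing one is the same as minimizing the other, although the approximation ratios differ.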
Parameterized versions
We consider parameterized versions of the problem: number of N’s is our parameter p
CMV(p), IEC(p), OEC(p) when fingerprints have at most p positions with value N.
Parameterized versions
Resolution of a fingerprint f: a vector over {0,1} that is compatible with f
Example: f = 01NN10
Possible resolutions:
01 00 10
01 01 10
01 10 10
01 11 10
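Enumerating the resolutions of a fingerprint amounts to trying both binary values at every N. A minimal sketch, assuming fingerprints are strings over {0,1,N} (the function name `resolutions` is ours):

```python
from itertools import product


def resolutions(f):
    """List every binary vector compatible with fingerprint f."""
    # Fixed positions offer one choice, each 'N' offers two.
    choices = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*choices)]


print(resolutions("01NN10"))
# ['010010', '010110', '011010', '011110']
```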
Parameterized versions
For each fingerprint with p N’s: 2^p possible resolutions
Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S in order to optimize the objective function
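One way to picture the reformulation is an index from each resolved vector in S to the fingerprints it resolves. A solution then picks one key per fingerprint. The sketch below builds that index for the instance used in the CMV, IEC and OEC examples; the names `resolutions` and `resolved_vector_index` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def resolutions(f):
    """List every binary vector compatible with fingerprint f."""
    choices = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*choices)]


def resolved_vector_index(fingerprints):
    """Map each resolved vector in S to the fingerprints it resolves."""
    index = defaultdict(list)
    for f in fingerprints:
        for r in resolutions(f):
            index[r].append(f)
    return dict(index)


F = ["01NN", "0NN1", "0N00", "00N1"]
S = resolved_vector_index(F)
print(S["0100"])  # ['01NN', '0N00']
print(S["0001"])  # ['0NN1', '00N1']
```

Assigning f1 and f3 to 0100 and f2 and f4 to 0001 gives the two-cluster optimum of the CMV example. The index has at most n·2^p keys, which is why bounding p keeps it manageable.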
Previous results
CMV(p):
• NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]
• Poly-time for p = 1 [Figueroa et al., J of Comp. Biology 2004]
• Approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]
IEC(p):
• Approximation algorithm with factor 2^(2p−1) [Figueroa et al., CATS 2005] for any p = O(log n)
OEC(p):
• Approximation algorithm with factor 2(1-1/2p) for restricted instances [Figueroa et al., CATS 2005]
Approximation algorithm for OEC(p) and IEC(p)
Greedy Algorithm:
WHILE (there exists an unassigned fingerprint)
  1. select a resolved vector that resolves the maximum number of unassigned fingerprints, and assign those fingerprints to it
  2. delete the assigned fingerprints
ENDWHILE
• 2-factor approximation ratio for OEC
• ½-factor approximation ratio for IEC
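The greedy loop can be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: it enumerates all 2^p resolutions explicitly, assumes the fingerprints are distinct strings, and breaks ties by dictionary order, which is an arbitrary choice.

```python
from collections import defaultdict
from itertools import product


def resolutions(f):
    """List every binary vector compatible with fingerprint f."""
    choices = [("0", "1") if c == "N" else (c,) for c in f]
    return ["".join(bits) for bits in product(*choices)]


def greedy_assignment(fingerprints):
    """Map each fingerprint to a resolved vector chosen greedily."""
    resolves = defaultdict(set)
    for f in fingerprints:
        for r in resolutions(f):
            resolves[r].add(f)
    unassigned = set(fingerprints)
    assignment = {}
    while unassigned:
        # Step 1: resolved vector covering the most unassigned fingerprints.
        best = max(resolves, key=lambda r: len(resolves[r] & unassigned))
        for f in resolves[best] & unassigned:
            assignment[f] = best
        # Step 2: delete the fingerprints that were just assigned.
        unassigned -= resolves[best]
    return assignment


F = ["01NN", "0NN1", "0N00", "00N1"]
assignment = greedy_assignment(F)
for f in F:
    print(f, "->", assignment[f])
```

Fingerprints that end up with the same resolved vector form one cluster. Which of several equally good vectors is picked depends on the tie-breaking rule, and the result can change with it.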
A tight example for IEC
f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N
f1 compatible with f2, f2 compatible with f3, f3 compatible with f4
Resolved vectors associated with compatibility:
r12 = 0001; r23 = 0101; r34 = 0111
Each of these resolved vectors resolves two fingerprints
A tight example for IEC
The algorithm chooses one resolved vector, for example r23:
• f2 and f3 are assigned to r23 and deleted
• r12 is chosen, f1 is assigned to it and deleted
• r34 is chosen, f4 is assigned to it and deleted
Number of compatible co-clustered pairs: 1
The optimal solution consists of:
• r12; f1 and f2 are assigned to r12
• r34; f3 and f4 are assigned to r34
Number of compatible co-clustered pairs in the optimal solution: 2
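The two assignments of the tight example can be compared with a small sketch. It takes r12 = 0001, r23 = 0101 (the only vector that resolves both f2 = 0N01 and f3 = 01N1) and r34 = 0111. The helper names are illustrative.

```python
from itertools import combinations


def resolves(r, f):
    """True if the binary vector r is a resolution of fingerprint f."""
    return len(r) == len(f) and all(b == c or c == "N" for b, c in zip(r, f))


def co_clustered_pairs(assignment):
    """Count pairs of fingerprints assigned to the same resolved vector."""
    if not all(resolves(r, f) for f, r in assignment.items()):
        raise ValueError("a fingerprint is assigned to a vector that does not resolve it")
    return sum(
        1 for f, g in combinations(assignment, 2) if assignment[f] == assignment[g]
    )


# Unlucky greedy run: r23 is picked first, so f1 and f4 end up alone.
greedy_worst = {"N001": "0001", "0N01": "0101", "01N1": "0101", "011N": "0111"}
# Optimal solution: r12 and r34.
optimal = {"N001": "0001", "0N01": "0001", "01N1": "0111", "011N": "0111"}

print(co_clustered_pairs(greedy_worst))  # 1
print(co_clustered_pairs(optimal))       # 2
```

The ratio 1/2 between the two values matches the ½ approximation factor stated for IEC.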
A Polynomial Time Algorithm for Restricted CMV
Restricted CMV: for each position j there is at most one fingerprint having a value N in the j-th position
An instance of restricted CMV:
f1 = NN 01 01 01; f2 = 01 NN 01 01; f3 = 01 11 NN 01; f4 = 01 11 11 NN
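Whether an instance satisfies the restriction can be checked column by column. A minimal sketch, assuming equal-length fingerprint strings written without the spaces used on the slide (the name `is_restricted` is ours):

```python
def is_restricted(fingerprints):
    """True if every position holds 'N' in at most one fingerprint."""
    # zip(*fingerprints) yields one tuple of symbols per position.
    return all(column.count("N") <= 1 for column in zip(*fingerprints))


restricted = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
unrestricted = ["01NN", "0NN1", "0N00", "00N1"]  # several N's share a column

print(is_restricted(restricted))    # True
print(is_restricted(unrestricted))  # False
```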
A Polynomial Time Algorithm for Restricted CMV
Two interesting properties of restricted CMV:
1. the interesting resolved vectors are at most n^2 (interesting resolved vectors: resolve more than one fingerprint);
2. there is a fingerprint (private fingerprint) which is resolved by one interesting resolved vector;
The algorithm at each step selects the interesting resolved vector that resolves a private fingerprint
APX-hardness of CMV(2)
L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti et al., TCS 2000])
G=(V, E) cubic graph → graph gadget GA=(VA, EA)
For each vi in V define the following gadget GVi
Two possible vertex covers of the gadget:
– type 1: suboptimal
– type 2: optimal
[Figure: vertex gadget GVi]
APX-hardness of CMV(2)
G=(V, E) cubic graph → graph gadget GA=(VA, EA)
For each edge (vi, vj) in E define the edge gadget EGij
1. Four vertices covered in EGij → GVi and GVj both optimal
2. Two vertices covered in EGij → GVi or GVj suboptimal
Case 2 is always better than case 1
[Figure: edge gadget EGij between GVi and GVj]
APX-hardness of CMV(2)
An instance of CMV(2) is built as follows:
• a resolved vector is built for each vertex of the gadgets
• a fingerprint is built for each edge of the gadgets
• two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex
APX-hardness of IEC(2) and OEC(2)
L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti et al., TCS 2000])
Similar to the reduction for CMV(2)
G=(V,E) a cubic graph:
– for each vertex vi in V, a set Fi of 9 fingerprints
– for each edge (vi, vj), a fingerprint fij
Open Problems
• Approximation of CMV(p):
  – constant factor not dependent on p?
  – improve the min(1 + log n, 2 + p log l) approximation factor
• Approximation of IEC(p) and OEC(p):
  – improve approximation factors ½ and 2
• Are restricted versions of IEC and OEC in P?
Conclusions
• Biological problem and combinatorial problem
• Three versions:
  – Clustering with Missing Values (CMV)
  – Inside Edge Clustering (IEC)
  – Outside Edge Clustering (OEC)
• Approximation algorithms for IEC(p) and OEC(p)
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV(2)
• APX-hardness of IEC(2) and OEC(2)
• Future work