Discovering Substantial Distinctions among Incremental Bi-clusters
Faris Alqadah, Raj Bhatnagar
Computer Science Department, University of Cincinnati
Faris Alqadah Raj Bhatnagar ( Computer Science Department University of Cincinnati )Discovering Substantial Distinctions among Incremental Bi-clusters 1 / 26
Outline
1 Introduction
2 Related Work
3 Problem Model
4 Algorithms: Adapting Prim's Algorithm
5 Experimental Results
Introduction
Bi-Clustering
Bi-clustering in binary data has proven its utility in applications such as bioinformatics, market basket data, and recommender systems
Typically the set of bi-clusters in a data set is large
Increasingly, a lattice structure is used to organize the bi-cluster hierarchy
Many neighboring concepts in the concept lattice differ very slightly and in fact do not reveal much useful information
New discovery task: Find the sets of attributes/objects that most distinguish the bi-clusters from each other
Introduction
Distinguishing sets
Introduce distinguishing sets (dn-sets): the difference between a bi-cluster and an immediate parent in the lattice
Each edge in the lattice corresponds to two distinguishing sets
Key question: Which dn-sets are most significant?
Grow maximum cost spanning tree
Contribution 1: Introduce concept of dn-sets
Contribution 2: Quantitative measure of distinction between incremental bi-clusters
Contribution 3: MIDS algorithm
Introduction
A motivating example
Genes vs transcription factors
Comparing each concept with an immediate parent tells us the difference in activation of genes / TFs
     tf1  tf2  tf3  tf4
g1    1    0    1    1
g2    1    1    1    0
g3    0    1    0    1
g4    1    0    0    0
g5    1    0    0    1
Related Work
Previous Approaches
Emerging patterns
  Only consider the ratio of support between frequent itemsets
  Supervised technique
Contrast sets [Bay, Pazzani]
  Also supervised
  Special case of rule discovery
Closed itemset algorithms [Zaki, Uno, Bian]
  Efficient
  Do not explicitly discover the lattice structure
Problem Model
Challenges
Enumerating bi-clusters and forming lattice is NP-hard
Discover dn-sets during the mining process rather than in a post-processing step
How to quantify distinction
Problem Model
Defining bi-clusters
Data set D = (O, A, R): objects O, attributes A, and a binary relation R between them
For an arbitrary attributeset X ⊆ A, ψ(X) is defined to be {o ∈ O | x R o for all x ∈ X}
Dually defined for objectsets Y ⊆ O: ϕ(Y) = {a ∈ A | a R o for all o ∈ Y}
Definition: A bi-cluster or formal concept of the data set D is a pair <X, Y> with X ⊆ A and Y ⊆ O such that ψ(X) = Y and X = ϕ(Y)
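The two derivation operators and the bi-cluster test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the relation `R` is the gene / transcription-factor table from the motivating example, and the names `psi`, `phi`, and `is_bicluster` are chosen here for illustration only.

```python
# Binary relation from the motivating example: object -> attributes it has.
R = {
    "g1": {"tf1", "tf3", "tf4"},
    "g2": {"tf1", "tf2", "tf3"},
    "g3": {"tf2", "tf4"},
    "g4": {"tf1"},
    "g5": {"tf1", "tf4"},
}
OBJECTS = set(R)
ATTRIBUTES = set().union(*R.values())


def psi(X):
    """Objects related to every attribute in the attributeset X."""
    return {o for o in OBJECTS if set(X) <= R[o]}


def phi(Y):
    """Attributes shared by every object in the objectset Y."""
    return {a for a in ATTRIBUTES if all(a in R[o] for o in Y)}


def is_bicluster(X, Y):
    """True when <X, Y> satisfies psi(X) = Y and X = phi(Y)."""
    return psi(X) == set(Y) and phi(Y) == set(X)
```

For instance, `is_bicluster({"tf2"}, {"g2", "g3"})` holds, while `<{tf1}, {g4}>` is not a bi-cluster because tf1 is also active for g1, g2, and g5.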
Problem Model
Bi-cluster lattice
The set of bi-clusters in a dataset forms a complete lattice ordered by set inclusion
<X1, Y1> is a subconcept of <X2, Y2> provided that Y1 ⊆ Y2 (equivalently X2 ⊆ X1)
Denote this by <X1, Y1> ≤ <X2, Y2>
Bi-cluster C1 is an upper neighbor of C2 (written C1 ≻ C2) if C2 < C1 and there is no bi-cluster C3 fulfilling C2 < C3 < C1
Definition: Given two bi-clusters C1 = <X1, Y1> and C2 = <X2, Y2> such that C1 ≻ C2, the distinguishing objectset between C1 and C2 is Y1 − Y2. The distinguishing attributeset between C1 and C2 is X2 − X1.
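The definition of distinguishing sets amounts to two set differences. The sketch below assumes each bi-cluster is given as an `(attributeset, objectset)` pair of Python sets; the function name `dn_sets` is hypothetical. It checks only that the upper bi-cluster has strictly more objects and strictly fewer attributes, not that the two are immediate neighbors.

```python
def dn_sets(upper, lower):
    """Return (distinguishing objectset, distinguishing attributeset).

    upper = (X1, Y1) and lower = (X2, Y2) with upper above lower,
    i.e. Y2 is a proper subset of Y1 and X1 is a proper subset of X2.
    """
    X1, Y1 = upper
    X2, Y2 = lower
    if not (Y2 < Y1 and X1 < X2):
        raise ValueError("upper must lie strictly above lower in the lattice")
    return Y1 - Y2, X2 - X1


# Two bi-clusters from the gene / transcription-factor example.
upper = ({"tf2"}, {"g2", "g3"})
lower = ({"tf1", "tf2", "tf3"}, {"g2"})
objects_diff, attributes_diff = dn_sets(upper, lower)
```

Here the distinguishing objectset is {g3} and the distinguishing attributeset is {tf1, tf3}.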
Problem Model
Example
Bi-clusters can be viewed as maximal rectangles of 1s under a suitable permutation of rows and columns
     tf1  tf2  tf3  tf4
g1    1    0    1    1
g2    1    1    1    0
g3    0    1    0    1
g4    1    0    0    0
g5    1    0    0    1
Bi-cluster: {g2}, {tf1, tf2, tf3}
Upper neighbor: {g2, g3}, {tf2}
Distinguishing sets: {g3} (objects), {tf1, tf3} (attributes)
Problem Model
Quantifying Distinction
Distinction between C1 and C2 is large in terms of attributes, but not objects
Consider change in both height and width
Starting at the infimum and following any path to the supremum, concepts gradually change shape
Quantify change in shape
Problem Model
Shape index
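Define a shape index α(C) for a concept C = <A, B>
Ratio of height to width:
\[
\alpha(C) = s_1(A,B) =
\begin{cases}
\dfrac{|A|}{|B|} & \text{if } |B| \ge |A| \\[6pt]
\dfrac{|B|}{|A|} & \text{otherwise}
\end{cases}
\]
Area of rectangle:
\[
\alpha(C) = s_2(A,B) = |A| \cdot |B|
\]
Compute the magnitude of change of α between a concept C_i = <A_i, B_i> and one of its upper neighbors C_{i+1} = <A_{i+1}, B_{i+1}> along a path P_n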
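Both shape indices are simple functions of the two set sizes. The following sketch is illustrative only: the function names `s1` and `s2` mirror the slide's symbols, and returning 0.0 when both sets are empty is an assumption made here, since the slide does not cover that case.

```python
def s1(A, B):
    """Ratio shape index: smaller side over larger side, a value in [0, 1]."""
    a, b = len(A), len(B)
    if max(a, b) == 0:
        return 0.0  # assumption: the slide leaves the empty/empty case undefined
    return min(a, b) / max(a, b)


def s2(A, B):
    """Area shape index: |A| * |B|."""
    return len(A) * len(B)


# Shape of the bi-cluster <{tf1, tf2, tf3}, {g2}> from the example.
A, B = {"tf1", "tf2", "tf3"}, {"g2"}
ratio, area = s1(A, B), s2(A, B)
```

Writing the ratio as min/max is equivalent to the two-case definition: when |B| ≥ |A| it gives |A|/|B|, otherwise |B|/|A|.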
Problem Model
Change of shape
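Compute the magnitude of the gradient
\[
\|\nabla s_j(A,B)\| = \sqrt{\left(\frac{\partial s_j}{\partial A}\right)^2 + \left(\frac{\partial s_j}{\partial B}\right)^2}
\]
Compute partial derivatives by forward difference
\[
\frac{\partial s_j}{\partial A_i} \mapsto s_j(A_{i+1}, B_i) - s_j(A_i, B_i)
\]
\[
\frac{\partial s_j}{\partial B_i} \mapsto s_j(A_i, B_{i+1}) - s_j(A_i, B_i)
\]
Definition: Given two concepts C_1 = <A_1, B_1>, C_2 = <A_2, B_2> and a shape metric α = s_j such that (C_1, C_2) ∈ E, the weight of the edge (C_1, C_2) is:
\[
-\sqrt{\left(s_j(A_2, B_1) - s_j(A_1, B_1)\right)^2 + \left(s_j(A_1, B_2) - s_j(A_1, B_1)\right)^2}
\]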
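The edge weight can be computed directly from the forward differences. The sketch below is a hedged illustration rather than the authors' code: concepts are assumed to be `(A, B)` pairs of Python sets, `edge_weight` and `area` are names introduced here, and the area index is used as the shape metric.

```python
import math


def area(A, B):
    """Area shape index s2(A, B) = |A| * |B|."""
    return len(A) * len(B)


def edge_weight(s, c1, c2):
    """Negated gradient magnitude of shape metric s along the edge (c1, c2)."""
    (A1, B1), (A2, B2) = c1, c2
    base = s(A1, B1)
    d_a = s(A2, B1) - base  # forward difference in the first component
    d_b = s(A1, B2) - base  # forward difference in the second component
    return -math.hypot(d_a, d_b)


# Edge from <{tf1, tf2, tf3}, {g2}> to its upper neighbor <{tf2}, {g2, g3}>.
c1 = ({"tf1", "tf2", "tf3"}, {"g2"})
c2 = ({"tf2"}, {"g2", "g3"})
w = edge_weight(area, c1, c2)
```

For this edge the two differences are 1 − 3 = −2 and 6 − 3 = 3, so the weight is −√13. Because weights are negated magnitudes, the smallest weight corresponds to the largest change in shape.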
Algorithms: Adapting Prim's Algorithm
Computational Challenges
Concept lattice not readily available
Compute MCST while computing the concept lattice
Adapting Prim’s Algorithm
Compute concept lattice incrementally (Lindig’s algorithm)
Improve computational efficiency
Algorithms: Adapting Prim's Algorithm
Adapting Prim's Algorithm
Prim's algorithm grows a sequence of trees T_0, T_1, ..., T_{n−1}
T_{i+1} is obtained from T_i by adding a single edge e_{i+1}, i = 0, ..., n − 2
Edge e_{i+1} is selected greedily among all edges having exactly one vertex in T_i and one vertex not in T_i: Cut(T_i)
Intuition: Correspondence between upper neighbors of concepts in T_i and Cut(T_i)
Define: Θ(C, T_i), the set of edges between C and upper neighbors of C that do not appear in T_i
Proposition: Given T_i and T_{i−1}, let e_i = (C_1, C_2) be the edge added to T_{i−1} to form T_i. Then
Cut(T_i) − Cut(T_{i−1}) = Θ(C_2, T_i)
Algorithms: Adapting Prim's Algorithm
Adapting Prim's Algorithm
From the previous proposition we know which edges to add to Cut(T_{i−1}) to form Cut(T_i), but which ones drop out?
If C_2 was just added to T_{i−1} to form T_i, then any edge (C_2, D) in Cut(T_{i−1}) must be removed to form Cut(T_i)
Denote these edges as Λ
Compute Cut as
Cut(T_i) = (Cut(T_{i−1}) − Λ) ∪ Θ
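The update equation maps naturally onto set operations. The following is a minimal sketch under stated assumptions: edges are `(lower, upper)` tuples, `tree` already contains the newly added concept, and `upper_neighbors` is a hypothetical callable supplied by the caller. The four-concept diamond lattice is invented purely to exercise the function.

```python
def update_cut(cut, tree, new_vertex, upper_neighbors):
    """Cut(T_i) = (Cut(T_{i-1}) - Lambda) | Theta for the newly added vertex."""
    # Lambda: old cut edges touching the new vertex; both ends are now in the tree.
    lam = {edge for edge in cut if new_vertex in edge}
    # Theta: edges from the new vertex to upper neighbors not yet in the tree.
    theta = {(new_vertex, d) for d in upper_neighbors(new_vertex) if d not in tree}
    return (cut - lam) | theta


# Hypothetical diamond lattice: bot below a and b, both below top.
UPPER = {"bot": ["a", "b"], "a": ["top"], "b": ["top"], "top": []}

tree = {"bot"}
cut = update_cut(set(), tree, "bot", UPPER.__getitem__)
tree.add("a")  # suppose the edge (bot, a) was chosen greedily
cut_after_a = update_cut(cut, tree, "a", UPPER.__getitem__)
```

After adding `a`, the edge (bot, a) drops out and (a, top) enters, leaving {(bot, b), (a, top)}. Note that this sketch, like the update equation, only ever looks upward from the newly added concept.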
Algorithms: Adapting Prim's Algorithm
Example
Algorithms: Adapting Prim's Algorithm
MIDS Algorithm
1 Choose starting bi-cluster c
2 Compute cut by generating upper neighbors of c and using the update equation
3 Compute weight of edges between c and upper neighbors
4 Greedily choose edge from cut and associated concept d
5 Set c ← d; repeat steps 2-5 until cut is empty
Given concept c, how do we compute its upper neighbors?
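The five steps above can be sketched as a Prim-style loop over a priority queue. This is an illustrative reading of the slide, not the authors' MIDS implementation: `mids`, `upper_neighbors`, and `weight` are names assumed here, the cut is kept in a heap with stale edges discarded lazily instead of removing Λ explicitly, and the diamond lattice with its weights is made up for the demonstration.

```python
import heapq


def mids(start, upper_neighbors, weight):
    """Grow a spanning tree upward from start, always taking the lightest cut edge."""
    tree, tree_edges, heap = {start}, [], []
    counter = 0  # tie-breaker so the heap never has to compare concepts
    c = start
    while True:
        # Steps 2-3: add edges from c to its unseen upper neighbors, with weights.
        for d in upper_neighbors(c):
            if d not in tree:
                heapq.heappush(heap, (weight(c, d), counter, c, d))
                counter += 1
        # Drop edges whose upper end has already joined the tree (the set Lambda).
        while heap and heap[0][3] in tree:
            heapq.heappop(heap)
        if not heap:
            return tree_edges
        # Step 4: greedy choice; step 5: continue from the new concept.
        w, _, u, d = heapq.heappop(heap)
        tree.add(d)
        tree_edges.append((u, d, w))
        c = d


# Hypothetical lattice and (negated) edge weights.
UPPER = {"bot": ["a", "b"], "a": ["top"], "b": ["top"], "top": []}
WEIGHTS = {("bot", "a"): -1.0, ("bot", "b"): -3.0,
           ("a", "top"): -2.0, ("b", "top"): -0.5}

edges = mids("bot", UPPER.__getitem__, lambda u, d: WEIGHTS[(u, d)])
```

Because edge weights are negated gradient magnitudes, a min-heap selects the edge with the largest change in shape first, which is what growing a maximum cost spanning tree requires.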
Algorithms: Adapting Prim's Algorithm
Lindig's Theorem
Given concept <X, Y>, all concepts greater can be computed as
S = {<ϕ(Y ∪ {o}), ⊕(Y ∪ {o})> | o ∈ O − Y}
where ⊕(Y ∪ {o}) = ψ(ϕ(Y ∪ {o}))
Which concepts in S are upper neighbors?
Theorem: Let C = <X, Y> be a concept in D = (O, A, R). Then ⊕(Y ∪ {o}), where o ∈ O − Y, is the objectset of an upper neighbor of C if and only if for all z ∈ ⊕(Y ∪ {o}) − Y the following holds: ⊕(Y ∪ {z}) = ⊕(Y ∪ {o})
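The theorem gives a direct generate-and-test procedure for upper neighbors. The sketch below applies it to the gene / transcription-factor table from the motivating example; it is a straightforward, unoptimized illustration, and the names `closure` and `upper_neighbors` are introduced here rather than taken from the source.

```python
# Binary relation from the motivating example: object -> attributes it has.
R = {
    "g1": {"tf1", "tf3", "tf4"},
    "g2": {"tf1", "tf2", "tf3"},
    "g3": {"tf2", "tf4"},
    "g4": {"tf1"},
    "g5": {"tf1", "tf4"},
}
OBJECTS = frozenset(R)
ATTRIBUTES = frozenset().union(*R.values())


def phi(Y):
    """Attributes shared by every object in Y."""
    return frozenset(a for a in ATTRIBUTES if all(a in R[o] for o in Y))


def psi(X):
    """Objects that have every attribute in X."""
    return frozenset(o for o in OBJECTS if X <= R[o])


def closure(Y):
    """The closure operator written as a circled plus on the slide: psi(phi(Y))."""
    return psi(phi(Y))


def upper_neighbors(Y):
    """Objectsets of the upper neighbors of the concept with objectset Y."""
    Y = frozenset(Y)
    found = set()
    for o in OBJECTS - Y:
        candidate = closure(Y | {o})
        # Theorem's test: every newly included object must generate the same closure.
        if all(closure(Y | {z}) == candidate for z in candidate - Y):
            found.add(candidate)
    return found
```

For the concept with objectset {g2}, adding g4 or g5 yields the closure {g1, g2, g4, g5}, which fails the test because g1 alone generates the smaller closure {g1, g2}; the upper neighbors are therefore {g1, g2} and {g2, g3}.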
Algorithms: Adapting Prim's Algorithm
Applying Lindig's Theorem
Straightforward application of Lindig's theorem results in an O(|O|^2 |A|) method for computing upper neighbors
Generate-and-test strategy
Improved the practical running time of Lindig's algorithm
Theoretical complexity remains the same
MIDS algorithm complexity: O(|E| log N + N |O|^2 |A|)
Experimental Results
Computation Time
Compared computation time to CHARM-L with a post-processing step
Experimental Results
Synthetic Data
Generated 3 synthetic data sets
Planted dense regions and noise
Experimental Results
Real Data
Mushrooms and Congress datasets
Output top dn-sets
Mushrooms:
Distinguishing attribute(s) | Lower and upper neighbors | Class distribution
Chocolate spore print color | free gill, close gill spacing, partial veil type, white veil color, one ring | P 58.93 %, E 41.07 %
 | free gill, close gill spacing, partial veil type, white veil color, one ring, chocolate spore print color | P 97.05 %, E 2.94 %
Path habitat | free gill, close gill spacing, partial veil type, white veil color, one ring | P 58.93 %, E 41.07 %
 | free gill, close gill spacing, partial veil type, white veil color, one ring, path habitat | P 91.30 %, E 8.69 %
Brown gill | free gill, partial veil type, white veil color, one ring | P 52.14 %, E 47.86 %
 | free gill, partial veil type, white veil color, one ring, brown gill | P 11.38 %, E 88.62 %
Congress:
Distinguishing attribute(s) | Lower and upper neighbors | Class distribution
NO physician fee freeze | YES religious-groups-in-schools | R 54.77 %, D 45.22 %
 | YES religious-groups-in-schools, NO physician fee freeze | R 0.95 %, D 99.05 %
YES adoption of budget | YES religious-groups-in-schools | R 54.77 %, D 41.07 %
 | YES religious-groups-in-schools, YES adoption of budget | R 14.91 %, D 85.08 %
NO religious groups in school | YES export-administration-act-south-africa | R 35.68 %, D 64.31 %
 | YES export-administration-act-south-africa, NO religious groups in school | R 13.20 %, D 86.79 %
Experimental Results
Conclusion
Introduced the concept of distinguishing sets
Method to quantify distinction among bi-clusters
MIDS algorithm to grow a maximum cost spanning tree in the bi-cluster lattice
Experimental Results
Thank you
Questions?