presentation 2009 journal club azhar ali shah

Azhar Ali Shah @ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC), March 20, 2009


Page 1: Presentation 2009 Journal Club Azhar Ali Shah

Azhar Ali Shah

@ Interdisciplinary Optimization and Decision Making Journal Club (IODMJC)

IODMJC, March 20, 2009

Page 2: Presentation 2009 Journal Club Azhar Ali Shah

Azhar A Shah: Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Overview

Introduction
  About the authors
  About the topic
  Hierarchical Clustering
  UPGMA

Research Problem

Methodology
  Suite of algorithms

Results

Observations

Page 3: Presentation 2009 Journal Club Azhar Ali Shah


Introduction: authors

Page 4: Presentation 2009 Journal Club Azhar Ali Shah


Introduction: Hierarchical Clustering

Page 5: Presentation 2009 Journal Club Azhar Ali Shah


Introduction: Hierarchical Clustering

Two fundamental problems in hierarchical clustering:

1. How to determine the similarity between two objects (e.g. proteins, genes)? Calculate the distance between the two objects (e.g. RMSD).

2. How to determine the similarity between two clusters? Choose a linkage: single, complete, or average.
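The three linkage choices can be illustrated with a minimal Python sketch. The function name `linkage_distance` and the toy one-dimensional data are assumptions made for illustration; they do not come from the slides.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters, given a pairwise object distance `dist`."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":    # closest cross-cluster pair
        return min(pairwise)
    if method == "complete":  # farthest cross-cluster pair
        return max(pairwise)
    if method == "average":   # mean over all cross-cluster pairs (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Toy objects (hypothetical): points on a line, absolute difference as distance.
def toy_dist(x, y):
    return abs(x - y)


left, right = [0.0, 1.0], [4.0, 6.0]
single = linkage_distance(left, right, toy_dist, "single")
complete = linkage_distance(left, right, toy_dist, "complete")
average = linkage_distance(left, right, toy_dist, "average")
```

For proteins, `dist` would be replaced by whatever object-level measure is in use (e.g. RMSD or a sequence-similarity score); only the aggregation step differs between the three linkages.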

Page 6: Presentation 2009 Journal Club Azhar Ali Shah


Introduction: about the topic

There is no guideline for selecting the best linkage method. In practice, people almost always use average linkage.

UPGMA (Unweighted Pair Group Method using arithmetic Averages)

Scalable to large datasets as it requires only O(1) edges in memory.

BUT highly susceptible to outliers!

Page 7: Presentation 2009 Journal Club Azhar Ali Shah

Introduction: UPGMA

Input format:

Three fields per line: cluster_id1 cluster_id2 distance

Assumptions on input:

Cluster IDs are integers > 0

No self edges, i.e. cluster_id1 == cluster_id2 is illegal

No repeated edges, i.e. if i<->j exists then there is no j<->i

Output format:

Four fields per line: cluster_id1 cluster_id2 distance cluster_id3

cluster_id1 and cluster_id2 identify the pair of merged clusters, while cluster_id3 is an identifier for the new cluster (their union).
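The input assumptions listed above can be checked mechanically. The following is a minimal sketch of such a reader in Python; `parse_upgma_input` is a hypothetical helper written for illustration, not part of the tool described on the slide.

```python
def parse_upgma_input(lines):
    """Parse 'cluster_id1 cluster_id2 distance' lines into a list of edges.

    Enforces the stated assumptions: positive integer IDs, no self edges,
    and no repeated edges (i<->j and j<->i count as the same edge).
    """
    edges, seen = [], set()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        i, j, d = int(fields[0]), int(fields[1]), float(fields[2])
        if i <= 0 or j <= 0:
            raise ValueError(f"line {line_no}: cluster IDs must be > 0")
        if i == j:
            raise ValueError(f"line {line_no}: self edge {i}<->{j}")
        key = (min(i, j), max(i, j))
        if key in seen:
            raise ValueError(f"line {line_no}: repeated edge {i}<->{j}")
        seen.add(key)
        edges.append((i, j, d))
    return edges


parsed = parse_upgma_input(["1 2 1e-100", "", "11 12 1e+01"])
```

Distances are read with `float`, so scientific notation such as `1e-100` or `1e+01` is accepted as written.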

Page 8: Presentation 2009 Journal Club Azhar Ali Shah

Introduction: UPGMA -Sparse input

UPGMA-input

1 2 1e-100

1 3 1e-40

1 4 2e-40

2 3 1e-80

2 4 1e-50

3 4 4e-10

11 12 1e+01

11 13 11

12 13 12

12 14 20

13 14 30

21 22 50

22 23 70

1 23 90

N=11 input singletons (vertices): {1,2,3,4,11,12,13,14,21,22,23}, and 14 edges in the sparse input.

The input is considered sparse since not all pairs are given, e.g. there is no edge between 1 and 22.

Clusters 1,2,3,4 form a clique A.

Clusters 11,12,13,14 are missing only the edge <11,14> to form a clique B.

Clusters 21,22,23 are loosely connected to each other and to clique A (through the edge <1,23>).

In total there are two connected components in the input graph: {1,2,3,4,21,22,23} (producing 6 merges for 7 vertices) and {11,12,13,14} (producing 3 merges for 4 vertices). The output is therefore a forest of two disjoint trees with 9 merges, rather than the full tree of N-1=10 merges.

UPGMA-tree

1 2 1e-100 24

3 24 5e-41 25

4 25 1.33e-10 26

11 12 10 27

13 27 11.5 28

21 22 50 29

14 28 50 30

23 29 85 31

26 31 99.167 32
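The merge distances in the UPGMA-tree listing are consistent with treating every missing pair as having a fixed distance of 100: for example, 14 28 50 equals (20 + 30 + 100)/3. That value is not stated on the slide; it is inferred here from the listed numbers. Under that assumption, a naive Python sketch of sparse average linkage gives the same merge distances. The function `sparse_upgma`, its parameter `psi`, and the quadratic search are illustrative choices, not the authors' implementation.

```python
def sparse_upgma(edges, psi):
    """Naive average-linkage clustering over a sparse edge list.

    Pairs without an input edge are ASSUMED to have distance `psi`.
    Returns merges as (cluster_id1, cluster_id2, distance, new_cluster_id).
    """
    size = {}   # cluster id -> number of singletons it contains
    thick = {}  # frozenset({a, b}) -> (sum of known distances, count of known pairs)
    for a, b, d in edges:
        size[a] = size[b] = 1
        thick[frozenset((a, b))] = (d, 1)

    def average(pair):
        a, b = pair
        known_sum, known_count = thick[pair]
        all_pairs = size[a] * size[b]
        return (known_sum + (all_pairs - known_count) * psi) / all_pairs

    merges = []
    next_id = max(size) + 1
    while thick:  # stops when no edge connects the remaining clusters
        best = min(thick, key=average)
        distance = average(best)
        a, b = sorted(best)
        combined = {}  # neighbour id -> merged (sum, count) of its two thick edges
        for pair in [p for p in thick if a in p or b in p]:
            known_sum, known_count = thick.pop(pair)
            if pair == best:
                continue
            (other,) = pair - {a, b}
            prev_sum, prev_count = combined.get(other, (0.0, 0))
            combined[other] = (prev_sum + known_sum, prev_count + known_count)
        size[next_id] = size.pop(a) + size.pop(b)
        for other, sum_count in combined.items():
            thick[frozenset((other, next_id))] = sum_count
        merges.append((a, b, distance, next_id))
        next_id += 1
    return merges


# The 14 edges of the UPGMA-input listing.
SLIDE_EDGES = [
    (1, 2, 1e-100), (1, 3, 1e-40), (1, 4, 2e-40), (2, 3, 1e-80),
    (2, 4, 1e-50), (3, 4, 4e-10), (11, 12, 10.0), (11, 13, 11.0),
    (12, 13, 12.0), (12, 14, 20.0), (13, 14, 30.0), (21, 22, 50.0),
    (22, 23, 70.0), (1, 23, 90.0),
]
merges = sparse_upgma(SLIDE_EDGES, psi=100.0)  # psi=100 is an inferred assumption
```

The two merges at distance 50 are tied, so the order in which they appear (and hence which of them receives ID 29 or 30) depends on tie-breaking and may differ from the listing. Clustering stops after 9 merges because no edge connects the two remaining clusters.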

Page 9: Presentation 2009 Journal Club Azhar Ali Shah


Research Problem: UPGMA

UPGMA requires the entire dissimilarity matrix to be in memory.

This data renders UPGMA impractical.
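A rough back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes 4 bytes per stored dissimilarity, which is an illustrative choice and not a figure from the slides; the protein count is the one reported in the Results.

```python
def dense_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to hold all n*(n-1)/2 pairwise dissimilarities."""
    return n * (n - 1) // 2 * bytes_per_entry


n_proteins = 1_801_506  # UniRef90 proteins reported in the Results
pairs = n_proteins * (n_proteins - 1) // 2
dense_tb = dense_matrix_bytes(n_proteins) / 1e12  # terabytes, 4-byte entries assumed
```

With these assumptions the full matrix holds on the order of 10^12 entries and needs several terabytes, far beyond the main memory of a single workstation.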

Page 10: Presentation 2009 Journal Club Azhar Ali Shah


Methodology: 1) Sparse-UPGMA

Sparse-UPGMA cannot cope with huge datasets, where an O(E) memory requirement is intolerable (e.g. Table 1).

UPGMA (mean):

New eq:

Time and memory improvement:

Thick edges (neighbours) are calculated in O(1) time.
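For reference, the textbook UPGMA update after merging clusters $A$ and $B$ is the size-weighted mean below. It is the standard formula, offered as a plausible reading of the "UPGMA (mean)" line; the notation ($d$, $S$, $|A|$) is an assumption and may differ from the slide's.

```latex
d(A \cup B,\, C) \;=\; \frac{|A|\, d(A, C) + |B|\, d(B, C)}{|A| + |B|}

% If each thick edge stores the sum S_{XC} of its known pairwise distances,
% the merged thick edge is obtained in O(1) from its two parts:
S_{(A \cup B)\,C} \;=\; S_{AC} + S_{BC}
```

Storing sums rather than averages is one way to make the constant-time claim concrete: combining two thick edges is a single addition, and the average is recovered by dividing by the number of pairs.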
Page 11: Presentation 2009 Journal Club Azhar Ali Shah

Methodology: 2) Multi-Round MC-UPGMA

Requirements:

A correct clusterer should be mindful of unseen edges (≥ λ, where λ is the maximum of the loaded edges), which can affect clustering before λ is reached.

Such examples are rather prevalent in non-metric datasets, e.g. the case of clustering sequence similarities.

Figure: illustration of non-metric constraints imposed by BLAST sequence similarities (edges). False transitivity is possible due to CSKP_HUMAN.

Page 12: Presentation 2009 Journal Club Azhar Ali Shah


Methodology: 2) Multi-Round MC-UPGMA

Solution: To prevent false clustering of a non-minimal edge, suitable bounds per edge are maintained.

The value of dij is lower (lij) and upper (uij) bounded as:

Page 13: Presentation 2009 Journal Club Azhar Ali Shah


Methodology: 2) Multi-Round MC-UPGMA

1. When Multi-Round MC-UPGMA halts, it is not using its entire memory budget M, since each merge reduces the number of edges in memory.

2. Most of the computation time is spent on preprocessing for the next round of clustering.

Merger: processes the partial clustering and the set of current edges. Edges grow thicker as clusters grow larger. The current set of thick edges is the input to the next round.

The algorithm is the same as Sparse-UPGMA; however, it loads only the M minimal edges in Et. Edges Dij are replaced with intervals [Lij, Uij] to accommodate uncertain edge values (due to the partial set of edges at hand).

Clustering proceeds while a distinctly minimal edge is at hand, i.e. an edge whose upper bound is lower than the lower bound of any other edge in Et. Clustering halts when it is impossible to identify a minimal edge in the entire Et.
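The interval idea in the annotation above can be sketched in Python. The bound formulas are an assumption: every unseen pair inside a thick edge is taken to lie between λ (the largest loaded edge) and a maximal dissimilarity ψ. The names `edge_bounds`, `distinctly_minimal_edge`, `lam` and `psi`, and the example values, are hypothetical.

```python
def edge_bounds(known_sum, known_count, size_a, size_b, lam, psi):
    """Lower and upper bounds on the average distance of a partially loaded thick edge.

    ASSUMPTION: each unseen pair has a distance in [lam, psi].
    """
    all_pairs = size_a * size_b
    unseen = all_pairs - known_count
    lower = (known_sum + unseen * lam) / all_pairs
    upper = (known_sum + unseen * psi) / all_pairs
    return lower, upper


def distinctly_minimal_edge(bounds):
    """Return the edge whose upper bound is below every other edge's lower bound.

    `bounds` maps an edge to its (lower, upper) interval. Returns None when
    no such edge exists, which is the point where a round must halt.
    """
    if not bounds:
        return None
    best = min(bounds, key=lambda edge: bounds[edge][1])
    best_upper = bounds[best][1]
    for edge, (lower, _) in bounds.items():
        if edge != best and lower <= best_upper:
            return None  # intervals overlap: the minimum is uncertain
    return best


lam, psi = 20.0, 100.0  # illustrative values only
intervals = {
    ("A", "B"): edge_bounds(10.0, 1, 1, 1, lam, psi),  # fully known: (10.0, 10.0)
    ("C", "D"): edge_bounds(30.0, 2, 2, 2, lam, psi),  # half known: (17.5, 57.5)
}
safe_merge = distinctly_minimal_edge(intervals)

uncertain = {
    ("C", "D"): edge_bounds(30.0, 2, 2, 2, lam, psi),  # (17.5, 57.5)
    ("E", "F"): edge_bounds(18.0, 1, 1, 2, lam, psi),  # (19.0, 59.0)
}
no_merge = distinctly_minimal_edge(uncertain)
```

In the first case the fully known edge can be merged safely; in the second the two intervals overlap, so no merge is possible until more edges are loaded.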
Page 14: Presentation 2009 Journal Club Azhar Ali Shah


Methodology: 2) Single-Round MC-UPGMA

Requires O(n) memory for holding the forming tree!

Uses freed-up memory to load fresh edges. To accommodate the reloading of old, invalid edges, a new edge representation is introduced.
Page 15: Presentation 2009 Journal Club Azhar Ali Shah

Methodology: 2) Single-Round MC-UPGMA

Page 16: Presentation 2009 Journal Club Azhar Ali Shah

Methods

Clustered data:

UniRef90 (release 8.5), non-redundant, 1.80M sequences

BLAST similarities:

blastp with E=100, run on a MOSIX grid

Reciprocal-BLAST-like setting: each sequence is used both as a query and as a database entry

The directed multigraph (1) is transformed into an undirected graph (2) with symmetric dissimilarities

1. 2.5x10^9 edges (50 GB)

2. 1.5x10^9 edges (30 GB)
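The slide does not say how parallel and opposite-direction hits are combined into a single symmetric dissimilarity. The sketch below assumes the minimum value is kept, which is one common choice and is labeled here as an assumption; `symmetrize` and the example hits are hypothetical.

```python
def symmetrize(directed_hits):
    """Collapse directed (query, target, dissimilarity) hits to one edge per pair.

    ASSUMPTION: when several hits exist for the same unordered pair,
    the smallest dissimilarity is kept.
    """
    undirected = {}
    for query, target, d in directed_hits:
        if query == target:
            continue  # self hits carry no clustering information
        key = (min(query, target), max(query, target))
        undirected[key] = min(d, undirected.get(key, d))
    return undirected


hits = [(1, 2, 1e-30), (2, 1, 1e-28), (1, 2, 1e-35), (3, 3, 0.0), (2, 3, 5.0)]
edges = symmetrize(hits)
```

Keying each edge on the sorted pair of IDs also guarantees the "no repeated edges" property that the UPGMA input format requires.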

Page 17: Presentation 2009 Journal Club Azhar Ali Shah

Methods

Protein family keywords:

InterPro classification is used as a mapping of keywords to protein sequences

Metrics:

Jaccard score
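The Jaccard score between a cluster and the set of proteins annotated with a keyword is the size of their intersection divided by the size of their union. A minimal Python sketch follows; the protein identifiers are made up for illustration.

```python
def jaccard_score(cluster, keyword_proteins):
    """Jaccard score |C ∩ K| / |C ∪ K| between a cluster and a keyword's proteins."""
    cluster, keyword_proteins = set(cluster), set(keyword_proteins)
    union = cluster | keyword_proteins
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(cluster & keyword_proteins) / len(union)


# Hypothetical identifiers: 2 shared proteins out of 4 distinct ones.
score = jaccard_score({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
```

A score of 1 means the cluster contains exactly the proteins carrying the keyword; a score near 0 means the two sets barely overlap.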

Page 18: Presentation 2009 Journal Club Azhar Ali Shah

Results from 1 801 506 UniRef90 proteins.

1107 (0.06%) proteins are singletons having no BLAST similarities.

From the clustered set, 1 791 206 proteins (99.5%) are clustered into a single tree.

1 497 733 of the tree clusters (83.6%) are fully linked, including 426 360 large clusters with at least 10 members.

Page 19: Presentation 2009 Journal Club Azhar Ali Shah

Results

[Figure; labels: Smith–Waterman, BLAST, Sparse UPGMA, with reduced dataset, 220K, 1.80M]

Page 20: Presentation 2009 Journal Club Azhar Ali Shah

Results

200 clustering rounds on a single 4GB memory 4-CPU workstation took about 1-2 days.

Page 21: Presentation 2009 Journal Club Azhar Ali Shah

Results

Page 22: Presentation 2009 Journal Club Azhar Ali Shah

Observations

No detailed discussion on parallelization

No results for Single-Round MC-UPGMA

Page 23: Presentation 2009 Journal Club Azhar Ali Shah


Page 24: Presentation 2009 Journal Club Azhar Ali Shah

Cluster Card Page

Page 25: Presentation 2009 Journal Club Azhar Ali Shah

View Proteins of Cluster

Page 26: Presentation 2009 Journal Club Azhar Ali Shah

Keywords Appearances

Page 27: Presentation 2009 Journal Club Azhar Ali Shah

Cluster Similarity Distribution

Page 28: Presentation 2009 Journal Club Azhar Ali Shah

Similarity matrix for the proteins in this cluster

Page 29: Presentation 2009 Journal Club Azhar Ali Shah
Page 30: Presentation 2009 Journal Club Azhar Ali Shah
Page 31: Presentation 2009 Journal Club Azhar Ali Shah
Page 32: Presentation 2009 Journal Club Azhar Ali Shah