Compositional Mining of Biological Data · people.cs.vt.edu/naren/slides/cdm-talk.pdf

 Compositional Mining of Biological Data Naren Ramakrishnan T.M. Murali Department of Computer Science Virginia Tech, VA 24061

Post on 18-Jan-2020

Compositional Mining of Biological Data

Naren Ramakrishnan
T.M. Murali
Department of Computer Science
Virginia Tech, VA 24061

Motivation

● Increasing categories of functional screens
– Microarrays
– Deletion Mutants
– RNAi

Motivation

● Increasing forms of interaction data
– PPI, ChIP-on-chip, genetic, metabolic, ...

Motivation

● Increasing portfolios of pathways

“Chaining” Inferences

● Module Networks
– Regulators “X” regulate genes “Y” under conditions “Z”

(Segal et al., Nature Genetics, 2003)

“Chaining” Inferences

● Connectivity Map
– Perturbagens “X” mimic/suppress disease “Y” through action of genes “Z”

(Lamb et al., Science, 2006)

Are we there yet?

● Different scientists, different perspectives
– Multitude of approaches to data reduction
● What is needed
– SQL : Database querying :: ??? : Database mining

Compositional Data Mining

● A way to compose simpler algorithms ...
– Redescription mining
– Biclustering
● ... to support complex analytical functions
● Not a data mining program
– But a data mining program generator!

Two simple primitives

● Redescription mining
– Mines within a “domain”
● Biclustering
– Mines across two domains

What are redescriptions?

A shift-of-vocabulary, or a different way of communicating a given piece of information.

Redescriptions: Toy Example

Redescription Mining

● Given
– a collection of objects (countries, genes)
– a collection of descriptors
● Find
– subsets that can be defined in at least two ways
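
The problem statement above can be sketched in a few lines of Python. This is a brute-force illustration only, not the mining algorithm used in CDM: it compares descriptors pairwise and ignores set expressions such as AND or set difference. The function name and the toy gene identifiers are hypothetical.

```python
from itertools import combinations


def exact_redescriptions(descriptors):
    """Return pairs of descriptor names whose object sets are identical."""
    pairs = []
    # Sorting by name keeps the output order deterministic.
    for (name_a, set_a), (name_b, set_b) in combinations(sorted(descriptors.items()), 2):
        if set_a and set_a == set_b:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical toy data: descriptor name -> set of gene identifiers.
toy = {
    "localized_in_mitochondrion": {"g1", "g2", "g3"},
    "up_in_heat_stress": {"g2", "g3", "g4"},
    "cluster_7": {"g1", "g2", "g3"},
}
print(exact_redescriptions(toy))
```

Here the only subset defined in two ways is {g1, g2, g3}, which both "cluster_7" and "localized_in_mitochondrion" describe.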

An example redescription

Countries with land area > 3,000,000 square miles − Tourist destinations in the Americas

⇔

Permanent members of the UN Security Council AND Countries with a history of communism

More on redescriptions

● Can restrict expressions
– To be of a certain syntactic form
● Can allow approximate redescriptions
– Jaccard's coefficient = |X ∩ Y|/|X ∪ Y|
● Can require statistical significance
– According to set overlap distributions
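
A minimal sketch of the two measures above, in Python. The Jaccard function follows the formula on the slide. The significance test is an assumption: the slide says only "set overlap distributions", and the hypergeometric tail used here is one common choice, not necessarily the one used in CDM. Function names are hypothetical.

```python
from math import comb


def jaccard(x, y):
    """Jaccard's coefficient |X ∩ Y| / |X ∪ Y|; taken as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap_p_value(x, y, universe_size):
    """Probability of an overlap at least as large as observed if Y were a
    random subset (of the same size) of a universe with `universe_size` objects.
    Assumes both sets are drawn from that universe."""
    k = len(x & y)
    n_x, n_y = len(x), len(y)
    total = comb(universe_size, n_y)
    tail = sum(
        comb(n_x, i) * comb(universe_size - n_x, n_y - i)
        for i in range(k, min(n_x, n_y) + 1)
    )
    return tail / total


x, y = {1, 2, 3, 4}, {3, 4, 5}
print(jaccard(x, y))              # 0.4
print(overlap_p_value(x, y, 10))  # 40/120, roughly 0.333
```

An approximate redescription would then be accepted when the Jaccard's coefficient exceeds a chosen threshold and the p-value falls below a chosen significance level.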

Applications in Bioinformatics

● (Gene) descriptors galore!
– Genes localized in the mitochondrion
– Genes up-expressed >= 2-fold in heat stress
– Genes encoding proteins in the immunoglobulin complex
– Genes involved in glucose biosynthesis
– Genes handpicked by Prof. Genie
– Genes clustered by your favorite algorithm

Redescriptions: Application to Environment Stress in Yeast

● Descriptors over approx. 300 yeast ORFs

A redescription

What redescriptions offer

● A way to bridge vocabularies
– Uniformity of modeling descriptors
● Conceptual clustering
– Uses one set of descriptors to define another
● Automatic determination of mutually reinforcing features
– Without explicit training data

Biclustering

Simultaneously identify sets of entities from two domains that exhibit concerted behavior.

Biclusters: Toy Example

More on biclusters

● Can mine approximate biclusters
– “Dense” instead of “all 1s”
● Can require statistical significance
– According to set overlap distributions
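
The difference between "all 1s" and "dense" can be illustrated with a short Python sketch. It only checks a candidate bicluster against a density threshold; it does not search for biclusters. The function names, the threshold parameter, and the toy TF-by-gene matrix are hypothetical.

```python
def bicluster_density(matrix, rows, cols):
    """Fraction of 1s in the submatrix selected by `rows` and `cols`."""
    cells = [matrix[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells) if cells else 0.0


def is_bicluster(matrix, rows, cols, min_density=1.0):
    """min_density=1.0 demands an all-1s block; lower values allow 'dense' blocks."""
    return bicluster_density(matrix, rows, cols) >= min_density


# Hypothetical binary relation: rows are TFs, columns are genes.
m = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
print(is_bicluster(m, [0, 1], [0, 1]))                      # exact bicluster
print(is_bicluster(m, [0, 1], [0, 1, 2]))                   # one 0 breaks exactness
print(is_bicluster(m, [0, 1], [0, 1, 2], min_density=0.8))  # accepted as dense (5/6)
```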

Biclustering: Transcriptional regulation in S. cerevisiae

● Two datasets: Growth of S. cerevisiae cells in rich medium and under exposure to rapamycin

● What are the differences between the activated transcriptional regulatory networks under these two conditions?

Computed biclusters

Combinatorial control by RTG3 and GLN3

Recap

● Redescriptions
– Map descriptors within a domain (e.g., genes to genes)
● Biclusters
– Map descriptors across domains (e.g., TFs to genes)
● Key idea: can arbitrarily compose these
– To bridge diverse domains

CDM: Desiccation tolerance in C. elegans

● Question: Find a set of genes to knock down, via RNAi, so as to confer improved desiccation tolerance in C. elegans
● Available data:
– Genes × TFs
– Genes × Phenotypes

CDM: Desiccation tolerance in C. elegans

Two biclusters joined at the Gene interface
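
One way to picture a join at the gene interface is the following Python sketch. It assumes each bicluster is stored as a pair (gene set, other-side set) and that two biclusters are joined when their gene sets agree closely enough under Jaccard's coefficient. The representation, the function names, the threshold, and all gene, TF, and phenotype labels are hypothetical; the slides do not specify how CDM implements the join.

```python
def jaccard(x, y):
    """Jaccard's coefficient of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def join_on_genes(tf_biclusters, phenotype_biclusters, min_jaccard=1.0):
    """Pair biclusters whose gene sets agree, giving (TFs, shared genes, phenotypes)."""
    chains = []
    for genes_a, tfs in tf_biclusters:
        for genes_b, phenotypes in phenotype_biclusters:
            if jaccard(genes_a, genes_b) >= min_jaccard:
                chains.append((tfs, genes_a & genes_b, phenotypes))
    return chains


# Hypothetical biclusters mined from Genes × TFs and Genes × Phenotypes.
tf_side = [({"geneA", "geneB", "geneC"}, {"TF1", "TF2"})]
phenotype_side = [({"geneA", "geneB"}, {"desiccation_tolerant"})]

print(join_on_genes(tf_side, phenotype_side))                   # exact match required: no chain
print(join_on_genes(tf_side, phenotype_side, min_jaccard=0.6))  # gene sets overlap 2/3: one chain
```

The resulting chain relates TFs to phenotypes through the shared genes, which is the kind of connection the desiccation-tolerance question asks for.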

CDM: Aging in worms and flies

● Question: Analyze similarities in gene expression programs underlying aging in C. elegans and D. melanogaster
● Available data:
– Worm age × Worm genes (exp. values)
– Worm genes × Fly genes (homology)
– Fly age × Fly genes (exp. values)

CDM: Aging in worms and flies

Three biclusters related by two redescriptions

CDM Software Architecture

● Data Model Compiler
● Data Mining Plan Generator
● Visualization Interfaces

Data Model Compiler

● From a specification of
– a database schema (SQL DDLs)
● Automatically generate
– a database schema for CDM
– redescription/biclustering algorithm interfaces

Data Mining Plan Generator

● Compile a request
– for connections between biological domains
● Into
– A composition of redescriptions and biclusters
● Research issues
– Set-based versus tuple-based joins
– Hard versus soft joins
– Use “query flocks” to organize related queries

Visualization Interfaces

● Three-tiered interface
– Bicluster level view
– Set view
– Tuple (individual) view

CDM Software Architecture

Case studies

● Storytelling in PubMed abstracts
● Yeast functional genomics
● Small molecule-gene-disease modeling

Biological storytelling

Study metabolic arrest/recovery across organisms of diverse complexity

Storytelling as CDM

● Compose only redescriptions
– No biclusters
● Do not use set constructions
– Just given descriptors
● Goal:
– Relate dissimilar entities through compositions of similarities

Storytelling is sort of like ...

● the MorphWord puzzle
– PURE
– PORE
– POLE
– POLL
– POOL
– WOOL
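
The puzzle analogy can be made concrete with a breadth-first search over words that differ by one letter. This Python sketch is only an illustration of the analogy, restricted to the six words on the slide; the function names are hypothetical, and a real puzzle solver would use a full dictionary.

```python
from collections import deque


def differs_by_one(a, b):
    """True when two equal-length words differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def shortest_chain(start, goal, vocabulary):
    """Breadth-first search for a shortest chain of one-letter changes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for word in sorted(vocabulary):
            if word not in seen and differs_by_one(path[-1], word):
                seen.add(word)
                queue.append(path + [word])
    return None  # no chain exists within this vocabulary


words = {"PURE", "PORE", "POLE", "POLL", "POOL", "WOOL"}
print(shortest_chain("PURE", "WOOL", words))
```

Storytelling replaces words with documents (or descriptors) and the one-letter rule with a similarity test, so that each step is a small shift while the endpoints are far apart.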

Example storytelling task

● Connect
– L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, under communication to BMC Microbiology, 2007.

● To

– M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.

Spinning a story ...

● From
– L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, under communication to BMC Microbiology, 2007.

● To

– L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.

Link: CBS Domains

Spinning a story ...

● From

– L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.

● To

– J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.

Link: Molecular complexes of CBS Domains

Spinning a story ...

● From

– J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.

● To

– C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.

Link: Ligands bound to CBS Domains

Spinning a story ...

● From

– C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.

● To

– M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.

Link: Hydrogen sulphide

Storytelling on System X

● Distributed indexing and similarity search
● Bidirectional pursuing of “leads”
● Simulations for significance testing

Stories about storytelling

Biological storytelling

● Given
– 18 extra-cellular molecules
  ● CD38, CXCL1, IFN-gamma, IGF-1, IL-13, IL-1beta, IL-24, IL-6, IL-8, MMP etc.
– 1 intra-cellular molecule
  ● (poly)ADP-ribose
● Find
– Chains of redescriptions between abstracts discussing these molecules

Biological storytelling

● Document seed set
– Retrieve 203,872 documents
  ● Remove review papers
– Label 4757 documents with molecules (4737+20)
● Document modeling for similarity search
– 96,218 terms after stemming & stopword removal
– Weighted TFIDF (for doc length)
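
The document model above can be sketched as follows. This is an assumption-laden illustration, not the weighting actually used in the study: term frequency is divided by document length (one plausible reading of "weighted ... for doc length"), the inverse document frequency is log(N/df), and similarity is the cosine of the two vectors. Tokens are assumed to be already stemmed and stopword-filtered; the function names and toy abstracts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Dividing by len(doc) keeps long abstracts from dominating.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical, already-tokenized abstracts.
docs = [
    ["cbs", "domain", "ligand"],
    ["cbs", "domain", "transporter"],
    ["hydrogen", "sulfide", "hibernation"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the abstracts share "cbs" and "domain"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```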

Biological storytelling

● Storytelling algorithm tradeoffs
– Higher similarity versus shorter stories
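
The tradeoff can be seen on a toy example: demanding higher similarity between consecutive documents rules out shortcuts and forces a longer story. The Python sketch below uses breadth-first search with Jaccard's coefficient over term sets; the search strategy, the function names, the thresholds, and the three toy documents are all assumptions for illustration and do not describe the actual storytelling algorithm.

```python
from collections import deque


def jaccard(x, y):
    """Jaccard's coefficient of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def story(start, goal, terms, min_sim):
    """Shortest chain of documents in which each consecutive pair has
    similarity >= min_sim, or None if no such chain exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for doc in sorted(terms):
            if doc not in seen and jaccard(terms[path[-1]], terms[doc]) >= min_sim:
                seen.add(doc)
                queue.append(path + [doc])
    return None


# Hypothetical documents described by term sets.
terms = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t2", "t3", "t4", "t5", "t6"},
    "C": {"t4", "t5", "t6", "t7"},
}
print(story("A", "C", terms, min_sim=0.1))  # weak links allowed: A -> C directly
print(story("A", "C", terms, min_sim=0.4))  # stronger links required: A -> B -> C
print(story("A", "C", terms, min_sim=0.6))  # too strict: no story
```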

Biological storytelling

● Basic statistics
– Most popular hub
  ● PubMed ID 8064725: `Altered poly(ADP-ribose) metabolism in family members of patients with systemic lupus erythematosus'
– Second most popular hub
  ● PubMed ID 2684169: `Two types of antibodies inhibiting interleukin-2 production by normal lymphocytes in patients with systemic lupus erythematosus'

Biological storytelling

● Frequent episode mining
– Mining novellas
– e.g., PubMed ID 16430457 -> ... -> 1386861
● Story compression
– Reduce novellas to single symbol
– Identify and remove frequently reused subpaths
● Story summarization
– Tile sentences using sentence cohesion check

The StoryGrapher

Available for demo/download at https://bioinformatics.cs.vt.edu/storytelling/

Sentence-tiled story

Yet to do...

● Model
– Cell types and cell lines
● Account for
– “artificial enrichment” for certain methodologies
● Address
– Author bias
– Messiness of information integration

Status of CDM

● Implemented using open source software
– Parallel implementations of key algorithms and significance calculations
● Many instantiations underway
– VIGEN (Virginia Center for Genomics)
– VBI (Virginia Bioinformatics Institute)

● We welcome collaborations!

Acknowledgements

● BIO faculty
– Rich Helm
– Malcolm Potts
● CS students
– Joe Gresock
– Deept Kumar
– Greg Grothaus
– Srinivas Santhanam
– Mahima Gopalakrishnan
– Anthony McNevin

Thank you!

● Contact info:
– Naren Ramakrishnan, [email protected], http://people.cs.vt.edu/~naren
– T.M. Murali, [email protected], http://people.cs.vt.edu/~murali