Page 1: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Shannon Quinn, CPCB

squinn@cmu.edu | spq1@pitt.edu

November 10, 2011

Page 2: Punchline

Cloud computing for biological and clinical data analysis

Problem: high-dimensional, noisy!

Image credits: heart tissue (biomedcentral), fMRI (wikipedia), segmentation (biodynamics UCSD), tech2date.com

Page 3: Disclaimer

Biology jargon

Academic jargon

Page 4: My Background

2nd-year Ph.D. student in the CPCB program

Research in bioimage informatics

Page 5: My Background

Other

Image credits: http://collegefootballbelt.com/Logos/, http://s3.amazonaws.com/data.tumblr.com/

Page 6: Computational biology and… the cloud?

Biological data
• is BIG
• requires repetitive analysis in chunks
• is modeled with linear algebra and statistics
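To make "repetitive analysis in chunks" concrete, here is a minimal single-machine sketch in Python with NumPy. It is not Mahout code: the function names `map_chunk` and `reduce_stats` and the random data are assumptions for illustration. Each chunk produces partial sums, and the partial sums are combined into a mean and covariance, which is the shape of computation that distributes well.

```python
import numpy as np

def map_chunk(chunk):
    """Per-chunk partial statistics: row count, column sums, and X^T X."""
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk

def reduce_stats(partials):
    """Combine partial statistics into a mean vector and covariance matrix."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    gram = sum(p[2] for p in partials)
    mean = total / n
    # Unbiased covariance recovered from the accumulated sums.
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    return mean, cov

# Hypothetical data: 1000 samples with 5 features, split into 8 chunks.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
chunks = np.array_split(data, 8)
mean, cov = reduce_stats([map_chunk(c) for c in chunks])
```

No chunk needs to see any other chunk, so the map step could run on separate workers and only the small partial sums would travel over the network.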

Page 7: Use case 1: protein behavior

[Figure: timescale of relevant motions, on an axis marked 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^0: bond vibration, side-chain rotation, domain shifts / max. catalysis, protein folding, global conformational shifts]

Detail vs. sampling: a common tradeoff…

Page 8: Molecular dynamics

Page 9: “The curse of [MD] dimensionality”

MD := for every atom, for every t … $F = ma$

Image credits: http://icanhascheezburger.files.wordpress.com/, http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
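The "for every atom, for every t" loop can be sketched with a standard velocity Verlet integrator. This is only an illustration of why MD output grows so quickly (steps × atoms × 3 coordinates). The harmonic force field, the two-atom system, and all names below are assumptions and not part of the talk.

```python
import numpy as np

def harmonic_force(positions, k=1.0):
    """Toy force field (an assumption): each atom is pulled toward the origin."""
    return -k * positions

def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Integrate F = ma for every atom at every time step; return the trajectory."""
    x = positions.astype(float).copy()
    v = velocities.astype(float).copy()
    m = masses[:, None]
    a = force_fn(x) / m                      # a = F / m
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt**2     # position update
        a_new = force_fn(x) / m              # forces at the new positions
        v = v + 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
        trajectory.append(x.copy())
    return np.array(trajectory)              # shape: (n_steps + 1, n_atoms, 3)

# Hypothetical two-atom system starting at rest.
positions = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
velocities = np.zeros_like(positions)
masses = np.array([1.0, 1.0])
trajectory = velocity_verlet(positions, velocities, masses,
                             harmonic_force, dt=0.01, n_steps=1000)
```

Even this toy run stores 1001 × 2 × 3 numbers; a real protein has thousands of atoms and far more time steps.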

Page 10: Pipeline for MD trajectory analysis

Find a “surface” of protein shapes
1. MD output
2. Define surface (graph!)
3. Partition surface

Image credit: http://www.dillgroup.ucsf.edu/

Page 11: Mahout implementation

Defining surface/graph:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)

Partitioning surface/graph:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Kmeans (kmeans)
• . . .
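The "define surface, then partition surface" steps can be sketched on one machine with NumPy. This is a rough stand-in for what jobs such as SpectralKMeans do at scale, not the Mahout API. The Gaussian affinity, the `sigma` value, the farthest-point k-means initialization, and the random "frames" are all assumptions for illustration.

```python
import numpy as np

def affinity_matrix(frames, sigma=1.0):
    """Define the graph: Gaussian affinities between all pairs of frames."""
    sq = np.sum(frames**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * frames @ frames.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def spectral_embedding(W, k):
    """Top-k eigenvectors of the normalized affinity D^-1/2 W D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    U = vecs[:, -k:]
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(points, k, n_iter=50):
    """Partition the graph: plain k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c)**2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Hypothetical stand-in for MD output: 40 frames drawn from two distinct shapes.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.3, size=(20, 3)),
                    rng.normal(10.0, 0.3, size=(20, 3))])
labels = kmeans(spectral_embedding(affinity_matrix(frames), k=2), k=2)
```

The expensive pieces here are the matrix products and the eigendecomposition, which is why the matrix multiplication, transpose, and SVD jobs listed on the slide matter.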

Page 12: MD in Mahout conclusion

MD simulations (x@Home projects)

Existing Mahout functionality

Additional algorithms

Image credit: http://folding.stanford.edu/

Page 13: Use case 2: diseases affecting cilia

What are cilia?
• Hairlike structures
• Keep things moving
• Diseased cilia = [image]

Image credit: http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg

Page 14: Importance of correct diagnoses

Symptoms look familiar

Consequences do not

Page 15: Beat pattern of cilia tells a lot!

Clinicians look at cilia motion in making their diagnoses
1. What is the motion called?
2. Can we create a database of motions?

Page 16: Clinicians’ ultimate goal

Category 1: ?   Category 2: ?   Category 3: ?

Page 17: Cilia as dynamic textures

Computer vision

Saisan et al. 2001

Properties

Page 18: The [proposed] pipeline

Step 1
• Clinician captures video and uploads it

http://googolplex.dyndns.org/cilia/

Page 19: The [proposed] pipeline

Step 2
• Mahout job: autoregressive modeling

Appearance model: $y_t \approx C x_t$

Dynamic model: $x_t \approx A_1 x_{t-1} + \dots$

Image credit: http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg
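One common closed-form way to fit such a model, taken from the dynamic-texture literature, uses an SVD of the video matrix followed by a least-squares fit of the transition matrix. Whether the proposed Mahout job uses exactly this procedure is an assumption. The sketch below fits only the first-order term $A_1$; `learn_ar_model` and the synthetic "video" are hypothetical.

```python
import numpy as np

def learn_ar_model(Y, n_states):
    """Fit y_t ≈ C x_t and x_t ≈ A x_{t-1} from a (pixels x frames) matrix Y."""
    mean = Y.mean(axis=1, keepdims=True)              # temporal mean image
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n_states]                               # appearance model
    X = s[:n_states, None] * Vt[:n_states]            # one hidden state per frame
    # Least-squares fit of the first-order transition matrix.
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, X, mean

# Hypothetical video: 50 pixels, 100 frames, driven by a 2-D rotating state.
theta = 2 * np.pi / 20
A_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
rng = np.random.default_rng(0)
C_true = rng.normal(size=(50, 2))
states = [np.array([1.0, 0.0])]
for _ in range(99):
    states.append(A_true @ states[-1])
Y = C_true @ np.array(states).T                       # pixels x frames
A, C, X, mean = learn_ar_model(Y, n_states=2)
```

The learned `A` is expressed in the basis chosen by the SVD, so it generally differs from `A_true` entry by entry while describing the same dynamics. The SVD and the matrix products are the steps that map onto linear algebra jobs like those named in the talk.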

Page 20: The [proposed] pipeline

Step 3
• Add the transition matrices to cloud library

$A =$ [figure]

Page 21: The [proposed] pipeline

Step 4
• Recompute network with added videos

[Figure: axes labeled Axis 1 and Axis 2, with a “?” marker]

Page 22: One more thing…

What’s really cool about AR models:
• Can you spot the fake?

[Figure: panels labeled Synthetic and Original]
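The "fake" is possible because a fitted AR model is generative: rolling the dynamic model forward and rendering each state through the appearance model yields new frames. A minimal sketch, assuming a first-order model and optional Gaussian state noise; the function name `synthesize` and the tiny 3-pixel example are made up for illustration.

```python
import numpy as np

def synthesize(A, C, mean, x0, n_frames, noise_std=0.0, seed=0):
    """Roll the dynamic model forward and render each state through C."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    frames = []
    for _ in range(n_frames):
        frames.append(C @ x + mean)                       # y_t = C x_t + mean
        x = A @ x + noise_std * rng.normal(size=x.shape)  # x_{t+1} = A x_t + noise
    return np.stack(frames, axis=1)                       # pixels x frames

# Hypothetical model: a 90-degree rotation driving a 3-pixel "video".
A = np.array([[0.0, -1.0], [1.0, 0.0]])
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
video = synthesize(A, C, mean=np.zeros(3), x0=[1.0, 0.0], n_frames=8)
```

With `noise_std=0.0` the output is deterministic; a nonzero value would give a different plausible sequence for each seed.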

Page 23: Mahout implementation

Learning autoregressive models:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)

Comparing autoregressive parameters:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Frobenius norm
• Tensors
• ? ? ?
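As one illustration of comparing autoregressive parameters, the Frobenius norm of the difference between two transition matrices gives a simple pairwise distance. The sketch below is an assumption about how such a comparison might look, with a made-up three-matrix library. It also ignores the fact that the state basis can differ between videos, so it should be read as a starting point only.

```python
import numpy as np

def frobenius_distances(matrices):
    """Pairwise Frobenius-norm distances between equally sized matrices."""
    n = len(matrices)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.linalg.norm(matrices[i] - matrices[j], ord="fro")
    return D

# Hypothetical library of three 2x2 transition matrices.
library = [np.eye(2),
           np.eye(2) + 0.1,
           np.array([[0.0, -1.0], [1.0, 0.0]])]
D = frobenius_distances(library)
```

A distance matrix like `D` could be converted into affinities and handed to a spectral clustering step to group similar motions.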

Page 24: Cilia on Mahout conclusions

Autoregressive modeling uses linear algebra that is already implemented

Maintaining AR library requires new functionality

Mahout framework gives us elbow room

Page 25: Final Thoughts

Biological / biomedical data is large, high-dimensional, and noisy

We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models)

We provide a cloud framework!

Page 26: Research Group

University of Pittsburgh
• Dr. Chakra Chennubhotla Lab (advisor)

CMU@Qatar
• Dr. Majd Sakr Lab (collaborator)

University of Pittsburgh Medical Center
• Dr. Cecilia Lo Lab (collaborator)

Page 27: Sources

Resources
• Apache Mahout
• Spectrally Clustered

Links
• Categorizing ciliary motion defects (BSEC 2011)
• Eigencuts spectral clustering algorithm

Technical report (coming soon!)

Page 28: Contact

Shannon Quinn
• squinn@cmu.edu | spq1@pitt.edu
• http://www.magsolweb.net/

Page 29: Thank you!

Image credit: http://icanhascheezburger.files.wordpress.com/