Juliet Hougland, Data Scientist, Cloudera, at MLconf NYC

Post on 15-Jul-2015

© Cloudera, Inc. All rights reserved.

Juliet Hougland Data Scientist @j_houg

Matrix Decomposition at Scale

The Singular Value Decomposition

SVD is applied everywhere

• Dimensionality Reduction/PCA
• Feature dimension reduction
• Visualization of gene expression data
• Latent Semantic Indexing
• Low Rank Approximations
• Digital Signal Processing

A Global Map of Human Gene Expression. Lukk et al. [1]

Define SVD
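
The transcript keeps only this slide's title. As a reference point, the standard definition is sketched below; this is an assumption about what the slide showed, written with * for the conjugate transpose and r for the rank of M:

```latex
M = U \Sigma V^{*}, \qquad
M \in \mathbb{R}^{m \times n}, \quad
U^{*} U = I_r, \quad
V^{*} V = I_r, \quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \dots \ge \sigma_r > 0

% Keeping only the k largest singular values gives the best rank-k approximation
M_k = U_k \Sigma_k V_k^{*}, \qquad k \le r
```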

Totally awesome LANL video

This doesn’t work on distributed, commodity setups

Good Cluster

Bad Cluster

3 Distributed OSS SVD Implementations

• Mahout: Lanczos
• Mahout: Stochastic
• Spark: Lanczos

Lanczos’ Method

Lanczos’ Method

• Iterative, with the dominant cost a matrix-vector multiply
• Requires at least k iterations to get k singular vectors
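
To make the two points about Lanczos concrete, here is a minimal single-machine sketch in plain Python. It is not the Mahout or Spark code. The names `gram_matvec` and `lanczos` and the small matrix `M` are made up for illustration, and the sketch uses full reorthogonalization for numerical stability, which is a simplification. Each pass through the loop performs one multiply by MᵀM and yields one new basis vector, so k vectors take k iterations.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gram_matvec(M, x):
    """Compute (M^T M) x as M^T (M x), without ever forming M^T M."""
    y = [dot(row, x) for row in M]  # y = M x
    return [sum(row[j] * yi for row, yi in zip(M, y)) for j in range(len(x))]

def lanczos(apply, n, k, seed=0):
    """Run up to k Lanczos steps for a symmetric operator given by `apply`.

    Returns (alphas, betas, basis): the diagonal and off-diagonal of the
    tridiagonal matrix T, plus the orthonormal Lanczos vectors.
    """
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(dot(q, q))
    q = [v / norm for v in q]

    alphas, betas, basis = [], [], []
    for step in range(k):
        w = apply(q)                 # the one matrix-vector multiply per step
        alphas.append(dot(w, q))
        basis.append(q)
        if step == k - 1:
            break
        # Full reorthogonalization against every vector found so far
        # (an assumption of this sketch; it subsumes the three-term recurrence).
        for v in basis:
            c = dot(w, v)
            w = [wi - c * vi for wi, vi in zip(w, v)]
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:             # invariant subspace found, stop early
            break
        betas.append(beta)
        q = [wi / beta for wi in w]
    return alphas, betas, basis

# Hypothetical 4 x 3 example matrix.
M = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

alphas, betas, basis = lanczos(lambda x: gram_matvec(M, x), n=3, k=3)
# With k equal to the number of columns, sum(alphas) should be close to the
# squared Frobenius norm of M, i.e. the sum of its squared singular values.
print(sum(alphas))
```

The eigenvalues of the small tridiagonal matrix built from `alphas` and `betas` approximate the squared singular values of M. In a distributed setting only the multiply inside `apply` touches the full matrix.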

Stochastic SVD

M ≈ QQ*M

• Randomly project original matrix to lower dimensional space
• Factorize the projected matrix
• Unproject

Finding Structure with Randomness. Halko et al. http://bit.ly/19VVRXp
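
The projection step can be sketched on one machine in plain Python. This is a toy illustration of the randomized range finder from Halko et al., not Mahout's implementation. The helper names, the oversampling parameter, and the rank-2 test matrix are all assumptions made for the example. It covers only the first bullet and checks that QQ*M reproduces M. The small matrix B = Q*M is what a dense SVD would then factorize, and multiplying its left singular vectors by Q is the "unproject" step.

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense product of row-major matrices A (m x n) and B (n x p)."""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def orthonormal_columns(Y, tol=1e-10):
    """Modified Gram-Schmidt on the columns of Y; returns Q with orthonormal columns."""
    basis = []
    for col in transpose(Y):
        w = list(col)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol * dot(col, col) ** 0.5:   # drop directions already spanned
            basis.append([wi / norm for wi in w])
    return transpose(basis)

def randomized_range(M, rank, oversample=2, seed=0):
    """Return Q whose columns approximately span the range of M."""
    rng = random.Random(seed)
    n = len(M[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(rank + oversample)]
             for _ in range(n)]
    Y = matmul(M, omega)                        # random projection of M
    return orthonormal_columns(Y)

# Hypothetical 5 x 4 matrix of exact rank 2, built from two outer products.
u1, v1 = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.0, 1.0, 0.0]
u2, v2 = [1.0, -1.0, 1.0, -1.0, 1.0], [0.0, 2.0, 0.0, 1.0]
M = [[a * c + b * d for c, d in zip(v1, v2)] for a, b in zip(u1, u2)]

Q = randomized_range(M, rank=2)
B = matmul(transpose(Q), M)      # small projected matrix, B = Q* M
approx = matmul(Q, B)            # Q Q* M
error = max(abs(a - b) for ra, rb in zip(M, approx) for a, b in zip(ra, rb))
print(len(Q[0]), error)
```

Because the example matrix has exact rank 2, the error is at the level of floating-point round-off. For real data it depends on how fast the singular values decay.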

Frameworks

Mahout
• What I test is written on MapReduce
• Driver programs launch the series of required MapReduce jobs
• Lots of writing intermediate data to disk

Spark
• Using the MLlib component
• Relies on Spark core
• => tries to pin data in memory

Note!

Mahout Scala & Spark bindings are integrated in Mahout. The version 0.10 release next month will move these methods.

The Scala DSL for linear algebra:

val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)

Performance Comparisons

[3]

MapReduce

[4]

Go Bananas tuning!

[5]

My Cluster

6 nodes running CDH 5.3*
Per node: 2 physical cores 24, with hyper threading => 144 total available cores
64 GB Memory
100 TB free in HDFS

*Running Spark 1.3

[6]

What am I factorizing?

[7]

What am I timing?

[8]

Think of the polar bears

[9]

Varying Columns

Varying Rows

Varying Sparsity

Progress in Numerical Computation

[10]

Thanks for the images!

1. Genome PCA: http://bit.ly/1OxXMRy
2. SVD at LANL: http://bit.ly/193IIdY
3. Apples and Oranges: http://bit.ly/1xd1Q4d
4. Sound Board: http://bit.ly/19okavV
5. Bananas: http://bit.ly/1EGxh4p
6. Eniac: http://bit.ly/1F0GOWC
7. Big data pix tumblr: http://bigdatapix.tumblr.com/
8. Watch: http://bit.ly/1FZtIKX
9. Polar Bears: http://bit.ly/1G0gXQw
10. Progress in numerical computing: http://bit.ly/1ID8WR5

Thanks!

juliet@cloudera.com
@j_houg
https://github.com/jhlch/svd-benchmark
