Scaling Multivariate Statistics to Massive Data Algorithmic problems and approaches Alexander Gray Georgia Institute of Technology www.fast-lab.org

Post on 31-Dec-2015

Page 1: Scaling Multivariate Statistics to Massive Data Algorithmic problems and approaches

Scaling Multivariate Statistics to Massive Data

Algorithmic problems and approaches

Alexander Gray

Georgia Institute of Technology

www.fast-lab.org

Page 2

Core methods of statistics / machine learning / mining

1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)

4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

Page 3

Now pretty fast (2011)…

1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

Page 4

Things we made fast

fastest, fastest in some settings

1. Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(N^(log 3))*

3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*

4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N^2)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^(log n))*

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N^(log n))*

Page 5

Core computational problems

What are the basic mathematical operations making things hard?

• Alternative to speeding up each of the 1000s of statistical methods: treat common computational bottlenecks

• Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each

Page 6

The “7 Giants” of data

1. Basic statistics

2. Generalized N-body problems

3. Graph-theoretic problems

4. Linear-algebraic problems

5. Optimizations

6. Integrations

7. Alignment problems

Page 7

The “7 Giants” of data

1. Basic statistics

• e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)

2. Generalized N-body problems

• e.g. nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations

Page 8

The “7 Giants” of data

3. Graph-theoretic problems

• e.g. betweenness centrality, commute distance, graphical model inference

4. Linear-algebraic problems

• e.g. linear algebra, PCA, Gaussian process regression, manifold learning

5. Optimizations

• e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

Page 9

The “7 Giants” of data

6. Integrations

• e.g. Bayesian inference

7. Alignment problems

• e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match

Page 10

Back to our list

basic, N-body, graphs, linear algebra, optimization, integration, alignment

1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N^2), nearest-neighbor O(N), all-nearest-neighbors O(N^2)

2. Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)

3. Regression: linear regression, kernel regression O(N^2), Gaussian process regression O(N^3)

4. Classification: decision tree, nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine O(N^3)

5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)

6. Outlier detection: by density estimation or dimension reduction

7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N^2), hierarchical clustering O(N^3)

8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N^n)

9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models

10. Fusion and matching: sequence alignment, bipartite matching O(N^3), n-point correlation 2-sample testing O(N^n)

Page 11

5 settings

1. “Regular”: batch, in-RAM/core, one CPU

2. Streaming (non-batch)

3. Disk (out-of-core)

4. Distributed: threads/multi-core (shared memory)

5. Distributed: clusters/cloud (distributed memory)

Page 12

4 common data types

1. Vector data, iid

2. Time series

3. Images

4. Graphs

Page 13

3 desiderata

1. Fast experimental runtime/performance*

2. Fast theoretical (provable) runtime/performance*

3. Accuracy guarantees

*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.

Page 14

7 general solution strategies

1. Divide and conquer (indexing structures)

2. Dynamic programming

3. Function transforms

4. Random sampling (Monte Carlo)

5. Non-random sampling (active learning)

6. Parallelism

7. Problem reduction

Page 15

1. Summary statistics

• Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)

• What’s unique/challenges: streaming, new guarantees

• Promising/interesting:

– Sketching approaches

– AD-trees

– MapReduce/Hadoop (Aster, Greenplum, Netezza)
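
As a small illustration of the streaming challenge for summary statistics, here is a minimal sketch of a single-pass mean and variance. It uses Welford's update, which is not one of the approaches named on the slide (sketching, AD-trees, MapReduce); it is swapped in only because it is a compact example of computing a statistic without storing the data. The class name `RunningStats` and the sample values are hypothetical.

```python
class RunningStats:
    """Single-pass (streaming) mean and variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # Constant work and constant memory per arriving value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Hypothetical stream of values, consumed one at a time.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

The state is three numbers regardless of how many values have been seen, which is the property the streaming setting asks for.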

Page 16

2. Generalized N-body problems

• Examples: nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations

• What’s unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)

• Promising/interesting:

– Generalized/higher-order FMM: O(N^2) → O(N)

– Random projections

– GPUs
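
To make the divide-and-conquer idea behind fast N-body methods concrete, the following is a minimal sketch of nearest-neighbor search with a kd-tree, checked against a brute-force scan. It is a plain single-tree search, not the generalized FMM or the dual-tree algorithms the talk refers to, and the function names (`build_kdtree`, `nearest`, `sq_dist`) and the random test data are assumptions made for illustration.

```python
import random


def build_kdtree(points, depth=0):
    """Recursively split on the median along alternating coordinate axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best


random.seed(0)
data = [(random.random(), random.random()) for _ in range(500)]
tree = build_kdtree(data)
query = (0.5, 0.5)
brute = min(data, key=lambda p: sq_dist(p, query))  # O(N) scan for comparison
```

The pruning test is what lets a query skip most of the data in low dimensions; in high dimensions it fires less often, which matches the "general dimension" challenge noted above.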

Page 17

3. Graph-theoretic problems

• Examples: betweenness centrality, commute distance, graphical model inference

• What’s unique/challenges: high interconnectivity (cliques), out-of-core

• Promising/interesting:

– Variational methods

– Stochastic composite likelihood methods

– MapReduce/Hadoop (Facebook, etc)
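
For reference, betweenness centrality (the first example above) can be computed exactly on a small unweighted graph as sketched below. This uses Brandes' algorithm, a standard exact method that is not mentioned on the slide and is not one of the approximate approaches listed as promising; the adjacency-dict input format and the function name `betweenness` are assumptions.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph {node: [neighbors]}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
centrality = betweenness(path)
```

One BFS per source gives O(VE) time on unweighted graphs, which is why very large or densely connected graphs push toward the approximate and distributed approaches listed above.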

Page 18

4. Linear-algebraic problems

• Examples: linear algebra, PCA, Gaussian process regression, manifold learning

• What’s unique/challenges: probabilistic guarantees, kernel matrices

• Promising/interesting:

– Sampling-based methods

– Online methods

– Approximate matrix-vector multiply via N-body
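
The last item can be pictured with a solver that touches the kernel matrix only through a matrix-vector product. Below is a minimal sketch of conjugate gradient written against a `matvec` callback. The callback here is an exact O(N^2) Gaussian kernel summation, so it is only a stand-in for the slot where an approximate N-body summation would go. The function names, bandwidth, ridge term, and data are all hypothetical.

```python
import math


def gaussian_kernel_matvec(points, v, bandwidth=1.0, ridge=1e-3):
    """Compute (K + ridge * I) v exactly in O(N^2) for a Gaussian kernel matrix K."""
    out = []
    for i, xi in enumerate(points):
        total = ridge * v[i]
        for xj, vj in zip(points, v):
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += math.exp(-d2 / (2.0 * bandwidth ** 2)) * vj
        out.append(total)
    return out


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive-definite A, using A only via matvec."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0 initially
    p = list(r)  # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Hypothetical 1-D inputs and right-hand side.
points = [(0.0,), (1.0,), (2.5,), (4.0,)]
b = [1.0, 2.0, 0.5, -1.0]


def kernel_matvec(v):
    return gaussian_kernel_matvec(points, v)


x = conjugate_gradient(kernel_matvec, b)
```

Because the solver never forms the matrix, replacing `kernel_matvec` with a faster approximate summation changes the per-iteration cost without touching the solver itself.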

Page 19

5. Optimizations

• Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing

• What’s unique/challenges: stochastic programming, streaming

• Promising/interesting:

– Reformulations/relaxations of various ML forms

– Online, mini-batch methods

– Parallel online methods

– Submodular functions

– Global optimization (non-convex)
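
As one concrete instance of the online, mini-batch methods listed above, here is a minimal sketch of mini-batch stochastic gradient descent on a one-dimensional least-squares fit. The model, step size, batch size, and synthetic noiseless data are all assumptions chosen to keep the example small; none of them come from the talk.

```python
import random


def minibatch_sgd(xs, ys, batch_size=8, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient of 0.5 * (w * x + b - y)^2 over the mini-batch.
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b


# Hypothetical noiseless data generated from y = 3x + 0.5.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + 0.5 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each update reads only `batch_size` points, so the cost per step does not grow with N; that is the property that makes this family attractive for streaming and very large data.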

Page 20

6. Integrations

• Examples: Bayesian inference

• What’s unique/challenges: general dimension

• Promising/interesting:

– MCMC

– ABC

– Particle filtering

– Adaptive importance sampling, active learning
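
To show what "integration" means computationally in Bayesian inference, the following is a minimal sketch that estimates a posterior mean by self-normalized importance sampling with the prior as the proposal. It is the plain, non-adaptive variant, not the adaptive scheme named above. The toy model (theta ~ N(0, 1), y | theta ~ N(theta, 1)) is an assumption chosen because its exact posterior mean, y / 2, is known and can be used as a check.

```python
import math
import random


def posterior_mean_is(y, n_samples=20000, seed=0):
    """Self-normalized importance sampling estimate of E[theta | y].

    Assumed toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
    """
    rng = random.Random(seed)
    num, den = 0.0, 0.0
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)                 # draw from the prior (the proposal)
        weight = math.exp(-0.5 * (y - theta) ** 2)  # unnormalized likelihood
        num += weight * theta
        den += weight
    return num / den


estimate = posterior_mean_is(y=1.0)
exact = 0.5  # conjugate-Gaussian result: posterior mean is y / 2
```

In one dimension the prior is a workable proposal; as the dimension grows the weights concentrate on very few samples, which is the "general dimension" challenge and the motivation for adaptive proposals.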

Page 21

7. Alignments

• Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match

• What’s unique/challenges: greater heterogeneity, measurement errors

• Promising/interesting:

– Probabilistic representations

– Reductions to generalized N-body problems
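
For a sense of the baseline cost in alignment problems, here is a minimal sketch of global sequence alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). This is an exact quadratic-time method, not BLAST, which is a heuristic; the scoring parameters (`match`, `mismatch`, `gap`) and the function name are assumptions for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b in O(len(a) * len(b)) time."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align a[i-1] with b[j-1], gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


same = alignment_score("GATTACA", "GATTACA")
one_gap = alignment_score("ACGT", "AGT")
```

Only two rows of the table are kept, so memory is linear in the shorter dimension while time stays quadratic, which is the cost that heuristics and reductions aim to avoid on large collections.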

Page 22

Reductions/transformations between problems

• Gaussian graphical models → linear algebra

• Bayesian integration → MAP optimization

• Euclidean graphs → N-body problems

• Linear algebra on kernel matrices → N-body inside conjugate gradient

• Can featurize a graph or any other structure → matrix-based ML problem

• Create new ML methods with different computational properties

Page 23

General conclusions

• Algorithms can dramatically change the runtime order, e.g. O(N^2) to O(N)

• High dimensionality is a persistent challenge

• The non-default (e.g. streaming, disk…) settings need more research work

• Systems issues need more work, e.g. connection to data storage/management

• Hadoop does not solve everything

Page 24

General conclusions

• No general theory for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc)

• More aspects of hardness (statistical and computational) are needed