Towards a Comprehensive Machine Learning Benchmark

Dr. Amitai Armon

Data Science Lead, Advanced Analytics, Intel

How Machine Learning Helps Intel

• Improving processes: Design, Manufacturing, Marketing & Sales
• Driving new offerings: Parkinson’s Disease Monitoring, Cloud Analytics Platform

How Can Intel Help Machine Learning?

PC World, May 2015

How Can We Tell What Should Be Improved?

Many algorithms, many data types, constantly evolving…

A Machine Learning Benchmark is Obviously a Must

… but how can it incorporate the diversity of this domain and the ongoing and future changes?

Our Basic Approach: Cover the Building Blocks

We observed that the various Machine Learning algorithms are composed of several types of building blocks. These building blocks are what hardware and software should handle well, so they are what the benchmark should cover.

The Machine Learning Building Block Types

• ML basic building blocks
1. Linear Algebra
2. Measures
3. Special Functions
4. Mathematical Optimization
5. Data Characteristics
6. Data-dependent Compute
7. Memory Access
8. Very large models
9. Hybrid Methods

• ML Meta building blocks
1. Learning Protocols
2. Learning Phases
3. Algorithmic Flow and Structure

Group labels shown on the slide: Compute, Data, Compute - Data Interplay, Process

Machine Learning Building Blocks: Example


Linear Algebra
• GEMM
• Quadratic Form
• Commonly used Algorithms
  • Inversion
  • Matrix Factorization
  • Eigendecomposition
  • Singular Value Decomposition (SVD)
• Need to support both Dense and Sparse
• Special Matrices of interest
  • Symmetric – Covariance, Kernel
  • Stochastic – Row elements sum to 1
  • Boolean
  • Diagonal
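As a rough illustration of the first two items, the sketch below spells out GEMM in its usual BLAS form (C ← αAB + βC) and a quadratic form xᵀAx on small dense matrices, plus a check for the row-stochastic property mentioned above. The function names are hypothetical and the pure-Python loops are for readability only; a real workload would call an optimized library.

```python
from typing import List

Matrix = List[List[float]]


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c for dense row-major matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "columns of a must match rows of b"
    assert len(c) == rows and all(len(row) == cols for row in c)
    return [
        [alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


def quadratic_form(x: List[float], a: Matrix) -> float:
    """Return x^T a x for a square matrix a."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


def is_row_stochastic(a: Matrix, tol: float = 1e-9) -> bool:
    """True if every entry is non-negative and every row sums to 1."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in a)
```

The same three operations on sparse inputs would use a compressed representation instead of nested lists, which is why dense and sparse variants stress hardware differently.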

Machine Learning Building Blocks: Example



Data Characteristics
• Type and Format
  • Numeric/Categorical – 16b, 32b, 64b
• Sparse and Dense
  • Typical sizes, Sparsity structure
• Distribution
  • Univariate, Dependency Structure, Mixture, Even/Biased Class, Separability

Table (cell marks not preserved): usages as rows, data characteristics as columns
• Rows (Usages): Advertising, SNA, Clinical, Genomics, Telco, IoT
• # of features: Small, Mid, Large
• Feature types: Categorical, Numeric, Time Series
• Sparse/Dense: Sparse, Dense

Example: Mapping Algorithms to Building Blocks

Table columns (algorithms): PCA, Decision Tree, Deep Learning - CNN, Apriori, Adaboost

Table rows (cell entries in slide order, separated by "|"; the column alignment was not preserved):
• Linear Algebra: GEMM | Convolution, GEMM
• Measures: Infotheo, Gini | Infotheo, Euclidean, Softmax
• Special Functions: log | sigmoid, tanh, ReLU | exp
• Mathematical Optimization: Non-convex
• Data Characteristics
  • Categorical: + + +
  • Numeric: + + + +
• Data-Dep. Compute: Sorting, Bucketing, Data-dep. Branches | Counting, Bucketing, Data-dep. Branches
• Memory Access
  • Blocks: + +
  • Columns: +
  • Other: Predicate-based, Associative, Weighted Sampling
• Very Large Models: +
• Hybrid Method: Committee

Application: A Machine Learning Workloads Suite

• Building Blocks Coverage (partial list …)
  • Linear Algebra – GEMM, Inv., Factorization, …
  • Measures – Euclidean, InfoGain, RBF, …
  • Special Functions – log, exp, …
  • Math. Optimization – QP, EM, L-BFGS, SGD, …
  • Data – Num., Cat., Dense, Sparse, Feat. Dep., …
  • Data-dep. Compute – Sort, Bucket, KD Tree, …
  • Memory Access – Seq, Indexed, Pred, Rnd, …
  • Very large models – CNN, KNN, K-SVM, …

Building Block Type: Algorithms
• Dense Linear Algebra: K-Means, SVM, PCA, GMM, Logistic Reg.
• Sparse Linear Algebra: K-Means, SVM, PCA, Logistic Reg., ALS
• Data Dependency: Apriori, Decision Tree, Naïve Bayes, KNN, LDA, Walktrap
• Large Models: CNN

Our approach enables selecting representatives of the major building blocks

• Tasks Coverage: Classification, Clustering, Recommendation, Dimensionality Reduction, Rule Induction, Community Detection

Building Block Type: Algorithms; Data sets
• Dense Linear Algebra: K-Means, SVM, PCA, GMM, Logistic Reg.; Data sets: Clustered
• Sparse Linear Algebra: K-Means, SVM, PCA, Logistic Reg., ALS; Data sets: Graphs, Text
• Data Dependency: Apriori, Decision Tree, Naïve Bayes, KNN, LDA, Walktrap; Data sets: Clustered, Graphs, Bio Informatics, Text, Manufacturing
• Large Models: CNN; Data sets: Images

Which Datasets to Use?

There are publicly available datasets, but they may not cover all relevant sizes and characteristics. We complement them by simulating data.

• Power Law graph
• Small World graph (regular/random)
• SBM (few/many blocks)

Simulated graphs vary by: Density × graph model (above) × Size

http://snap.stanford.edu/data/
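To make the simulated-graph idea concrete, here is a minimal sketch of a stochastic block model (SBM) generator using only the standard library. It is an illustration under assumptions, not the generator used for the benchmark: the names `sbm_edges`, `p_in` and `p_out` are hypothetical, and the "few/many blocks", density and size knobs are mapped onto `block_sizes` and the two edge probabilities.

```python
import random
from typing import List, Set, Tuple


def sbm_edges(block_sizes: List[int], p_in: float, p_out: float,
              seed: int = 0) -> Tuple[List[int], Set[Tuple[int, int]]]:
    """Sample an undirected SBM graph.

    Nodes in the same block are linked with probability p_in,
    nodes in different blocks with probability p_out.
    Returns (block label per node, set of edges (u, v) with u < v).
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return labels, edges


def density(n: int, edges: Set[Tuple[int, int]]) -> float:
    """Fraction of the n*(n-1)/2 possible undirected edges that are present."""
    return 2 * len(edges) / (n * (n - 1))
```

Sweeping `block_sizes` (few large blocks versus many small ones) and the two probabilities gives graphs of controlled size and density; the quadratic pair loop is fine for small examples but would need a sparse sampling scheme for large graphs.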



Which Datasets to Use: Another Example

• Simulated Dense Clustered Datasets vary by
  • Number of dimensions
  • Number of samples
  • Number of clusters
  • Mixing proportion
    • Uniform, Power-law
  • Dependency structure
  • Cluster separation
    • c-separation*
  • Alignment in space
    • Scattered, Line, Sphere

* Dasgupta, S., Schulman, L., A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. JMLR, 8 (2007)
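The parameters listed above can be sketched as a small generator of spherical Gaussian clusters. This is an illustration under stated assumptions rather than the benchmark's generator: all function names are hypothetical, only the "Line" alignment is shown, and the separation rule used here (adjacent centers c·σ·√d apart) is one common way to write c-separation for spherical Gaussians, so the exact normalization in the cited paper should be checked before relying on it.

```python
import math
import random
from typing import List, Tuple


def mixing_proportions(k: int, kind: str = "uniform", alpha: float = 1.0) -> List[float]:
    """Cluster weights that sum to 1: equal, or decaying as 1 / rank**alpha."""
    if kind == "uniform":
        raw = [1.0] * k
    elif kind == "power-law":
        raw = [1.0 / (i + 1) ** alpha for i in range(k)]
    else:
        raise ValueError(f"unknown mixing kind: {kind}")
    total = sum(raw)
    return [w / total for w in raw]


def line_centers(k: int, dim: int, c: float, sigma: float) -> List[List[float]]:
    """Centers placed along the first axis, adjacent ones c * sigma * sqrt(dim) apart.

    Assumption: this spacing stands in for c-separation of spherical Gaussians.
    """
    gap = c * sigma * math.sqrt(dim)
    return [[i * gap] + [0.0] * (dim - 1) for i in range(k)]


def sample_clusters(n: int, centers: List[List[float]], weights: List[float],
                    sigma: float, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Draw n points: pick a cluster by weight, then add isotropic Gaussian noise."""
    rng = random.Random(seed)
    labels = rng.choices(range(len(centers)), weights=weights, k=n)
    points = [[rng.gauss(m, sigma) for m in centers[lab]] for lab in labels]
    return points, labels
```

Smaller values of `c` make clusters overlap more, and a power-law mixing proportion produces a few large clusters and many small ones, so the same algorithm can be driven into different regimes.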

Which Datasets to Use: Labeled Dense Clustered Data

Which Parameters / Configurations to Use?

Each algorithm should be run with implementations, configurations and parameters that exercise all of its building blocks.

Isn’t It Too Big for a Benchmark?

The benchmark should be concise – it should not contain tens of thousands of separate executions (one for each combination of algorithm, dataset and configuration).
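A quick count shows how fast the full grid grows. The 18 algorithms match the number reported later in this talk; the dataset and configuration counts below are made-up placeholders chosen only to illustrate the multiplication.

```python
from itertools import product

# Hypothetical grid: names and sizes are illustrative, not the suite's actual contents.
algorithms = [f"algo_{i}" for i in range(18)]
datasets = [f"data_{i}" for i in range(30)]
configurations = [f"config_{i}" for i in range(40)]

# One execution per (algorithm, dataset, configuration) combination.
executions = list(product(algorithms, datasets, configurations))
print(len(executions))  # 18 * 30 * 40 = 21600
```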

Reducing the Number of Workloads

We developed a WOrkload Optimization Framework (WOOF), which enables running many executions and clustering them by their hardware or software profiles.

We then select one representative for each bottleneck behavior.
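The talk does not say which clustering algorithm WOOF uses, so the following is only a sketch of the general idea under an assumption: plain k-means with a deterministic farthest-point initialization, followed by picking the execution closest to each cluster center. The function names and the profile format (a mapping from execution name to a numeric vector) are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def farthest_point_seeds(points: List[List[float]], k: int) -> List[int]:
    """Deterministic seeding: start at point 0, then repeatedly add the point
    farthest from all seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        nxt = max(range(len(points)),
                  key=lambda i: min(math.dist(points[i], points[s]) for s in seeds))
        seeds.append(nxt)
    return seeds


def select_representatives(profiles: Dict[str, Sequence[float]], k: int,
                           iters: int = 20) -> List[str]:
    """Cluster profile vectors into k groups and return one execution name per group."""
    names = list(profiles)
    points = [list(profiles[name]) for name in names]
    centers = [points[i] for i in farthest_point_seeds(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group (kept if empty).
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
    # The representative of a cluster is the real execution closest to its center.
    return [names[min(range(len(points)), key=lambda i: math.dist(points[i], c))]
            for c in centers]
```

Choosing an actual execution rather than the centroid itself matters here, because the benchmark needs a runnable workload for each behavior, not an averaged profile.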

Software Profiling

Software behavior is evaluated using the Linux perf tool. Thousands of executions are reduced to a few representatives of the behaviors encountered.
• Radial SVM (profile figure)
• Linear SVM (profile figure)

Hardware Profiling

Hardware behavior is evaluated using Yasin’s Top-Down methodology (*), identifying the percentage of time spent on each of the processor hotspots.

(*) A. Yasin – Top Down Analysis: never lost with Xeon perf counters – CERN workshop (2013)
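For orientation, the first level of the Top-Down method splits pipeline issue slots into four categories: Frontend Bound, Bad Speculation, Retiring and Backend Bound. The sketch below computes that split from raw counter values. The counter names and formulas follow the commonly published level-1 definitions for 4-wide Intel cores and should be treated as assumptions to verify against the event list of the target processor; the function name and the dictionary input format are hypothetical.

```python
from typing import Dict

# Issue slots per cycle. Assumption: a 4-wide core, as in the commonly
# published level-1 formulas; other microarchitectures may differ.
PIPELINE_WIDTH = 4


def top_down_level1(counters: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of issue slots attributed to each level-1 category."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    frontend = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    bad_speculation = (
        counters["UOPS_ISSUED.ANY"]
        - counters["UOPS_RETIRED.RETIRE_SLOTS"]
        + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]
    ) / slots
    # Whatever is left was stalled on back-end resources (memory or execution units).
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend,
    }
```

A vector of such fractions per execution is the kind of hardware profile that can then be clustered to find representatives.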

Hardware Profiling: Community Detection Example

Multiple executions of community detection algorithms are reduced into five representatives of different hardware behaviors

(*) L1, L2 and L3 are the three levels of caches of the processor


Hardware Profiling: Illustrating the Effect of Data Selection

Alternating Least Squares (ALS)

Different data characteristics cause different hardware profiles, and simulated data may introduce additional behaviors (projected on two dimensions using PCA)
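The two-dimensional PCA projection mentioned here can be sketched without external libraries using power iteration with Gram-Schmidt orthogonalization. This is one possible way to compute it, not necessarily how the original plots were produced; the function names are hypothetical, and a production analysis would more likely rely on a numerical library.

```python
import math
import random
from typing import List


def _orthogonalize(v: List[float], basis: List[List[float]]) -> List[float]:
    """Remove from v its components along each (unit-length) basis vector."""
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    return v


def pca_project(rows: List[List[float]], n_components: int = 2,
                iters: int = 500, seed: int = 0) -> List[List[float]]:
    """Project rows onto their first n_components principal directions."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components: List[List[float]] = []
    for _ in range(n_components):
        # Random start, kept orthogonal to the directions already found.
        v = _orthogonalize([rng.gauss(0.0, 1.0) for _ in range(d)], components)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            w = _orthogonalize(w, components)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [x / norm for x in w]
        components.append(v)
    return [[sum(r[i] * c[i] for i in range(d)) for c in components]
            for r in centered]
```

Applied to per-execution hardware profiles, the two output coordinates give the kind of scatter plot in which executions on different datasets separate into distinct groups.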

Benchmark Building Process

1. Algorithm Selection
2. Defining Parameter Sets
3. Defining Datasets
4. Reducing to Representatives
5. Results Analysis/Validation

Current Status and Next Steps
• Based on the above process we analyzed 18 algorithms, representing the main building blocks, with multiple datasets and configurations, and built a suite of 50 machine-learning workloads
• Both software developers and hardware architects inside Intel have started to use it and have gained interesting insights
• Work on completing the benchmark is currently in progress

Acknowledgments
• This project was initiated and guided by Dr. Shai Fine, Advanced Analytics, Intel
• Much of the presented results and analysis is due to the intensive work of the Advanced Analytics WOOF team: Chen Admati, Omer Barak, Omer Ben-porat, Roy Ben-shimol, Amir Chanovsky, Nufar Gaspar, Dima Hanukaev, Tom Hope, Litan Ilany, Nitzan Kalvari, Oren David Kimhi, Hagar Loeub, Michal Moran, Jacob Neiman, Yevgeni Nous, Yevgeni Reif, Yahav Shadmiy, Gilad Wallach
• Additional valuable contributions were made by: Assaf Araki, Ehud Cohen, Jason Dai, Boris Ginzburg, Sergey Goffman, Paul Kandel, Sergey Maidanov, Debbie Marr, Andrey Nikolaev, Gilad Olswang, Nir Peled, Ananth Sankaranarayanan, Nadathur Rajagopalan Satish, Ganesh Venkatesh, Brian D Womack

Thank you!
