Towards a Comprehensive Machine Learning Benchmark
Dr. Amitai Armon
Data Science Lead, Advanced Analytics, Intel
How Machine Learning Helps Intel
• Improving processes: Design, Manufacturing, Marketing & Sales
• Driving new offerings: Parkinson’s Disease Monitoring, Cloud Analytics Platform
A Machine Learning Benchmark is Obviously a Must
… but how can it incorporate the diversity of this domain and the ongoing and future changes?
Our Basic Approach: Cover the Building Blocks
We observed that the various Machine Learning algorithms are composed of several types of building blocks. These building blocks are what should be handled well.
The Machine Learning Building Block Types
• ML basic building blocks
  1. Linear Algebra
  2. Measures
  3. Special Functions
  4. Mathematical Optimization
  5. Data Characteristics
  6. Data-dependent Compute
  7. Memory Access
  8. Very large models
  9. Hybrid Methods
• ML Meta building blocks
  1. Learning Protocols
  2. Learning Phases
  3. Algorithmic Flow and Structure
Categories: Compute, Data, Compute - Data Interplay, Process
Machine Learning Building Blocks: Example
Linear Algebra
• GEMM
• Quadratic Form
• Commonly used algorithms
  • Inversion
  • Matrix Factorization
  • Eigendecomposition
  • Singular Value Decomposition (SVD)
• Need to support both Dense and Sparse
• Special matrices of interest
  • Symmetric – Covariance, Kernel
  • Stochastic – Row elements sum to 1
  • Boolean
  • Diagonal
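As a rough illustration of the linear algebra primitives listed on this slide, the following is a minimal pure-Python sketch of a dense GEMM, a quadratic form, and a check for a row-stochastic matrix. The function names and the list-of-lists matrix layout are assumptions made for this example; a real workload would call an optimized BLAS library rather than code like this.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for dense matrices stored as lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must agree"
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(m)]
        for i in range(n)
    ]


def quadratic_form(x, A):
    """Return x^T A x for a vector x and a square matrix A."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def is_row_stochastic(A, tol=1e-9):
    """True if every row is non-negative and sums to 1 (within tol)."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol for row in A)
```

For example, `quadratic_form(x, A)` with a covariance matrix `A` is the kind of symmetric-matrix operation the slide groups under "Quadratic Form".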
Machine Learning Building Blocks: Example
Data Characteristics
• Type and Format
  • Numeric/Categorical – 16b, 32b, 64b
• Sparse and Dense
  • Typical sizes, Sparsity structure
• Distribution
  • Univariate, Dependency Structure, Mixture, Even/Biased Class, Separability
[Table: usages (Advertising, SNA, Clinical, Genomics, Telco, IoT) by # of features (Small, Mid, Large), feature types (Categorical, Numeric, Time Series) and Sparse/Dense]
Example: Mapping Algorithms to Building Blocks
Algorithms: PCA, Decision Tree, Deep Learning - CNN, Apriori, Adaboost
• Linear Algebra: GEMM; Convolution, GEMM
• Measures: Infotheo, Gini; Infotheo, Euclidean, Softmax
• Special Functions: log; sigmoid, tanh, ReLU; exp
• Mathematical Optimization: Non-convex
• Data Characteristics
  • Categorical: + + +
  • Numeric: + + + +
• Data-Dep. Compute: Sorting, Bucketing, Data-dep. Branches; Counting, Bucketing, Data-dep. Branches
• Memory Access
  • Blocks: + +
  • Columns: +
  • Other: Predicate-based, Associative; Weighted Sampling
• Very Large Models: +
• Hybrid Method: Committee
Application: A Machine Learning Workloads Suite
• Building Blocks Coverage (partial list …)
  • Linear Algebra – GEMM, Inv., Factorization, …
  • Measures – Euclidean, InfoGain, RBF, …
  • Special Functions – log, exp, …
  • Math. Optimization – QP, EM, L-BFGS, SGD, …
  • Data – Num., Cat., Dense, Sparse, Feat. Dep., …
  • Data-dep. Compute – Sort, Bucket, KD Tree, …
  • Memory Access – Seq, Indexed, Pred, Rnd, …
  • Very large models – CNN, KNN, K-SVM, …
Building Block Type: Algorithms
• Dense Linear Algebra: K-Means, SVM, PCA, GMM, Logistic Reg.
• Sparse Linear Algebra: K-Means, SVM, PCA, Logistic Reg., ALS
• Data Dependency: Apriori, Decision Tree, Naïve Bayes, KNN, LDA, Walktrap
• Large Models: CNN
Our approach enables selecting representatives of the major building blocks
• Tasks Coverage: Classification, Clustering, Recommendation,
Dimensionality Reduction, Rule Induction, Community Detection
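One way to picture "selecting representatives of the major building blocks" is as a covering problem. The sketch below uses a simple greedy cover over the algorithm table from this slide; the greedy strategy, the function name `greedy_cover`, and the dictionary layout are assumptions for illustration only, since the talk does not say how the selection was actually made.

```python
# Building block types exercised by each algorithm, taken from the slide's table.
DENSE, SPARSE = "Dense Linear Algebra", "Sparse Linear Algebra"
DATA_DEP, LARGE = "Data Dependency", "Large Models"

ALGO_BLOCKS = {
    "K-Means": {DENSE, SPARSE},
    "SVM": {DENSE, SPARSE},
    "PCA": {DENSE, SPARSE},
    "GMM": {DENSE},
    "Logistic Reg.": {DENSE, SPARSE},
    "ALS": {SPARSE},
    "Apriori": {DATA_DEP},
    "Decision Tree": {DATA_DEP},
    "Naïve Bayes": {DATA_DEP},
    "KNN": {DATA_DEP},
    "LDA": {DATA_DEP},
    "Walktrap": {DATA_DEP},
    "CNN": {LARGE},
}


def greedy_cover(algo_blocks, required):
    """Pick algorithms until every required building block type is covered.

    Hypothetical helper: at each step, take the algorithm that covers the most
    still-uncovered types (ties broken alphabetically).
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        best = max(sorted(algo_blocks), key=lambda a: len(algo_blocks[a] & uncovered))
        gained = algo_blocks[best] & uncovered
        if not gained:
            raise ValueError(f"no algorithm covers: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


chosen = greedy_cover(ALGO_BLOCKS, {DENSE, SPARSE, DATA_DEP, LARGE})
```

A minimal cover like this only guarantees that each building block type appears once; covering the full range of tasks listed above requires keeping more algorithms per type.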
Building Block Type: Algorithms and Data sets
• Dense Linear Algebra
  • Algorithms: K-Means, SVM, PCA, GMM, Logistic Reg.
  • Data sets: Clustered
• Sparse Linear Algebra
  • Algorithms: K-Means, SVM, PCA, Logistic Reg., ALS
  • Data sets: Graphs, Text
• Data Dependency
  • Algorithms: Apriori, Decision Tree, Naïve Bayes, KNN, LDA, Walktrap
  • Data sets: Clustered, Graphs, Bio Informatics, Text, Manufacturing
• Large Models
  • Algorithms: CNN
  • Data sets: Images
Which Datasets to Use?
There are publicly available datasets, but they may not cover all relevant sizes and characteristics. We complement them by simulating data.
Simulated graphs, with each type generated across a range of densities and sizes (type × density × size):
• Power Law graph
• Small World graph (regular/random)
• SBM (few/many blocks)
http://snap.stanford.edu/data/
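To make the simulated-graph idea concrete, here is a minimal sketch of a stochastic block model (SBM) generator, in which the number of blocks, the size, and the density can be varied independently. The function names, the parameters `p_in` and `p_out`, and the edge-list representation are assumptions for this example, not the generator used in the talk.

```python
import random


def simulate_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM graph.

    block_sizes: number of nodes in each block (few or many blocks).
    p_in / p_out: edge probability within a block / between blocks.
    Returns (block_of, edges), where block_of[v] is the block index of node v.
    """
    rng = random.Random(seed)
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block_of[u] == block_of[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return block_of, edges


def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))
```

Raising `p_in` and `p_out` together increases density, while lengthening `block_sizes` moves from the "few blocks" to the "many blocks" regime.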
Which Datasets to Use: Another Example
• Simulated Dense Clustered Datasets vary by
  • Number of dimensions
  • Number of samples
  • Number of clusters
  • Mixing proportion
    • Uniform, Power-law
  • Dependency structure
  • Cluster separation
    • c-separation*
  • Alignment in space
    • Scattered, Line, Sphere
* Dasgupta, S., Schulman, L., A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. JMLR, 8 (2007)
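A minimal sketch of such a simulated clustered dataset follows. It assumes the usual definition of c-separation for spherical Gaussians (centers at least c · σ · √d apart, for standard deviation σ in d dimensions), uniform mixing proportions, and the "Line" alignment. The function names and parameters are hypothetical and chosen only for this illustration.

```python
import math
import random


def simulate_clusters(n_samples, dim, n_clusters, c, sigma=1.0, seed=0):
    """Sample spherical Gaussian clusters whose centers are c-separated.

    Centers are placed on a line along the first axis ("Line" alignment),
    spaced by exactly c * sigma * sqrt(dim). Mixing proportions are uniform.
    """
    rng = random.Random(seed)
    spacing = c * sigma * math.sqrt(dim)
    centers = [[k * spacing] + [0.0] * (dim - 1) for k in range(n_clusters)]
    points, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(n_clusters)  # uniform mixing proportion
        points.append([rng.gauss(mu, sigma) for mu in centers[k]])
        labels.append(k)
    return centers, points, labels


def min_center_distance(centers):
    """Smallest Euclidean distance between any two cluster centers."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    )
```

Lowering `c` makes clusters overlap more, which is the "cluster separation" knob on the slide; a power-law mixing proportion or a different alignment would replace the `randrange` call and the center placement respectively.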
Which Parameters / Configurations to Use?
We should run each algorithm with implementations, configurations and parameters that exercise all of its building blocks
Isn’t It Too Big for a Benchmark?
The benchmark should be concise – it should not contain tens of thousands of separate executions (one for each algorithm, dataset and configuration)
Reducing the Number of Workloads
We developed a WOrkload Optimization Framework (WOOF), which enables running many executions and clustering them by hardware or software profiles
We then select one representative for each bottleneck behavior
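The reduction step can be sketched as clustering profile vectors and keeping one real execution per cluster. The example below uses plain k-means with a deterministic farthest-first start and picks the execution nearest each centroid. The choice of k-means, the function names, and the profile layout are assumptions for illustration; the talk does not specify which clustering method WOOF uses.

```python
import math


def kmeans(profiles, k, iters=50):
    """Plain k-means over profile vectors, with farthest-first initialisation."""
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        far = max(profiles, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in profiles]
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(profiles, assign) if a == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign


def pick_representatives(profiles, k):
    """Return indices of one execution per cluster: the one closest to its centroid."""
    centroids, assign = kmeans(profiles, k)
    reps = []
    for j, c in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == j]
        if members:
            reps.append(min(members, key=lambda i: math.dist(profiles[i], c)))
    return sorted(reps)
```

Each profile here would be a vector of per-execution measurements (for instance, the share of time attributed to each bottleneck), so the returned indices identify actual executions that can be rerun as benchmark workloads.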
Software Profiling
Software behavior is evaluated using the Linux perf tool. Thousands of executions are reduced to a few representatives of the behaviors encountered.
[Figures: software profiles for Radial SVM and Linear SVM]
Hardware Profiling
Hardware behavior is evaluated using Yasin’s Top-Down methodology (*), identifying the percentage of time spent on each of the processor hotspots
(*) A. Yasin – Top Down Analysis: never lost with Xeon perf counters – CERN workshop (2013)
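As a hedged illustration of what a Top-Down profile looks like, the sketch below computes the four top-level categories (Frontend Bound, Bad Speculation, Retiring, Backend Bound) from raw counter values. The event names and formulas follow the published description of the method for 4-wide Intel cores as best recalled here; they differ between microarchitectures, so treat them, the function name, and the sample numbers as assumptions rather than the exact setup used in this work.

```python
def top_down_level1(counters, width=4):
    """Split pipeline slots into the four top-level Top-Down categories.

    counters: dict of raw event counts (event names are assumed, see above).
    width: issue width of the core in slots per cycle (assumed to be 4).
    """
    slots = width * counters["CPU_CLK_UNHALTED.THREAD"]
    frontend = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    bad_spec = (
        counters["UOPS_ISSUED.ANY"]
        - counters["UOPS_RETIRED.RETIRE_SLOTS"]
        + width * counters["INT_MISC.RECOVERY_CYCLES"]
    ) / slots
    backend = 1.0 - frontend - bad_spec - retiring
    return {
        "Frontend Bound": frontend,
        "Bad Speculation": bad_spec,
        "Retiring": retiring,
        "Backend Bound": backend,
    }


# Made-up counter values, purely for illustration.
sample = {
    "CPU_CLK_UNHALTED.THREAD": 1000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800,
    "UOPS_RETIRED.RETIRE_SLOTS": 1600,
    "UOPS_ISSUED.ANY": 1800,
    "INT_MISC.RECOVERY_CYCLES": 50,
}
profile = top_down_level1(sample)
```

The resulting dictionary is the kind of per-execution vector that can be clustered to find executions with the same bottleneck behavior.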
Hardware Profiling: Community Detection Example
Multiple executions of community detection algorithms are reduced to five representatives of different hardware behaviors
(*) L1, L2 and L3 are the three levels of caches of the processor
Hardware Profiling: Illustrating the Effect of Data Selection
Alternating Least Squares (ALS)
Different data characteristics cause different hardware profiles, and simulated data may introduce additional behaviors (projected on two dimensions using PCA)
Benchmark Building Process
Algorithm Selection
Defining Parameter Sets
Defining Datasets
Reducing to Representatives
Results Analysis/Validation
Current Status and Next Steps
• Based on the above process we analyzed 18 algorithms, representing the main building blocks, with multiple datasets and configurations, and built a suite of 50 machine-learning workloads
• Both software developers and hardware architects inside Intel started to use it and gained interesting insights
• Work on completing the benchmark is currently in progress
Acknowledgments
• This project was initiated and guided by Dr. Shai Fine, Advanced Analytics, Intel
• Many of the presented results and analyses are due to the intensive work of the Advanced Analytics WOOF team: Chen Admati, Omer Barak, Omer Ben-porat, Roy Ben-shimol, Amir Chanovsky, Nufar Gaspar, Dima Hanukaev, Tom Hope, Litan Ilany, Nitzan Kalvari, Oren David Kimhi, Hagar Loeub, Michal Moran, Jacob Neiman, Yevgeni Nous, Yevgeni Reif, Yahav Shadmiy, Gilad Wallach
• Additional valuable contributions were made by: Assaf Araki, Ehud Cohen, Jason Dai, Boris Ginzburg, Sergey Goffman, Paul Kandel, Sergey Maidanov, Debbie Marr, Andrey Nikolaev, Gilad Olswang, Nir Peled, Ananth Sankaranarayanan, Nadathur Rajagopalan Satish, Ganesh Venkatesh, Brian D Womack