Machine Learning and SystemML
Nikolay Manchev
Data Scientist Europe
@nikolaymanchev
© 2016 International Business Machines Corporation 2
A Simple Problem
• In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS: examine a scatter plot, find the linear model that best fits it, and solve problems using the linear equation.
Educational Attainment                Median Income in USD
Less than high school graduate        19’800
High school graduate                  28’500
Some college or associate’s degree    36’000
Bachelor’s degree                     49’500
Graduate or professional degree       63’000
Machine Learning
"Field of study that gives computers the ability to learn without being explicitly programmed"
Arthur Samuel, 1959
Advantages
• Machines can handle larger amounts of data
• Machines can work with high-dimensional data
• Machines can work it out faster
Enneract (9-dimensional hypercube)
Use-case #1
• Detecting potential "lemon cars"
  – 2 million cars
  – 8’000 cars reacquired
  – 10 million repair cases
  – 25 million parts exchanges
• Logistic regression model
  – 22’000 input features
  – Improved precision/recall by an order of magnitude
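Precision and recall are the natural metrics here because reacquired cars are rare (8’000 out of 2 million), so plain accuracy would say very little. A minimal sketch of how the two metrics are computed; the function name and the toy labels are illustrative assumptions, not data from the use case:

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks a lemon car."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels, made up for illustration only
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Precision is the share of flagged cars that really are lemons; recall is the share of real lemons that get flagged.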
Machine Learning
Supervised Machine Learning
• We provide a training set of labelled examples and fit a model to predict the correct labels using the features.
Unsupervised Machine Learning
• No desired output is provided. The model finds similarities in the data based on the features alone.
Use-case #2
• Large holiday operator
• Looking to enrich their web shop with custom recommendations
Search: all inclusive
Result: Canary Islands
Recommend:
• Sardinia
• Sicily
• Majorca
• Ibiza
Piece of cake
Collaborative filtering
• Based on user to item rating matrix
• Computes similarity measure between users
• Makes a prediction
           Sardinia   Majorca   …   Aspen
User #1    4          -         …   1
User #2    -          -         …   5
…          …          …         …   …
User #n    -          5         …   -
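The three steps above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity over co-rated items and a similarity-weighted average as the prediction rule; the slide does not say which measure was used, and the ratings below are made up:

```python
from math import sqrt

# Toy user-to-item ratings; a missing key means "not rated" (values are made up)
ratings = {
    "user1": {"Sardinia": 4, "Ibiza": 5, "Aspen": 1},
    "user2": {"Sardinia": 1, "Ibiza": 1, "Aspen": 5, "Majorca": 1},
    "user3": {"Sardinia": 5, "Ibiza": 4, "Aspen": 1, "Majorca": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

prediction = predict("user1", "Majorca")
```

Because user1's ratings resemble user3's far more than user2's, the predicted rating for Majorca is pulled towards user3's high score.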
Unsupervised learning to the rescue
• Mixture of Gaussians model
• Based on search strings
• n fixed classes
• Hand-crafted rules tailored to classes
Use-case #2
• Large holiday operator in the UK
• Looking to enrich their web shop with custom recommendations
Search, Classifier, Recommend
Search strings: all inclusive, H10 Rubicon, Regency Country Club, Taurito Princess
Recommendations: 1. Sardinia 2. Sicily 3. Majorca 4. Ibiza
Recommendations: 1. Corralejo 2. Costa Calma 3. Barracuda Point
It’s Big Data
Why Spark
• Traditional approach – MapReduce jobs
  – Each iteration reads its input from HDFS and writes its result back to HDFS
  – Job output is chained into new job input
  – Disk bottleneck
• The Spark approach – keep data in memory, distribute the execution
  – The input is read from HDFS once
  – Zero read/write between iterations: intermediate data stays in memory
  – Faster than network & disk
IBM’s Commitment to Spark
Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
A Simple Problem
• In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS: examine a scatter plot, find the linear model that best fits it, and solve problems using the linear equation.
Educational Attainment                Median Income in USD
Less than high school graduate        19’800
High school graduate                  28’500
Some college or associate’s degree    36’000
Bachelor’s degree                     49’500
Graduate or professional degree       63’000
Find the best fitting line
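A minimal sketch of fitting that line by ordinary least squares. It assumes the five education levels are encoded as the equally spaced codes 1 to 5, which is an illustrative choice and not something the slides specify; the income figures are the ones from the table:

```python
# Education levels encoded as 1..5 (assumption: equally spaced ordinal codes)
levels = [1, 2, 3, 4, 5]
income = [19800, 28500, 36000, 49500, 63000]  # median income in USD, from the table

n = len(levels)
mean_x = sum(levels) / n
mean_y = sum(income) / n

# Closed-form least-squares estimates for income = intercept + slope * level
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, income))
sxx = sum((x - mean_x) ** 2 for x in levels)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"income = {intercept:.0f} + {slope:.0f} * level")
```

Under this encoding each step up in educational attainment is associated with roughly 10’740 USD more in median income.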
We always look for patterns
Use case #3
• Predictive model for a bank campaign
• We want to predict successful outcomes
You need Data Scientists
Algorithms are NOT the problem
Understanding what data goes into those algorithms and how to interpret the results is the crux of the matter
Be very, very careful
Involving a data scientist after you've gathered the data is like involving a doctor after the patient....
IBM’s Commitment to Spark
Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark
Big Data University - free online training http://bigdatauniversity.com/
Data Science before “Big Data”
Enter “Big Data”
Obvious solution “Big Data”
IBM’s Commitment to Spark
Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark
• IBM will open source SystemML and collaborate with Databricks to advance Spark’s machine learning capabilities
Linear Regression Refresher
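• Simple Linear Regression
  – Dependent variable (y)
  – Independent variables (X)
• In order to estimate the parameters $\theta$ we have to minimize
  $J(\theta) = \lVert X\theta - y \rVert^2 + \lambda \lVert \theta \rVert^2$
• There is an elegant solution that minimizes $J(\theta)$:
  $\theta = (X^T X + \lambda I)^{-1} X^T y$
We can solve using R
  a = t(X) %*% X + diag(lambda);
  b = t(X) %*% y;
  theta = solve(a,b);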
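The same three steps can be traced on a tiny example. The sketch below is plain Python rather than R, uses made-up data generated from y = 1 + 2x, and assumes `lambda` is a vector of per-parameter regularisation constants (set to zero here, which reduces the problem to ordinary least squares). The helper functions are illustrative, not part of any library:

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a * theta = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i][0]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (aug[i][n] - s) / aug[i][i]
    return theta

# Toy data from y = 1 + 2*x; the first column of X is the intercept term
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[1.0], [3.0], [5.0], [7.0]]
lam = [0.0, 0.0]  # assumed per-parameter regularisation constants

Xt = transpose(X)
a = matmul(Xt, X)
for i, value in enumerate(lam):
    a[i][i] += value      # a = t(X) %*% X + diag(lambda)
b = matmul(Xt, y)         # b = t(X) %*% y
theta = solve(a, b)       # theta = solve(a, b)
```

Note that `a` is only as large as the number of features squared, regardless of how many observations there are.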
Linear Regression - Execution
  a = t(X) %*% X + diag(lambda);
  b = t(X) %*% y;
  theta = solve(a,b);
Input data
  X: 500 features, 300M observations, 4TB text file
  y: 300M observations, 9GB text file
Cluster Configuration
  3.5 GB Map Task JVM
  7 GB In-memory Master JVM
  128 MB HDFS block size
Execution plan
  MAP: X and yT are split into 1k blocks; XTX and yTX are computed for each 1k block
  REDUCE: the partial results are aggregated into a and bT
  In-memory computation, (a,b) < 2 MB
    1. get b
    2. call solve(a,b)
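The plan works because XTX and XTy are sums over rows, so they can be computed per block and then added up. A minimal single-machine sketch of that idea, assuming the blocks are groups of rows; the function names, the toy data, and the block size of 2 (standing in for the 1k blocks) are all illustrative:

```python
from functools import reduce

def partial_products(block):
    """Map step: compute t(X) %*% X and t(X) %*% y for one block of rows."""
    rows = [x for x, _ in block]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in block) for i in range(k)]
    return xtx, xty

def add_partials(p, q):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(p[0], q[0])]
    xty = [u + v for u, v in zip(p[1], q[1])]
    return xtx, xty

# Toy rows of (features, label); values are made up
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0),
        ([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)]
block_size = 2
blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Blockwise result versus the same quantities computed in a single pass
a, b = reduce(add_partials, map(partial_products, blocks))
full_a, full_b = partial_products(data)
```

The reduced `a` and `b` are small enough to fit in memory, which is why adding `diag(lambda)` and calling `solve(a,b)` can happen on a single node afterwards.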
Changes that impact our implementation
• 3 times more attributes: 1’500 features, 300M observations
• 2 times more observations: 500 features, 600M observations
• The dataset fits in memory: 100 features, 1M observations
• Cluster configuration change: 500 features, 300M observations, with a 1.5 GB Map Task JVM
Cluster Configuration for the first three cases: 3.5 GB Map Task JVM, 7 GB In-memory Master JVM, 128 MB HDFS block size
Cluster Configuration for the fourth case: 1.5 GB Map Task JVM, 7 GB In-memory Master JVM, 128 MB HDFS block size
Execution plans shown on the slide:
– XTX, XTX, XTy, solve(a,b)
– XTX, XTy, XTy, solve(a,b)
– solve(XTX, XTy)
– XTX, XTy, XTy, solve(a,b)
To Summarize
• 3 lines of code
• Minor changes in the data set / cluster configuration result in
  – 4 dramatically different execution plans
  – major change in performance
  – best solution becomes a non-working solution
• How can we manage this?
What’s in the SystemML box
High-Level Operations (HOPs)
General representation of statements in the data analysis language
Low-Level Operations (LOPs)
General representation of operations in the runtime framework
High-level language front-ends
Multiple execution environments
Backend performance
Out-of-the-box algorithms
Category                     Description
Descriptive Statistics       Univariate, Bivariate, Stratified Bivariate
Classification               Logistic Regression, Multi-class SVM, Naïve Bayes, Decision Trees, Random Forest
Clustering                   k-Means
Regression                   Linear Regression (System of equations, SGD)
Generalised Linear Models    Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli
                             Links for all distributions: identity, log, sq. root, inverse, 1/μ^2
                             Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit
Stepwise                     Linear, GLM
Dimensionality Reduction     PCA
Matrix Factorization         ALS
Survival Models              Kaplan Meier, Cox
Predict                      Scoring
Transformation               Recoding, dummy coding, binning, scaling, missing value imputation
Summary
• Key features
  – Cost-based compilation
  – Out-of-the-box scalable machine learning algorithms
  – Support for custom algorithms
    • Write your own code and don’t worry about scalability, numeric stability, and optimization
• Use it standalone, with MR backend, or with Spark backend
  – Fits into Spark APIs, consumes and produces DataFrames
  – ML Pipeline integration
  – Use SystemML from Scala, Java, Python, R/SparkR
  – BigR integration (package)
Additional Resources
SystemML is available on GitHub
https://github.com/SparkTC/systemml
An in-depth scientific perspective
• Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." ICDE 2011.
• Boehm, Matthias, et al. "SystemML’s Optimizer: Plan Generation for Large-Scale Machine Learning Programs." IEEE Data Eng. Bull. 37.3 (2014).
• Huang, Botong, et al. "Resource Elasticity for Large-Scale Machine Learning." SIGMOD 2015.