machine learning systemml v3 -...

Machine Learning and SystemML

Nikolay Manchev
Data Scientist Europe
E-mail: [email protected]

@nikolaymanchev

© 2016 International Business Machines Corporation

A Simple Problem

• In this activity, you will analyze the relationship between educational attainment and median income using data from the American Community Survey (ACS): examine a scatter plot, find the linear model that best fits it, and solve problems using the resulting linear equation.

Educational Attainment                 Median Income in USD
Less than high school graduate         19’800
High school graduate                   28’500
Some college or associate’s degree     36’000
Bachelor’s degree                      49’500
Graduate or professional degree        63’000


Machine Learning

"Field of study that gives computers the ability to learn without being explicitly programmed"

Arthur Samuel, 1959


Advantages

• Machines can handle larger amounts of data

• Machines can work with high-dimensional data

• Machines can compute results faster


Enneract (9-dimensional hypercube)


Use-case #1

• Detecting potential "lemon cars"
– 2 million cars
– 8’000 cars reacquired
– 10 million repair cases
– 25 million parts exchanges

• Logistic regression model
– 22’000 input features
– Improved precision/recall by an order of magnitude
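As a reminder of what the precision/recall figure measures, here is a minimal sketch. The counts are made up for illustration and do not come from the lemon-car project; the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    tp: flagged cars that really were lemons
    fp: flagged cars that were fine
    fn: lemons the model failed to flag
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=80, fp=20, fn=120)
print(precision, recall)  # 0.8 0.4
```

With only 8’000 reacquired cars out of 2 million, accuracy alone would be misleading, which is presumably why precision and recall are the metrics quoted.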


Machine Learning

Supervised Machine Learning

• We provide a training set of labelled examples and fit a model to predict the correct labels using the features.

Unsupervised Machine Learning

• No desired output is provided. The model finds similarities in the data based on the features alone.


Use-case #2

• Large Holiday operator
• Looking to enrich their web shop with custom recommendations

[Diagram: Search, Result, Recommend. Example: "all inclusive Canary Islands"; recommended: Sardinia, Sicily, Majorca, Ibiza]


Piece of cake

Collaborative filtering

• Based on a user-to-item rating matrix

• Computes a similarity measure between users

• Makes a prediction

           Sardinia   Majorca   …   Aspen
User #1    4          -         …   1
User #2    -          -         …   5
…          …          …         …   …
User #n    -          5         …   -
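The three steps above can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity and a similarity-weighted average; the slides do not say which similarity measure was used. The ratings only loosely follow the table, and the extra values are made up.

```python
from math import sqrt

# Hypothetical ratings: user -> {destination: rating}; a missing key means "not rated".
ratings = {
    "user1": {"Sardinia": 4, "Aspen": 1},
    "user2": {"Sardinia": 1, "Majorca": 2, "Aspen": 5},
    "user3": {"Sardinia": 5, "Majorca": 5, "Aspen": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None


print(predict("user1", "Majorca"))
```

Here user1 rates like user3 and unlike user2, so the predicted Majorca rating (roughly 4.1) lands closer to user3's 5 than to user2's 2. A user with no ratings at all gives the method nothing to compare.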


Unsupervised learning to the rescue

• Mixture of Gaussians model
• Based on search strings
• n fixed classes
• Hand-crafted rules tailored to classes
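To make the mixture-of-Gaussians idea concrete, here is a minimal sketch of the EM algorithm for a one-dimensional mixture. It is an illustration on made-up numeric data, not the model from the use case, which worked on features derived from search strings; the function names and starting values are assumptions.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def fit_gmm_1d(data, initial_means, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; one component per initial mean."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([value / total for value in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsed component
    return weights, means, variances


# Made-up data with two obvious groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(data, initial_means=[0.0, 6.0])
print(weights, means, variances)
```

No labels are supplied: the two classes emerge from the data alone. Each point can then be assigned to the component with the highest responsibility, and class-specific rules applied on top.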


Use-case #2

• Large Holiday operator in the UK
• Looking to enrich their web shop with custom recommendations

[Diagram: Search, Classifier, Recommend. Example searches: all inclusive, H10 Rubicon, Regency Country Club, Taurito Princess. Recommendation lists: (1. Sardinia, 2. Sicily, 3. Majorca, 4. Ibiza) and (1. Corralejo, 2. Costa Calma, 3. Barracuda Point)]


It’s Big Data


Why Spark

• Traditional approach – MapReduce jobs

[Diagram: Input, HDFS Read, Iteration 1 (CPU, Memory), HDFS Write, HDFS Read, Iteration 2 (CPU, Memory), HDFS Write, Result]

• The Spark approach – keep data in memory, distribute the execution

[Diagram: Input, HDFS Read, Iteration 1 (CPU, Memory), Iteration 2 (CPU, Memory). Annotations: faster than network & disk; zero read/write disk bottleneck; chain job output into new job input]


IBM’s Commitment to Spark

Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects


A Simple Problem

• In this activity, you will analyze the relationship between educational attainment and median income using data from the American Community Survey (ACS): examine a scatter plot, find the linear model that best fits it, and solve problems using the resulting linear equation.

Educational Attainment                 Median Income in USD
Less than high school graduate         19’800
High school graduate                   28’500
Some college or associate’s degree     36’000
Bachelor’s degree                      49’500
Graduate or professional degree        63’000


Find the best fitting line
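A minimal least-squares sketch for the income table. It rests on one assumption not made in the slides: the five attainment levels are encoded as the equally spaced numbers 1 to 5. The function name is hypothetical.

```python
# Median income in USD, taken from the table in "A Simple Problem".
incomes = [19800, 28500, 36000, 49500, 63000]
# Assumption: encode the attainment levels as 1..5 (equal spacing).
levels = [1, 2, 3, 4, 5]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(levels, incomes)
print(slope, intercept)  # 10740.0 7140.0
```

Under this encoding, each step up in educational attainment corresponds to roughly 10’740 USD more median income. A different encoding of the levels would give a different line.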


We always look for patterns


Use case #3

• Predictive model for a bank campaign
• We want to predict successful outcomes


You need Data Scientists

Algorithms are NOT the problem
Understanding what data goes into those algorithms and how to interpret the results is the crux of the matter

Be very, very careful
Involving a data scientist after you've gathered the data is like involving a doctor after the patient....



Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark


Big Data University - free online training http://bigdatauniversity.com/


Data Science before “Big Data”


Enter “Big Data”


Obvious solution “Big Data”



Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark
• IBM will open source SystemML and collaborate with Databricks to advance Spark’s machine learning capabilities


Linear Regression Refresher

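• Simple Linear Regression
– Dependent variable (y)
– Independent variables (X)

• In order to estimate the parameters $\theta$ we have to minimize the regularized squared error

$$J(\theta) = \lVert y - X\theta \rVert^2 + \theta^\top \operatorname{diag}(\lambda)\,\theta$$

• There is an elegant closed-form solution that minimizes $J(\theta)$:

$$\theta = \left(X^\top X + \operatorname{diag}(\lambda)\right)^{-1} X^\top y$$

We can solve using R:

a = t(X) %*% X + diag(lambda);
b = t(X) %*% y;
theta = solve(a,b);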


Linear Regression - Execution

a = t(X) %*% X + diag(lambda);
b = t(X) %*% y;
theta = solve(a,b);

Input data
• X: 500 features, 300M observations, 4TB text file
• y: 300M observations, 9GB text file

Cluster Configuration
• 3.5 GB Map Task JVM
• 7 GB In-memory Master JVM
• 128 MB HDFS block size

[Diagram: X and yT are split into blocks of 1k rows. MAP tasks compute XTX and yTX for each 1k block; REDUCE sums the partial results into a and bT. In-memory computation, (a,b) < 2 MB: 1. get b, 2. call solve(a,b)]
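The map/reduce scheme on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not SystemML's generated plan: the helper names are hypothetical, the block size is 2 rather than 1k, and the data is a made-up toy set. The point is that X^T X and X^T y are sums over row blocks, so each block can be processed independently and only the small d x d result needs to fit in memory.

```python
from functools import reduce


def xtx_xty(block_x, block_y):
    """Map step: partial X^T X and X^T y for one block of rows."""
    d = len(block_x[0])
    a = [[sum(row[i] * row[j] for row in block_x) for j in range(d)]
         for i in range(d)]
    b = [sum(row[i] * y for row, y in zip(block_x, block_y)) for i in range(d)]
    return a, b


def add_partials(p, q):
    """Reduce step: element-wise sum of two (a, b) partial results."""
    (a1, b1), (a2, b2) = p, q
    a = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    b = [u + v for u, v in zip(b1, b2)]
    return a, b


def solve(a, b):
    """Solve a * theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (m[r][n] - s) / m[r][r]
    return theta


def fit(x, y, lam, block_size=2):
    """Blockwise version of: a = t(X) %*% X + diag(lambda); b = t(X) %*% y."""
    blocks = [(x[i:i + block_size], y[i:i + block_size])
              for i in range(0, len(x), block_size)]
    a, b = reduce(add_partials, (xtx_xty(bx, by) for bx, by in blocks))
    for i, value in enumerate(lam):
        a[i][i] += value  # add diag(lambda)
    return solve(a, b)


# Toy data generated from y = 1 + 2 * x; the first column is the intercept term.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
theta = fit(x, y, lam=[0.0, 0.0])
print(theta)
```

With lambda set to zero the fit recovers the intercept 1 and slope 2 used to generate the data. For the slide's 500 features, a is a 500 x 500 matrix of doubles, about 2 MB, which is why the final solve can run in memory on a single node.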


Changes that impact our implementation

• 3 times more attributes (1’500 features, 300M observations)

• 2 times more observations (500 features, 600M observations)

• The dataset fits in memory (100 features, 1M observations)

• Cluster configuration change

Baseline data set: 500 features, 300M observations
Cluster Configuration: 3.5 GB Map Task JVM, 7 GB In-memory Master JVM, 128 MB HDFS block size

[Diagrams: execution plans for each scenario, composed of the operations XTX, XTy, solve(a,b) and solve(XTX, XTy)]


To Summarize

• 3 lines of code
• Minor changes in the data set / cluster configuration result in
– 4 dramatically different execution plans
– major change in performance
– best solution becomes a non-working solution

• How can we manage this?


What’s in the SystemML box

High-level language front-ends

High-Level Operations (HOPs)
General representation of statements in the data analysis language

Low-Level Operations (LOPs)
General representation of operations in the runtime framework

Multiple execution environments


Backend performance


Out-of-the-box algorithms

Category                    Description
Descriptive Statistics      Univariate, Bivariate, Stratified Bivariate
Classification              Logistic Regression, Multi-class SVM, Naïve Bayes, Decision Trees, Random Forest
Clustering                  k-Means
Regression                  Linear Regression (System of equations, SGD)
Generalised Linear Models   Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli
                            Links for all distributions: identity, log, sq. root, inverse, 1/μ^2
                            Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit
Stepwise                    Linear, GLM
Dimensionality Reduction    PCA
Matrix Factorization        ALS
Survival Models             Kaplan-Meier, Cox
Predict                     Scoring
Transformation              Recoding, dummy coding, binning, scaling, missing value imputation


Summary

• Key features
– Cost-based compilation
– Out-of-the-box scalable machine learning algorithms
– Support for custom algorithms
• Write your own code and don’t worry about scalability, numeric stability, and optimization

• Use it standalone, with MR backend, or with Spark backend
– Fit into Spark APIs, consume and produce DataFrames
– ML Pipeline integration
– Use SystemML from Scala, Java, Python, R/SparkR
– BigR integration (package)


Additional Resources

SystemML is available on GitHub
https://github.com/SparkTC/systemml

An in-depth scientific perspective
• Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." ICDE 2011.
• Boehm, Matthias, et al. "SystemML’s Optimizer: Plan Generation for Large-Scale Machine Learning Programs." IEEE Data Eng. Bull. 37.3 (2014).
• Huang, Botong, et al. "Resource Elasticity for Large-Scale Machine Learning." SIGMOD 2015.
