Machine Learning and SystemML
Nikolay Manchev, Data Scientist Europe
E-mail: nmanchev@uk.ibm.com
@nikolaymanchev


Post on 03-Aug-2020


Source: Machine Learning SystemML v3, files.meetup.com/7770922/Machine_Learning_SystemML.pdf



© 2016 International Business Machines Corporation

A Simple Problem

• In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS (American Community Survey). You will examine a scatter plot and the linear model that best fits it, then solve problems using the linear equation.

Educational Attainment                 Median Income in USD
Less than high school graduate         19’800
High school graduate                   28’500
Some college or associate’s degree     36’000
Bachelor’s degree                      49’500
Graduate or professional degree        63’000


Machine Learning

"Field of study that gives computers the ability to learn without being explicitly programmed"

Arthur Samuel, 1959


Advantages

• Machines can handle larger volumes of data

• Machines can work with high-dimensional data

• Machines can work it out faster


Enneract (9-dimensional hypercube)


Use-case #1

• Detecting potential "lemon cars"
– 2 million cars
– 8’000 cars reacquired
– 10 million repair cases
– 25 million parts exchanges

• Logistic regression model
– 22’000 input features
– Improved precision/recall by an order of magnitude


Machine Learning

Supervised Machine Learning

• We provide a training set of labelled examples and fit a model to predict the correct labels using the features.

Unsupervised Machine Learning

• No desired output is provided. The model finds similarities in the data based on the features alone.


Use-case #2

• Large holiday operator
• Looking to enrich their web shop with custom recommendations

Search → Result → Recommend

Search: all inclusive Canary Islands

Recommend:
• Sardinia
• Sicily
• Majorca
• Ibiza


Piece of cake

Collaborative filtering

• Based on a user-to-item rating matrix

• Computes a similarity measure between users

• Makes a prediction

          Sardinia  Majorca  …  Aspen
User #1   4         -        …  1
User #2   -         -        …  5
…         …         …        …  …
User #n   -         5        …  -
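The three steps above can be sketched in a few lines of Python. This is only an illustration of user-based collaborative filtering: the ratings, the user names, and the choice of cosine similarity over co-rated items are all assumptions made here, since the slides do not say which similarity measure or data the project used.

```python
from math import sqrt

# Hypothetical ratings (1-5); a missing key means the user has not rated that destination.
ratings = {
    "user1": {"Sardinia": 4, "Majorca": 5, "Aspen": 1},
    "user2": {"Sardinia": 5, "Majorca": 4, "Ibiza": 5},
    "user3": {"Sardinia": 1, "Aspen": 5, "Ibiza": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None

# user1 has not rated Ibiza; the estimate leans towards user2, whose tastes are closer.
predicted_ibiza = predict("user1", "Ibiza")
print(predicted_ibiza)
```

With a matrix as sparse as the one on the slide, most user pairs share few or no rated items, which is one reason the approach can struggle in practice.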


Unsupervised learning to the rescue

• Mixture of Gaussians model

• Based on search strings
• n fixed classes
• Hand-crafted rules tailored to classes


Use-case #2

• Large holiday operator in the UK
• Looking to enrich their web shop with custom recommendations

Search → Classifier → Recommend

Search: all inclusive, H10 Rubicon, Regency Country Club, Taurito Princess

1. Sardinia
2. Sicily
3. Majorca
4. Ibiza

1. Corralejo
2. Costa Calma
3. Barracuda Point


It’s Big Data


Why Spark

• Traditional approach – MapReduce jobs

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result. Each job's output is chained into the next job's input, so the disk becomes the bottleneck.]

• The Spark approach – keep data in memory, distribute the execution

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory). Zero read/write between iterations; memory is faster than network & disk.]


IBM’s Commitment to Spark

Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects


A Simple Problem

• In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS (American Community Survey). You will examine a scatter plot and the linear model that best fits it, then solve problems using the linear equation.

Educational Attainment                 Median Income in USD
Less than high school graduate         19’800
High school graduate                   28’500
Some college or associate’s degree     36’000
Bachelor’s degree                      49’500
Graduate or professional degree        63’000


Find the best fitting line
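One common way to find that line is ordinary least squares. The sketch below assumes the five attainment levels are coded 1 to 5 in table order; that encoding is chosen here for illustration only, as the slides do not state one. It fits income = intercept + slope × level using the closed-form formulas for a single predictor.

```python
# Median income in USD from the table, in the order the attainment levels are listed.
levels = [1, 2, 3, 4, 5]            # assumed ordinal coding of educational attainment
income = [19800, 28500, 36000, 49500, 63000]

n = len(levels)
mean_x = sum(levels) / n
mean_y = sum(income) / n

# slope = covariance(x, y) / variance(x); intercept makes the line pass through the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, income))
s_xx = sum((x - mean_x) ** 2 for x in levels)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict_income(level):
    """Income predicted by the fitted line for a given attainment level."""
    return intercept + slope * level

print(f"income = {intercept:.0f} + {slope:.0f} * level")
```

Under this coding each step up in attainment adds the same fitted amount to the median income, which is exactly the kind of statement the linear equation lets you make.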


We always look for patterns


Use case #3

• Predictive model for a bank campaign
• We want to predict successful outcomes


You need Data Scientists

Algorithms are NOT the problem
Understanding what data goes into those algorithms and how to interpret the results is the crux of the matter

Be very, very careful
Involving a data scientist after you've gathered the data is like involving a doctor after the patient....


IBM’s Commitment to Spark

Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark


Big Data University - free online training
http://bigdatauniversity.com/


Data Science before “Big Data”


Enter “Big Data”


Obvious solution “Big Data”


IBM’s Commitment to Spark

Official announcement (15th June 2015)
• IBM will build Spark into the core of its analytics and commerce platforms
• IBM will commit over 3,500 researchers & developers to work on Spark-related projects
• IBM will educate more than 1’000’000 data scientists on Spark
• IBM will open source SystemML and collaborate with Databricks to advance Spark’s machine learning capabilities


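Linear Regression Refresher

• Simple Linear Regression
– Dependent variable (y)
– Independent variables (X)

• In order to estimate the parameters $\theta$ we have to minimize the regularized sum of squared errors $J(\theta) = \lVert y - X\theta \rVert_2^2 + \theta^\top \mathrm{diag}(\lambda)\,\theta$

• There is an elegant closed-form solution that minimizes $J(\theta)$: $\theta = (X^\top X + \mathrm{diag}(\lambda))^{-1} X^\top y$

We can solve using R:
a = t(X) %*% X + diag(lambda);
b = t(X) %*% y;
theta = solve(a,b);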
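For readers who want to trace the three R lines step by step, here is a rough Python equivalent using only the standard library. The helper names (`transpose`, `matmul`, `solve`, `fit`) and the toy data are assumptions made for this sketch; `solve` is a plain Gaussian elimination standing in for R's `solve(a, b)`.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ theta = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (aug[i][n] - s) / aug[i][i]
    return theta

def fit(X, y, lam):
    Xt = transpose(X)
    a = matmul(Xt, X)                                       # t(X) %*% X
    for i, l in enumerate(lam):
        a[i][i] += l                                        # ... + diag(lambda)
    b = [sum(x * v for x, v in zip(row, y)) for row in Xt]  # t(X) %*% y
    return solve(a, b)                                      # solve(a, b)

# Toy data generated from y = 3 + 2x; the first column of X is the intercept term.
X = [[1.0, float(x)] for x in range(5)]
y = [3.0 + 2.0 * x for x in range(5)]

theta_plain = fit(X, y, [0.0, 0.0])    # no regularization: recovers intercept 3, slope 2
theta_ridge = fit(X, y, [0.0, 10.0])   # penalizing the slope shrinks it towards zero
print(theta_plain, theta_ridge)
```

The second call shows what the `diag(lambda)` term does: with a penalty of 10 on the slope only, the fitted slope drops from 2 to 1 and the intercept rises to compensate.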


Linear Regression - Execution

a = t(X) %*% X + diag(lambda);
b = t(X) %*% y;
theta = solve(a,b);

X: 500 features, 300M observations, 4 TB text file
y: 300M observations, 9 GB text file

Cluster Configuration
– 3.5 GB Map Task JVM
– 7 GB In-memory Master JVM
– 128 MB HDFS block size

[Diagram: X and y are split into blocks of 1k rows. MAP tasks compute XᵀX and yᵀX for each 1k block; a REDUCE step aggregates them into a and bᵀ. In-memory computation follows, since (a,b) < 2 MB: 1. get b, 2. call solve(a,b).]
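The map and reduce steps in that plan rest on one fact: XᵀX and Xᵀy over all rows equal the sums of the same products over disjoint row blocks. A minimal single-process sketch follows, with hypothetical function names and a block size of 4 instead of 1k. It computes Xᵀy rather than the yᵀX shown on the slide, which holds the same numbers transposed.

```python
from functools import reduce

def partial_products(block):
    """MAP: for one block of (x_row, y) pairs, return that block's (X^T X, X^T y)."""
    n = len(block[0][0])
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y in block:
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def add_partials(p, q):
    """REDUCE: element-wise sum of two partial results."""
    (a1, b1), (a2, b2) = p, q
    a = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    b = [u + v for u, v in zip(b1, b2)]
    return a, b

def blocks(rows, size):
    """Split the row list into consecutive blocks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Ten toy observations: features [1, i] and target 3 + 2i.
rows = [([1.0, float(i)], 3.0 + 2.0 * i) for i in range(10)]

a, b = reduce(add_partials, map(partial_products, blocks(rows, 4)))
a_full, b_full = partial_products(rows)   # same products computed in one pass
print(a == a_full and b == b_full)
```

Only the small aggregated pair (a, b) has to reach the master, which is why the final `solve(a,b)` can run in memory even when X itself is terabytes in size.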


Changes that impact our implementation

Change                           Attributes  Observations  Map Task JVM
• 3 times more attributes        1’500       300M          3.5 GB
• 2 times more observations      500         600M          3.5 GB
• The dataset fits in memory     100         1M            3.5 GB
• Cluster configuration change   500         300M          1.5 GB

All four configurations keep the 7 GB In-memory Master JVM and the 128 MB HDFS block size.

[Diagrams: one execution plan per scenario, each combining the steps XᵀX, Xᵀy and solve(a,b) differently; one of the plans is a single solve(XᵀX, Xᵀy).]


To Summarize

• 3 lines of code
• Minor changes in the data set / cluster configuration result in
– 4 dramatically different execution plans
– major change in performance
– best solution becomes a non-working solution

• How can we manage this?
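To see why small changes flip the plan, it helps to estimate matrix sizes against the available memory. The sketch below is a deliberately simplified, hypothetical rule (dense doubles, no overhead, one fixed budget) and is not how SystemML's optimizer actually decides. The matrix shapes and the 7 GB figure come from the preceding slides; the function names and plan labels are made up for illustration.

```python
BYTES_PER_DOUBLE = 8

def dense_size_bytes(rows, cols):
    """Size of a dense double-precision matrix, ignoring any object overhead."""
    return rows * cols * BYTES_PER_DOUBLE

def choose_plan(rows, cols, memory_budget_bytes):
    """Hypothetical rule: run in memory if X fits in the budget, otherwise distribute."""
    if dense_size_bytes(rows, cols) <= memory_budget_bytes:
        return "single-node, in-memory"
    return "distributed"

master_jvm = 7 * 1024 ** 3  # 7 GB in-memory master JVM, as on the slides

scenarios = {
    "baseline": (300_000_000, 500),
    "3 times more attributes": (300_000_000, 1_500),
    "2 times more observations": (600_000_000, 500),
    "the dataset fits in memory": (1_000_000, 100),
}
plans = {name: choose_plan(r, c, master_jvm) for name, (r, c) in scenarios.items()}

for name, plan in plans.items():
    print(f"{name}: {plan}")
```

A real optimizer would also have to weigh intermediate results such as XᵀX, the map task memory, and the block size, which is where a cost-based compiler comes in.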


What’s in the SystemML box

• High-Level Operations (HOPs): general representation of statements in the data analysis language
• Low-Level Operations (LOPs): general representation of operations in the runtime framework
• High-level language front-ends
• Multiple execution environments


Backend performance


Out-of-the-box algorithms

Category                     Description
Descriptive Statistics       Univariate, Bivariate, Stratified Bivariate
Classification               Logistic Regression, Multi-class SVM, Naïve Bayes, Decision Trees, Random Forest
Clustering                   k-Means
Regression                   Linear Regression (System of equations, SGD)
Generalised Linear Models    Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli
                             Links for all distributions: identity, log, sq. root, inverse, 1/μ^2
                             Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit
Stepwise                     Linear, GLM
Dimensionality Reduction     PCA
Matrix Factorization         ALS
Survival Models              Kaplan Meier, Cox
Predict                      Scoring
Transformation               Recoding, dummy coding, binning, scaling, missing value imputation
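The four Binomial/Bernoulli links named in the table are standard functions that map a mean μ in (0, 1) to the whole real line. As a quick reference, here they are in plain Python (3.8 or later for `statistics.NormalDist`); this shows the textbook definitions, not SystemML's own code.

```python
from math import log, pi, tan
from statistics import NormalDist

def logit(mu):
    """Log-odds: log(mu / (1 - mu))."""
    return log(mu / (1.0 - mu))

def probit(mu):
    """Inverse CDF of the standard normal distribution."""
    return NormalDist().inv_cdf(mu)

def cloglog(mu):
    """Complementary log-log: log(-log(1 - mu))."""
    return log(-log(1.0 - mu))

def cauchit(mu):
    """Inverse CDF of the standard Cauchy distribution."""
    return tan(pi * (mu - 0.5))

links = {"logit": logit, "probit": probit, "cloglog": cloglog, "cauchit": cauchit}
for name, link in links.items():
    print(name, [round(link(mu), 3) for mu in (0.1, 0.5, 0.9)])
```

Logit, probit and cauchit are symmetric around μ = 0.5, where they all return 0; cloglog is the asymmetric one of the four.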


Summary

• Key features
– Cost based compilation
– Out-of-the-box scalable machine learning algorithms
– Support for custom algorithms
  • Write your own code and don’t worry about scalability, numeric stability, and optimization

• Use it standalone, with MR backend, or with Spark backend
– Fit into Spark APIs, consume and produce DataFrames
– ML Pipeline integration
– Use SystemML from Scala, Java, Python, R/SparkR
– BigR integration (package)


Additional Resources

SystemML is available on GitHub
https://github.com/SparkTC/systemml

An in-depth scientific perspective
• Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." ICDE 2011.
• Boehm, Matthias, et al. "SystemML’s Optimizer: Plan Generation for Large-Scale Machine Learning Programs." IEEE Data Eng. Bull. 37.3 (2014).
• Huang, Botong, et al. "Resource Elasticity for Large-Scale Machine Learning." SIGMOD 2015.
