introduction to machine learning with h2o and python

108
Introduction to Machine Learning with H 2 O and Python Jo-fai (Joe) Chow Data Scientist [email protected] @matlabulous PyData Amsterdam Tutorial at GoDataDriven 7 th April, 2017

Upload: jo-fai-chow

Post on 16-Apr-2017

321 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Introduction to Machine Learning with H2O and Python

Introduction to Machine Learning with H2O and Python

Jo-fai (Joe) Chow

Data Scientist

[email protected]

@matlabulous

PyData Amsterdam Tutorial at GoDataDriven7th April, 2017

Page 2: Introduction to Machine Learning with H2O and Python

Slides and Code Examples:

bit.ly/joe_h2o_tutorials

2

Page 3: Introduction to Machine Learning with H2O and Python

About Me

• Civil (Water) Engineer• 2010 – 2015

• Consultant (UK)• Utilities

• Asset Management

• Constrained Optimization

• Industrial PhD (UK)• Infrastructure Design Optimization

• Machine Learning + Water Engineering

• Discovered H2O in 2014

• Data Scientist• 2015

• Virgin Media (UK)

• Domino Data Lab (Silicon Valley, US)

• 2016 – Present

• H2O.ai (Silicon Valley, US)

3

Page 4: Introduction to Machine Learning with H2O and Python

About Me

4

R + H2O + Domino for KaggleGuest Blog Post for Domino & H2O (2014)

• The Long Story• bit.ly/joe_kaggle_story

Page 5: Introduction to Machine Learning with H2O and Python

Agenda

• About H2O.ai• Company

• Machine Learning Platform

• Tutorial• H2O Python Module

• Download & Install

• Step-by-Step Examples:• Basic Data Import / Manipulation

• Regression & Classification (Basics)

• Regression & Classification (Advanced)

• Using H2O in the Cloud

5

Page 6: Introduction to Machine Learning with H2O and Python

Agenda

• About H2O.ai• Company

• Machine Learning Platform

• Tutorial• H2O Python Module

• Download & Install

• Step-by-Step Examples:• Basic Data Import / Manipulation

• Regression & Classification (Basics)

• Regression & Classification (Advanced)

• Using H2O in the Cloud

6

Background Information

For beginners

As if I am working onKaggle competitions

Short Break

Page 7: Introduction to Machine Learning with H2O and Python

About H2O.ai

7

Page 8: Introduction to Machine Learning with H2O and Python

Company OverviewFounded 2011 Venture-backed, debuted in 2012

Products • H2O Open Source In-Memory AI Prediction Engine• Sparkling Water• Steam

Mission Operationalize Data Science, and provide a platform for users to build beautiful data products

Team 70 employees• Distributed Systems Engineers doing Machine Learning• World-class visualization designers

Headquarters Mountain View, CA

8

Page 9: Introduction to Machine Learning with H2O and Python

9

Our Team

Joe

Kuba

Page 10: Introduction to Machine Learning with H2O and Python

Scientific Advisory Council

10

Page 11: Introduction to Machine Learning with H2O and Python

11

Page 12: Introduction to Machine Learning with H2O and Python

12

Joe (2015)

http://www.h2o.ai/gartner-magic-quadrant/

Page 13: Introduction to Machine Learning with H2O and Python

13

Check out our websiteh2o.ai

Page 14: Introduction to Machine Learning with H2O and Python

H2O Machine Learning Platform

14

Page 15: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

15

Page 16: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

16

Import Data from Multiple Sources

Page 17: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

17

Fast, Scalable & Distributed Compute Engine Written in

Java

Page 18: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

18

Fast, Scalable & Distributed Compute Engine Written in

Java

Page 19: Introduction to Machine Learning with H2O and Python

Supervised Learning

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes

Statistical Analysis

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasing refined approximations

Deep Neural Networks

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Algorithms OverviewUnsupervised Learning

• K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detect optimal k

Clustering

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables to independent components

• Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

19

Page 20: Introduction to Machine Learning with H2O and Python

H2O Deep Learning in Action

20

Page 21: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

21

Multiple Interfaces

Page 22: Introduction to Machine Learning with H2O and Python

H2O + Python

22

Page 23: Introduction to Machine Learning with H2O and Python

H2O + R

23

Page 24: Introduction to Machine Learning with H2O and Python

24

H2O Flow (Web) Interface

Page 25: Introduction to Machine Learning with H2O and Python

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

SQL

High Level Architecture

25

Export Standalone Models for Production

Page 26: Introduction to Machine Learning with H2O and Python

26

docs.h2o.ai

Page 27: Introduction to Machine Learning with H2O and Python

H2O + Python Tutorial

27

Page 28: Introduction to Machine Learning with H2O and Python

Learning Objectives

• Start and connect to a local H2O cluster from Python.

• Import data from Python data frames, local files or web.

• Perform basic data transformation and exploration.

• Train regression and classification models using various H2O machine learning algorithms.

• Evaluate models and make predictions.

• Improve performance by tuning and stacking.

• Connect to H2O cluster in the cloud.

28

Page 29: Introduction to Machine Learning with H2O and Python

29

Page 30: Introduction to Machine Learning with H2O and Python

Install H2Oh2o.ai -> Download -> Install in Python

30

Page 31: Introduction to Machine Learning with H2O and Python

31

Page 32: Introduction to Machine Learning with H2O and Python

Start and Connect to a Local H2O Clusterpy_01_data_in_h2o.ipynb

32

Page 33: Introduction to Machine Learning with H2O and Python

Local H2O Cluster

33

Import H2O module

Start a local H2O clusternthreads = -1 means

using ALL CPU resources

Page 34: Introduction to Machine Learning with H2O and Python

34

Page 35: Introduction to Machine Learning with H2O and Python

Importing Data into H2Opy_01_data_in_h2o.ipynb

35

Page 36: Introduction to Machine Learning with H2O and Python

36

Page 37: Introduction to Machine Learning with H2O and Python

37

Page 38: Introduction to Machine Learning with H2O and Python

38

Page 39: Introduction to Machine Learning with H2O and Python

Basic Data Transformation & Explorationpy_02_data_manipulation.ipynb

(see notebooks)

39

Page 40: Introduction to Machine Learning with H2O and Python

40

Page 41: Introduction to Machine Learning with H2O and Python

41

Page 42: Introduction to Machine Learning with H2O and Python

42

Page 43: Introduction to Machine Learning with H2O and Python

43

Page 44: Introduction to Machine Learning with H2O and Python

Regression Models (Basics)py_03a_regression_basics.ipynb

44

Page 45: Introduction to Machine Learning with H2O and Python

Supervised Learning

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes

Statistical Analysis

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasing refined approximations

Deep Neural Networks

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Algorithms OverviewUnsupervised Learning

• K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detect optimal k

Clustering

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables to independent components

• Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

45

Page 46: Introduction to Machine Learning with H2O and Python

46

docs.h2o.ai

Page 47: Introduction to Machine Learning with H2O and Python

47

Page 48: Introduction to Machine Learning with H2O and Python

48

Page 49: Introduction to Machine Learning with H2O and Python

49

Page 50: Introduction to Machine Learning with H2O and Python

50

Page 51: Introduction to Machine Learning with H2O and Python

51

Regression Performance – MSE

Page 52: Introduction to Machine Learning with H2O and Python

52

Page 53: Introduction to Machine Learning with H2O and Python

53

Page 54: Introduction to Machine Learning with H2O and Python

54

Page 55: Introduction to Machine Learning with H2O and Python

55

Page 56: Introduction to Machine Learning with H2O and Python

56

Page 57: Introduction to Machine Learning with H2O and Python

Classification Models (Basics)py_04_classification_basics.ipynb

57

Page 58: Introduction to Machine Learning with H2O and Python

58

Page 59: Introduction to Machine Learning with H2O and Python

59

Page 60: Introduction to Machine Learning with H2O and Python

60

Page 61: Introduction to Machine Learning with H2O and Python

61

Page 62: Introduction to Machine Learning with H2O and Python

Classification Performance – Confusion Matrix

62

Page 63: Introduction to Machine Learning with H2O and Python

Confusion Matrix

63

Page 64: Introduction to Machine Learning with H2O and Python

64

Page 65: Introduction to Machine Learning with H2O and Python

65

Page 66: Introduction to Machine Learning with H2O and Python

66

Page 67: Introduction to Machine Learning with H2O and Python

67

Page 68: Introduction to Machine Learning with H2O and Python

68

Page 69: Introduction to Machine Learning with H2O and Python

Regression Models (Tuning)py_03b_regression_grid_search.ipynb

69

Page 70: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

70

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 71: Introduction to Machine Learning with H2O and Python

71

Page 72: Introduction to Machine Learning with H2O and Python

72

Page 73: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

73

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 74: Introduction to Machine Learning with H2O and Python

74

Page 75: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

75

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 76: Introduction to Machine Learning with H2O and Python

Cross-Validation

76

Page 77: Introduction to Machine Learning with H2O and Python

77

Page 78: Introduction to Machine Learning with H2O and Python

78

Page 79: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

79

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 80: Introduction to Machine Learning with H2O and Python

Early Stopping

80

Page 81: Introduction to Machine Learning with H2O and Python

81

Page 82: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

82

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 83: Introduction to Machine Learning with H2O and Python

Grid Search

83

Combination Parameter 1 Parameter 2

1 0.7 0.7

2 0.7 0.8

3 0.7 0.9

4 0.8 0.7

5 0.8 0.8

6 0.8 0.9

7 0.9 0.7

8 0.9 0.8

9 0.9 0.9

Page 84: Introduction to Machine Learning with H2O and Python

84

Page 85: Introduction to Machine Learning with H2O and Python

85

Page 86: Introduction to Machine Learning with H2O and Python

86

Page 87: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

87

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 88: Introduction to Machine Learning with H2O and Python

88

Page 89: Introduction to Machine Learning with H2O and Python

89

Page 90: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

90

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 91: Introduction to Machine Learning with H2O and Python

Regression Models (Ensembles)py_03c_regression_ensembles.ipynb

91

Page 92: Introduction to Machine Learning with H2O and Python

92

https://github.com/h2oai/h2o-meetups/blob/master/2017_02_23_Metis_SF_Sacked_Ensembles_Deep_Water/stacked_ensembles_in_h2o_feb2017.pdf

Page 93: Introduction to Machine Learning with H2O and Python

93

Page 94: Introduction to Machine Learning with H2O and Python

94

Page 95: Introduction to Machine Learning with H2O and Python

95

Page 96: Introduction to Machine Learning with H2O and Python

96

Lowest MSE = Best Performance

Page 97: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

97

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 98: Introduction to Machine Learning with H2O and Python

Classification Models (Ensembles)py_04_classification_ensembles.ipynb

98

Page 99: Introduction to Machine Learning with H2O and Python

99

Highest AUC = Best Performance

Page 100: Introduction to Machine Learning with H2O and Python

H2O in the Cloudpy_05_h2o_in_the_cloud.ipynb

100

Page 101: Introduction to Machine Learning with H2O and Python

101

Page 102: Introduction to Machine Learning with H2O and Python

102

Page 103: Introduction to Machine Learning with H2O and Python

Recap

103

Page 104: Introduction to Machine Learning with H2O and Python

Learning Objectives

• Start and connect to a local H2O cluster from Python.

• Import data from Python data frames, local files or web.

• Perform basic data transformation and exploration.

• Train regression and classification models using various H2O machine learning algorithms.

• Evaluate models and make predictions.

• Improve performance by tuning and stacking.

• Connect to H2O cluster in the cloud.

104

Page 105: Introduction to Machine Learning with H2O and Python

Improving Model Performance (Step-by-Step)

105

Model Settings MSE (CV) MSE (Test)

GBM with default settings N/A 0.4551

GBM with manual settings N/A 0.4433

Manual settings + cross-validation 0.4502 0.4433

Manual + CV + early stopping 0.4429 0.4287

CV + early stopping + full grid search 0.4378 0.4196

CV + early stopping + random grid search 0.4227 0.4047

Stacking models from random grid search N/A 0.3969

Page 106: Introduction to Machine Learning with H2O and Python

106

Page 107: Introduction to Machine Learning with H2O and Python

107

H2O TutorialFriday – 4:00 pm

bit.ly/joe_h2o_tutorials

H2O Sparkling Water TalkSaturday – 3:15 pm

Page 108: Introduction to Machine Learning with H2O and Python

• Organizers & Sponsors

• Find us at PyData Conference• Live Demos

108

Thanks!

• Code, Slides & Documents• bit.ly/h2o_meetups• docs.h2o.ai

• Contact• [email protected]• @matlabulous• github.com/woobe

• Please search/ask questions on Stack Overflow• Use the tag `h2o` (not H2 zero)