machine learning landscape - github pages · vineet bansal research software engineer, center for...

36
The Machine Learning Landscape Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning [email protected] Oct 31, 2018

Upload: others

Post on 10-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

The Machine Learning Landscape

Vineet BansalResearch Software Engineer, Center for Statistics & Machine Learning

[email protected] 31, 2018

Page 2: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

“A field of study that gives computers the ability to learn without being explicitly programmed.”

A machine‐learning system is trained rather than explicitly programmed.

What is ML?

Page 3: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Supervised Learning ‐ Training Data contains desired solutions, or labels

Types of ML Systems

Features

Page 4: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Unsupervised Learning ‐ Training Data is unlabeled

Types of ML Systems

Page 5: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Reinforcement Learning ‐ Training Data does not contain target output, but instead contains somepossible output together with a measure of how goodthat output is.

<input>, <correct output>

<input>, <some output>, <grade for this output>

Types of ML Systems

Page 6: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Classification vs Regression

Page 7: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 8: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Unsupervised Learning - Clustering

Clustering

• Color “clusters” of points in a homogenous cloud of data.

• Use Cases

• Behavioral Segmentation in Marketing• Useful as a pre‐processing step before applying other classification 

algorithms.• Cluster ID could be added as feature for each data point.

Page 9: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Unsupervised Learning - Clusteringk‐means Algorithm

• Guess some cluster centers• Repeat until converged

• E‐Step: assign points to the nearest cluster center• M‐Step: set the cluster centers to the mean

Page 10: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Unsupervised Learning - ClusteringChoosing k

Page 11: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 12: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Linear Regression

Page 13: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Define a Hypothesishϴ(X) = 𝑦 X ϴ 

Define a Cost Function (a measure of how bad we’re doing)

MSE 𝑋, ℎϴ1𝑚 𝑦𝑖 𝑦𝑖 2

Repeat until convergence:• Calculate Cost Function on chosen ϴ• Calculate slope of Cost Function• Tweak ϴ so as to move downhill (reduce Cost Function value)

ϴ is now optimized for our training data.

X = 

1 𝑥11 𝑥2

1 … 𝑥𝑛1

1 𝑥12 𝑥2

2 … 𝑥𝑛2

⋮ ⋮ … … ⋮1 𝑥1

𝑚 𝑥2𝑚 … 𝑥𝑛

𝑚

𝑦1

𝑦2…

𝑦𝑚

ϴ = 

ϴ0ϴ1ϴ2⋮

ϴ𝑛

𝑦

𝑦1

𝑦2…

𝑦𝑚

Linear Regression

Page 14: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Used to estimate the probability that an instance belongs to a particular class. 

Logistic Regression

Page 15: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Logistic Regression

Page 16: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

No closed‐form solution, but we can use Gradient Descent!

Logistic Regression

Page 17: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 18: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Overfitting and Underfitting

Page 19: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Bias-Variance Tradeoff

Page 20: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

How to ensure that we’re not overfitting to our training data?

Impose a small penalty on model complexity.

l1 penalty (Lasso Regression) l2 penalty (Ridge Regression)

Regularization

Page 21: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Testing and Validation

Page 22: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

K-fold Cross Validation

Page 23: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Decision TreeBasic Idea

• Construct a tree to ask a series of questions from your data.

Page 24: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Decision TreeLet’s see how it works on a real dataset.

Page 25: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Decision TreeHow is the tree built?

• Define a Cost Function that measures the impurity of a node.• A node is pure (impurity = 0) if all training instances it applies to belong to the same class.• One possible impurity measure is Gini:

• Search for a feature and threshold that minimizes our Cost Function.• Gini scores of subsets thus produced are weighted by their size.

• Greedy Algorithm – may not produce the optimum tree.

CART Algorithm. ID3 Algorithm for non‐binary trees.

Page 26: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Decision TreeDecision Trees can be used for regression!

• Minimize MSE instead of impurity.

Page 27: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Decision Tree

Advantages

• White Box – easily interpretable

Disadvantages

• Prone to overfitting• Regularize by setting maximum depth

• Comes up only with orthogonal boundaries• Sensitive to training set rotation – Use PCA!

Page 28: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 29: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Ensemble MethodsBasic Idea

• Two Decision Trees by themselves may overfit. But combining their predictions may be a good idea!

Page 30: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Ensemble MethodsBasic Idea

• Two Decision Trees by themselves may overfit. But combining their predictions may be a good idea!

Page 31: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

BaggingBagging = Bootstrap Aggregation

• Use the same training algorithm for every predictor, but train them on different random subsets of the training set.

Random Forest is an Ensemble of Decision Trees, generally trained via the baggingmethod.

Page 32: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

BoostingBasic Idea

• Train several weak learners sequentially, each trying to correct the errors made by its predecessor.

Adaptive Boosting (ADABoost)• Give more relative weight to the misclassified instances.

Page 33: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

BoostingGradient Boosting

• Try to fit a new predictor to the residual errorsmade by the previous predictor.

Best Performance

• Random Forests and Gradient Boosting Methods (implemented in the xgBoost library) have been winning most competitions on Kaggle recently on structured data.

• Deep Learning (especially Convolutional Networks) is the clear winner for unstructured data problems (perception/speech/vision etc.)

Page 34: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 35: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

ML Landscape

Page 36: Machine Learning Landscape - GitHub Pages · Vineet Bansal Research Software Engineer, Center for Statistics & Machine Learning vineetb@princeton.edu Oct 31, 2018 ... • Cluster

Where to go from here?