LSML19: Introduction to
Large Scale Machine Learning
MINES ParisTech — March 2019
Chloé-Agathe Azencott, Centre for Computational Biology, MINES ParisTech
Practical information
● Course website: http://cazencott.info > Teaching > LSML 19
● Schedule:
  – Morning sessions: 09:30 – 12:30
  – Afternoon sessions: 14:00 – 17:00
● Lectures are here in L.118
● Practicals are in L.117, L.119, L.120
● Grade:
  – 60% Exam (Friday, 14:00)
  – 40% RAMP
Today's goals
● Review what machine learning is
● Understand scalability issues
● Discover how to accelerate gradient descent with stochastic gradient descent
Acknowledgements
Slides inspired by Ala Al-Fuqaha, Ethem Alpaydın, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert
Why machine learning?
Business Insider: 2017 is the year of machine learning
● Improving doctors
● Assisting lawyers
● Making cars drive themselves
Perception
Communication
Chatbot example from www.supportbots.ai
Text courtesy of Eddie Izzard's 1999 show Dressed to Kill.
Reasoning
Diagnosis
Scientific discovery
LHC image: Anna Pantelia/CERN
A common thread: ML
● Using algorithms to build models from example (training) data.
● Statistics + optimization + computer science
Empirical risk minimization
● Ingredients:
  – Data
  – Hypothesis class: the shape of the decision function f
  – Loss function: the cost/error of f on one data point
● Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (the empirical risk), as written out below.
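In symbols (a standard formulation, not spelled out on the slide): given training pairs (x_i, y_i), i = 1, ..., n, a hypothesis class F, and a loss L,

    \hat{f} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)

The quantity being minimized is the empirical risk of f.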
(Un)supervised learning setting
● Data matrix (design matrix) X: n observations (samples, data points) × p features (variables, descriptors, attributes).
● Outcome (target, label) y: the supervision, over t tasks.
● Binary classification: y ∈ {0, 1}. Multi-class classification: y ∈ {1, ..., K}. Regression: y ∈ ℝ.
● Orders of magnitude:
  – Iris dataset: n=150, p=4, t=1.
  – Cancer drug sensitivity: n=10^3, p=10^6, t=100.
  – ImageNet: n=14·10^6, p=6·10^3, t=22·10^3.
  – Shopping, e-marketing...: n=O(10^6), p=O(10^9), t=O(10^8).
  – Astronomy, GAFAs, web...: n=O(10^9), p=O(10^9), t=O(10^9).
Scaling ML algorithms
● What is large scale?
  – Data does not fit in RAM;
  – Data streams;
  – Algorithms do not run in a reasonable time on a single machine.
● Important considerations:
  – Performance increases with the number of samples.
  – The likelihood of overfitting increases with the number of features.
A brief zoo of ML problems
Unsupervised learning
Data (images, text, measurements, omics data...) → ML algo → Data!
Learn a new representation of the data.
(Diagram: the data is an n × p matrix X.)
Dimensionality reduction
Find a lower-dimensional representation.
(Diagram: the n × p data matrix X, built from images, text, measurements, omics data..., is mapped by the ML algo to an n × m matrix, with m < p.)
Clustering
Group similar data points together.
(Diagram: data → ML algo → groups of similar points.)
Unsupervised learning
● Dimensionality reduction (e.g., PCA)
● Clustering (e.g., k-means)
● Density estimation
● Feature learning
Supervised learning
Make predictions.
(Diagram: data X (n × p) and labels y (length n) → ML algo → predictor, i.e., a decision function.)
Classification
Make discrete predictions: binary classification, multi-class classification.
(Diagram: data and labels → ML algo → predictor.)
Regression
Make continuous predictions.
(Diagram: data and labels → ML algo → predictor.)
Supervised learning
● Regression: ordinary least squares, ridge regression
● Classification: logistic regression, SVM
● Structured output prediction
Main ML paradigms
● Unsupervised learning:
  – Dimensionality reduction;
  – Clustering;
  – Density estimation;
  – Feature learning.
● Supervised learning:
  – Regression;
  – Classification;
  – Structured output prediction.
● Semi-supervised learning.
● Reinforcement learning.
Dimensionality reduction: Principal Components Analysis
Principal Components Analysis
● Objective:
  – Reduce the dimension without losing the variability in the data;
  – Find a low-dimensional space so as to maximize the variance of the data projected onto that space.
● The k-th principal component:
  – is orthogonal to all previous components;
  – captures the largest amount of remaining variance;
  – solution: w is the k-th eigenvector of the covariance matrix XᵀX.
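To make this concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix (function and variable names are mine, not from the course):

    import numpy as np

    def pca(X, n_components):
        """Minimal PCA sketch: project X (n x p) onto its top principal components."""
        X_centered = X - X.mean(axis=0)           # center each feature
        cov = X_centered.T @ X_centered / len(X)  # p x p covariance matrix: O(np^2)
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
        W = eigvecs[:, ::-1][:, :n_components]    # principal axes, by decreasing variance
        return X_centered @ W                     # n x n_components projection

    # Toy usage on iris-sized data (n=150, p=4):
    X = np.random.randn(150, 4)
    X_2d = pca(X, n_components=2)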
PCA example: Population genetics
Novembre et al., 2008
Genetic data of 1387 Europeans
Algorithmic complexity of PCA
● Memory: store the data X (n × p) and the covariance matrix XᵀX (p × p): O(np + p^2).
● Runtime:
  – Computing XᵀX: O(np^2).
  – Computing the first K eigenvectors by Lanczos iterations: O(Kp^2).
  – Computing the covariance matrix is more expensive than computing the first K principal components!
● Example, n=10^9, p=10^8:
  – Computing the covariance matrix: 10^25 FLOPs. Fastest world computer (Nov 2018): 75 petaFLOPS → 2+ years.
  – Storing the covariance matrix: 10^16 B = (10^16 / 2^50) PB ≈ 8 PB.
Clustering: k-means
k-means clustering
● Goal:
  – Find a cluster assignment c: {1, ..., n} → {1, ..., K}
  – that minimizes the intra-cluster variance: Σ_k Σ_{i: c(i)=k} ‖x_i − μ_k‖^2,
  – where the centroids μ_k are the means of the points assigned to cluster k.
● The resulting clusters form a Voronoi tessellation of the input space.
● The problem is NP-hard → iterative algorithm (see the sketch below):
  – Assignment step: fix the centroids, update the assignments.
  – Update step: update the centroids.
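A minimal NumPy sketch of that iterative algorithm (Lloyd's algorithm; names and defaults are mine):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Minimal k-means sketch: alternate assignment and update steps."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random data points
        for _ in range(n_iter):
            # Assignment step: fix the centroids, update the assignments.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)          # nearest centroid (n x K distances)
            # Update step: each centroid becomes the mean of its cluster
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):  # converged
                break
            centroids = new_centroids
        return labels, centroids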
k-means clustering, step by step:
1. Pick 3 centroids at random.
2. Assign each observation to the nearest centroid.
3. Recompute the centroids.
4. Re-assign each observation to the nearest centroid.
5. Recompute the centroids, and iterate the process until convergence.
k-means complexity
– Assignment step: fix the centroids, update the assignments: compute n × K distances in p dimensions → O(nKp).
– Update step: update the centroids: sum n values in p dimensions → O(np).
– T iterations → O(nKpT) overall.
– Storage: store n cluster assignments + K centroids → O(n + Kp); store X → O(np).
Ridge regression
Linear regression
Least-squares fit (equivalent to the MLE under the assumption of Gaussian noise): min_β ‖y − Xβ‖^2.
Solution β̂ = (XᵀX)⁻¹ Xᵀy, uniquely defined when XᵀX is invertible.
Ridge regression
● Hoerl & Kennard, 1970.
● Goodness-of-fit + ridge regularization: min_β ‖y − Xβ‖^2 + λ‖β‖^2 (empirical risk + ridge regularization).
● The solution is unique and always exists: β̂ = (XᵀX + λI)⁻¹ Xᵀy.
● Limit cases:
  – λ → 0: OLS (non-regularized) solution (low bias, high variance);
  – λ → ∞: β = 0 (high bias, low variance).
● Correlated features get similar weights.
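A minimal NumPy sketch of this closed-form solution, solving the linear system rather than forming the inverse explicitly (names are mine):

    import numpy as np

    def ridge(X, y, lam):
        """Minimal ridge sketch: solve (X^T X + lam I) beta = X^T y."""
        p = X.shape[1]
        # np.linalg.solve is preferred to computing the inverse explicitly.
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Toy usage:
    X = np.random.randn(100, 5)
    y = X @ np.arange(5.0) + 0.1 * np.random.randn(100)
    beta_hat = ridge(X, y, lam=1.0)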
Complexity of ridge regression
● Computing XᵀX + λI: O(np^2).
● Inverting it: O(p^3).
● When n ≫ p, computing XᵀX is more expensive than inverting it!
Regularization path
(Figure: regression weights and error (MSE) as functions of the regularization strength.)
Ridge regression: hyperparameter setting
Setting λ
(Figure: prediction error vs. model complexity. The error on training data keeps decreasing with complexity; the error on new data first decreases (underfitting) then increases again (overfitting).)
Setting λ
● Data splitting strategy: cross-validation, over a grid of values λ_1, λ_2, ..., λ_M:
  – Cut the training set into K equally-sized chunks.
  – K folds: one chunk for validation, the (K−1) others for training.
  – Cross-validation score: performance averaged over the K folds.
  – Choose the λ with the best cross-validation score.
● Multiplies the time complexity by KM. (See the sketch below.)
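With scikit-learn, this grid search can be sketched as follows (λ is called alpha there; the grid and the toy data are hypothetical choices of mine):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = np.random.randn(200, 10), np.random.randn(200)  # toy data
    lambdas = np.logspace(-3, 3, 7)                         # grid lambda_1, ..., lambda_M

    # One K-fold cross-validation score per candidate lambda (K=5 here).
    scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5).mean() for lam in lambdas]
    best_lam = lambdas[int(np.argmax(scores))]              # best score wins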
Ridge regression: ℓ2-regularized learning
ℓ2-regularized learning
● Generalization of ridge regression to any loss: min_β (1/n) Σ_i L(f(x_i), y_i) + λ‖β‖^2 (empirical risk + ℓ2 regularization).
● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● Losses: quadratic loss; absolute loss; ε-insensitive loss; Huber loss (a mix of quadratic and linear), written out below.
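In standard form (the slides show these as plots; ε and δ are the losses' usual parameters), writing r = y − f(x):

    L_{\text{quadratic}} = r^2
    L_{\text{absolute}} = |r|
    L_{\varepsilon\text{-insensitive}} = \max(0, |r| - \varepsilon)
    L_{\text{Huber}} = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta\,(|r| - \delta/2) & \text{otherwise} \end{cases}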
Gradient descent
Gradient descent
● Suppose the function J to minimize is differentiable. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● First-order Taylor expansion of J at u: J(v) ≈ J(u) + J′(u)·(v − u).
(Figure: the points (u, J(u)) and (v, J(v)) on the curve, and (v, J(u) + J′(u)·(v − u)) on the tangent.)
Gradient descent
● Minimize a differentiable, strictly convex function J by finding where its gradient is 0:
  – Set a_0 randomly.
  – Update: a_{k+1} = a_k − α ∇J(a_k).
  – Repeat.
  – Stop when |∇J(a_k)| < ε.
(Figure: one step from a_0 to a_1 on the curve J(a), at a point where J′(a_0) < 0. A sketch of this loop follows.)
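A minimal NumPy sketch of the loop (the step size α, tolerance ε, and iteration cap are hypothetical defaults of mine):

    import numpy as np

    def gradient_descent(grad, a0, alpha=0.1, eps=1e-6, max_iter=10_000):
        """Minimal gradient descent sketch: a <- a - alpha * grad(a)."""
        a = a0
        for _ in range(max_iter):
            g = grad(a)
            if np.linalg.norm(g) < eps:   # stop when |grad J(a)| < eps
                break
            a = a - alpha * g
        return a

    # Toy usage: J(a) = ||a - 3||^2 has gradient 2(a - 3) and minimum at a = (3, 3).
    a_star = gradient_descent(lambda a: 2 * (a - 3.0), a0=np.zeros(2))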
Classification
Logistic regression
● Model y as a linear function of x? Not suitable for binary labels (e.g., CAT vs. NON-CAT).
● Instead, model the log-odds ratio as a linear function of x:
  log[p(y=1|x) / (1 − p(y=1|x))] = β_0 + βᵀx, i.e., p(y=1|x) = 1 / (1 + e^{−(β_0 + βᵀx)}).
Ridge logistic regression
● Le Cessie and van Houwelingen, 1992.
● Goodness-of-fit + ridge regularization: minimize the empirical risk under the logistic loss, plus λ‖β‖^2.
● Logistic loss: the negative conditional log-likelihood, −[y log p(x) + (1 − y) log(1 − p(x))].
● No explicit solution.
● Smooth convex optimization problem that can be solved numerically.
Newton-Raphson iterations
● Minimize J, convex and differentiable.
● Gradient descent: a_{k+1} = a_k − α ∇J(a_k).
● Suppose J is twice differentiable.
  – Second-order Taylor expansion: J(v) ≈ J(u) + ∇J(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²J(u) (v − u).
  – Minimize the expansion in v instead of evaluating at u: v = u − [∇²J(u)]⁻¹ ∇J(u).
  – Hence use a_{k+1} = a_k − [∇²J(a_k)]⁻¹ ∇J(a_k), as sketched below.
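A minimal NumPy sketch of these iterations (the gradient and Hessian are passed in as functions; names and defaults are mine):

    import numpy as np

    def newton(grad, hess, a0, eps=1e-8, max_iter=50):
        """Minimal Newton-Raphson sketch: a <- a - [Hessian]^{-1} gradient."""
        a = a0
        for _ in range(max_iter):
            g = grad(a)
            if np.linalg.norm(g) < eps:
                break
            a = a - np.linalg.solve(hess(a), g)  # solve H d = g rather than inverting H
        return a

    # Toy usage: for a quadratic J(a) = 0.5 a^T Q a - b^T a, one step suffices.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    a_star = newton(lambda a: Q @ a - b, lambda a: Q, a0=np.zeros(2))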
Solving the logistic regression
● Can be solved with Newton-Raphson iterations.
● Each step is equivalent to solving a weighted ridge regression.
● → IRLS: Iteratively Reweighted Least Squares.
● Complexity: O(np^2) per iteration × the number of iterations.
Large margin classifiers
● Margin: yf(x), for labels y ∈ {−1, +1}:
  – classify x as positive if f(x) > 0 and negative otherwise;
  – ideally, yf(x) > 0.
● Large margin classifier: maximize yf(x), i.e., minimize Σ_i φ(y_i f(x_i)) for a convex, non-increasing function φ.
● Logistic regression: φ(u) = log(1 + e^{−u}).
Large margin classifiers
● Linear Support Vector Machine (SVM): φ(u) = max(0, 1 − u), the hinge loss.
(Figure: the hinge loss, zero for u ≥ 1, decreasing linearly for u < 1.)
Linear SVM
● Non-smooth, convex optimization problem; a quadratic program.
● Equivalent to a dual problem in n variables, which involves the samples only through their dot products x_iᵀx_j.
● Complexity (training):
  – Storing the n × n matrix of dot products: O(n^2).
  – Optimization: O(n^3).
● Complexity (prediction):
  – Primal: O(p).
  – Dual: O(np).
Kernel methods
Motivation
Non-linear mapping to a feature space
(Figure: data in ℝ, not linearly separable, becomes linearly separable after a non-linear mapping to ℝ².)
SVM in the feature space
● Train: solve the dual problem, with the dot products φ(x_i)ᵀφ(x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i φ(x_i)ᵀφ(x) + b.
Kernel SVM
● Train: solve the dual problem, with the kernel values k(x_i, x_j) = φ(x_i)ᵀφ(x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i k(x_i, x) + b.
Kernel trick
● k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
● For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H.
● Hence you can define mappings implicitly.
● Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels, in such a way that they can be applied in the initial space without ever computing the mapping.
Polynomial kernels
k(x, x′) = (xᵀx′ + 1)^d. More generally, for any d ≥ 1, (xᵀx′ + 1)^d is an inner product in a feature space of all monomials of degree up to d.
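For instance, in one dimension with d = 2, the claim can be checked by direct expansion:

    (xx' + 1)^2 = x^2 x'^2 + 2xx' + 1 = \langle (x^2, \sqrt{2}\,x, 1), (x'^2, \sqrt{2}\,x', 1) \rangle

so φ(x) = (x², √2·x, 1) collects all monomials of degree up to 2.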
Gaussian kernel
k(x, x′) = exp(−‖x − x′‖^2 / (2σ^2)).
The corresponding feature space has infinite dimension.
Kernel ridge regression
● Ridge regression in input space: β̂ = (XᵀX + λI)⁻¹ Xᵀy, inverting a p × p matrix.
● In a feature space of dimension d: invert a d × d matrix (possibly huge, or infinite).
● Ridge regression in sample space: α̂ = (K + λI)⁻¹ y, with K the n × n kernel matrix, inverting an n × n matrix.
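A minimal NumPy sketch of kernel ridge regression with a Gaussian kernel (σ, λ, and all names are hypothetical choices of mine):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2 * sigma ** 2))

    def krr_fit(X, y, lam=1.0, sigma=1.0):
        """Solve (K + lam I) alpha = y in sample space."""
        K = gaussian_kernel(X, X, sigma)                     # n x n kernel matrix
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    def krr_predict(X_train, alpha, X_new, sigma=1.0):
        """f(x) = sum_i alpha_i k(x, x_i)."""
        kappa = gaussian_kernel(X_new, X_train, sigma)       # kernels against training set
        return kappa @ alpha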
Complexity of KRR
● Computing K: O(n^2 p).
● Storing K: O(n^2).
● Inverting K + λI: O(n^3).
● Computing a prediction for one sample:
  – computing κ = (k(x, x_1), ..., k(x, x_n)): O(np);
  – computing the product κᵀα: O(n).
Algorithmic complexity recap
Summary

Method               | Memory | Train time | Predict time
PCA                  | O(p^2) | O(np^2)    | O(p)
k-means              | O(np)  | O(Knp)     | O(Kp)
Ridge regression     | O(p^2) | O(np^2)    | O(p)
Logistic regression  | O(np)  | O(np^2)    | O(p)
SVM, kernel methods  | O(np)  | O(n^3)     | O(np)

● Training can take place offline...
● ... unless data is streaming.
● Prediction should be fast!
Techniques for large-scale ML
● Use the deep learning tricks.
  → Deep learning (March 26, F. Moutarde).
  → Natural Language Processing (March 27, E. Grave).
● Distribute data & computation on modern architectures.
  → Systems for large-scale ML (March 28, C.-A. Azencott).
● Trade optimization accuracy for speed.
Gradient descent: convergence rates
Flavours of convexity
● J differentiable is convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u).
● J is L-smooth, or L-Lipschitz gradient, iff it is twice differentiable and for all u, every eigenvalue of ∇²J(u) is at most L.
  J doesn't vary "sharply".
● J is m-strongly convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u) + (m/2)‖v − u‖^2.
  J is "not too flat": lower-bounded by a quadratic.
Gradient descent convergence rates
● Gradient descent with fixed step size:
  – J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
  – J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)), with κ = L/m the condition number (L and m are usually unknown...).
● Newton-Raphson iterations:
  – J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np^2 + p^3) log log(1/ε)).
Stochastic gradient descent
● Gradient descent: a_{k+1} = a_k − α (1/n) Σ_i ∇J_i(a_k), where J_i is the loss on sample i.
● Stochastic gradient descent: a_{k+1} = a_k − α_k ∇J_l(a_k).
  Sampling with replacement: choose l uniformly at random in {1, 2, ..., n}.
  – J convex, L-smooth: converges in O(κ/ε^2) iterations → O(κp/ε^2).
  – J m-strongly convex, L-smooth: converges in O(κ/ε) iterations → O(κp/ε).
● We got rid of n! (See the sketch below.)
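A minimal NumPy sketch of SGD on ridge regression (the decreasing step size α_k = 1/k is a common choice for strongly convex objectives; names and defaults are mine):

    import numpy as np

    def sgd_ridge(X, y, lam=0.1, n_iter=10_000, seed=0):
        """Minimal SGD sketch for J(beta) = (1/n) sum_i (y_i - x_i beta)^2 + lam ||beta||^2."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        beta = np.zeros(p)
        for k in range(1, n_iter + 1):
            l = rng.integers(n)                  # pick one sample, with replacement
            grad_l = 2 * (X[l] @ beta - y[l]) * X[l] + 2 * lam * beta
            beta -= grad_l / k                   # decreasing step size alpha_k = 1/k
            # Each iteration costs O(p): the per-step cost no longer depends on n.
        return beta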
Convex optimization
● A number of ML algorithms can be formulated as convex optimization problems.
● They can be solved numerically thanks to variants of gradient descent.
● Trading optimization accuracy for speed:
  – Necessary when reaching high accuracy is too resource-consuming.
  – Does not necessarily have a large impact on predictive performance: test error matters more than training error.