LSML19: Introduction to Large Scale Machine Learning — MINES ParisTech, March 2019. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech. [email protected]



LSML19: Introduction to

Large Scale Machine Learning

MINES ParisTech — March 2019

Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech

[email protected]


Practical information
● Course website: http://cazencott.info > Teaching > LSML 19
● Schedule:
  – Morning sessions: 09:30 – 12:30
  – Afternoon sessions: 14:00 – 17:00
● Lectures are here in L.118
● Practicals are in L.117, L.119, L.120
● Grade:
  – 60% Exam (Friday, 14:00)
  – 40% RAMP


Today's goals
● Review what machine learning is
● Understand scalability issues
● Discover how to accelerate gradient descent with stochastic gradient descent


Acknowledgements

Slides inspired by Ala Al-Fuqaha, Ethem Alpaydi, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert


5

Why machine learning?


6

Business Insider: 2017 is the year of machine learning
● Improving doctors
● Assisting lawyers
● Making cars drive themselves


7

Perception


8

Communication

Chatbot example from www.supportbots.ai
Text courtesy of an Eddie Izzard show from 1999 called Dressed to Kill.


9

Reasoning


10

Diagnosis


11

Scientific discovery

LHC image: Anna Pantelia/CERN


12

A common thread: ML

● Using algorithms to build models from example (training) data.

● Statistics + optimization + computer science


13

Empirical risk minimization
● Ingredients:
  – Data
  – Hypothesis class: the shape of the decision function f
  – Loss function: the cost/error of f on one data point
● Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (empirical risk).
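In symbols (a generic formulation, with notation that is not taken from the slides): given training data (x_1, y_1), ..., (x_n, y_n), a hypothesis class F and a loss L, the recipe reads

    \hat{f} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)
    % F is the hypothesis class, L the loss, (x_i, y_i) the n training points.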


14

(Un)supervised learning setting

[Diagram: data matrix X (design matrix), with n rows (observations / samples / data points) and p columns (features / variables / descriptors / attributes); the outcome / target / label y, with t columns, provides the supervision. Tasks: binary classification, multi-class classification, regression.]

● Iris dataset: n=150, p=4, t=1.
● Cancer drug sensitivity: n=10³, p=10⁶, t=100.
● ImageNet: n=14×10⁶, p=6×10³, t=22×10³.
● Shopping, e-marketing...: n=O(10⁶), p=O(10⁹), t=O(10⁸).
● Astronomy, GAFAs, web...: n=O(10⁹), p=O(10⁹), t=O(10⁹).


15

● Iris dataset: n=150, p=4, t=1.
● Cancer drug sensitivity: n=10³, p=10⁶, t=100.
● ImageNet: n=14×10⁶, p=6×10³, t=22×10³.
● Shopping, e-marketing...: n=O(10⁶), p=O(10⁹), t=O(10⁸).
● Astronomy, GAFAs, web...: n=O(10⁹), p=O(10⁹), t=O(10⁹).

Scaling ML algorithms
● What is large scale?
  – Data does not fit in RAM;
  – Data streams;
  – Algorithms do not run in a reasonable time on a single machine.
● Important considerations:
  – Performance increases with the number of samples.
  – Likelihood to overfit increases with the number of features.


16

A brief zoo of ML problems


17

Unsupervised learning

[Diagram: data (images, text, measurements, omics data...) gathered in a matrix X with n rows and p columns → ML algo → data again: learn a new representation of the data.]


18

Dimensionality reduction

[Diagram: data (images, text, measurements, omics data...) in a matrix X with n rows and p columns → ML algo → a lower-dimensional representation, a matrix with n rows and m columns (m < p).]


19

Clustering

[Diagram: data → ML algo → clusters: group similar data points together.]


20

Unsupervised learning
● Dimensionality reduction, e.g. PCA
● Clustering, e.g. k-means
● Density estimation
● Feature learning


21

Supervised learning

[Diagram: data (matrix X, n × p) and labels (y, n rows) → ML algo → a predictor (decision function), used to make predictions.]


22

Classification: make discrete predictions (binary classification, multi-class classification).

[Diagram: data + labels → ML algo → predictor.]


23

Regression: make continuous predictions.

[Diagram: data + labels → ML algo → predictor.]


24

Supervised learning
● Regression: ordinary least squares, ridge regression
● Classification: logistic regression, SVM
● Structured output prediction


25

Main ML paradigms
● Unsupervised learning:
  – Dimensionality reduction;
  – Clustering;
  – Density estimation;
  – Feature learning.
● Supervised learning:
  – Regression;
  – Classification;
  – Structured output prediction.
● Semi-supervised learning.
● Reinforcement learning.


26

Dimensionality reduction: Principal Components Analysis


27

Principal Components Analysis
● Objective:
  – Reduce the dimension without losing the variability in the data;
  – Find a low-dimensional space so as to maximize the variance of the data projected onto that space.
● The k-th principal component:
  – Is orthogonal to all previous components;
  – Captures the largest amount of variance.
  – Solution: w is the k-th eigenvector of the covariance matrix XᵀX (ordered by decreasing eigenvalue).
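A minimal numpy sketch of this procedure, for illustration only (the function and variable names, and the choice of a full eigendecomposition rather than Lanczos iterations, are assumptions, not the slides' implementation):

    import numpy as np

    def pca(X, n_components):
        """Project X (n samples x p features) onto its first principal components."""
        Xc = X - X.mean(axis=0)               # center the data
        cov = Xc.T @ Xc / Xc.shape[0]         # p x p covariance matrix
        eigval, eigvec = np.linalg.eigh(cov)  # eigendecomposition (ascending eigenvalues)
        order = np.argsort(eigval)[::-1]      # sort directions by decreasing variance
        W = eigvec[:, order[:n_components]]   # p x k projection matrix
        return Xc @ W                         # projected data, n x k

    X = np.random.randn(150, 4)               # e.g. Iris-sized data
    X_reduced = pca(X, n_components=2)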


28

PCA example: Population genetics

Novembre et al, 2008

Genetic data of 1387 Europeans


29

Algorithmic complexity of PCA
● Memory: store the data X (n × p) and the covariance matrix XᵀX (p × p): O(np + p²).
● Runtime:
  – Computing XᵀX: O(np²).
  – Computing K eigenvectors by Lanczos iterations: O(Kp²) (one p × p matrix–vector product per iteration).
  – Computing the covariance matrix is more expensive than computing the K first principal components!
● Example n=10⁹, p=10⁸:
  – Computing the covariance matrix: 10²⁵ FLOPs. Fastest world computer (Nov 2018): 75 petaflops → 2+ years.
  – Storing the covariance matrix: 10¹⁶ B = (10¹⁶/2⁵⁰) PB ≈ 8 PB.


30

Clustering: k-means


31

K-means clustering
● Goal:
  – Find a cluster assignment
  – that minimizes the intra-cluster variance: Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²
  – centroids: μₖ = (1/|Cₖ|) Σ_{xᵢ ∈ Cₖ} xᵢ


32

K-means clustering
● Goal:
  – Find a cluster assignment
  – that minimizes the intra-cluster variance: Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²
  – centroids: μₖ = (1/|Cₖ|) Σ_{xᵢ ∈ Cₖ} xᵢ
● Voronoi tessellation: each centroid defines the cell of points closer to it than to any other centroid.


33

K-means clustering
● Goal:
  – Find a cluster assignment
  – that minimizes the intra-cluster variance: Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖²
  – centroids: μₖ = (1/|Cₖ|) Σ_{xᵢ ∈ Cₖ} xᵢ
● NP-hard → iterative algorithm (see the sketch below):
  – Assignment step: fix the centroids, update the assignments.
  – Update step: update the centroids.
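A minimal numpy sketch of these two alternating steps (Lloyd's algorithm); the function names, convergence test and random initialization are illustrative assumptions, not the slides' own code:

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Alternate assignment and update steps until the centroids stop moving."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # K data points at random
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest centroid (n x K distances).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.random.randn(300, 2)
    labels, centroids = kmeans(X, K=3)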


34

K-means clustering


35

K-means clustering: pick 3 centroids at random.


36

K-means clustering: assign each observation to the nearest centroid.


37

K-means clustering: recompute centroids.


38

K-means clustering: re-assign each observation to the nearest centroid.


39

K-means clustering: recompute centroids, and iterate the process until convergence.


40

k-means complexity
– Assignment step: fix the centroids, update the assignments:
  compute n × K distances in p dimensions → O(nKp).
– Update step: update the centroids:
  sum n values in p dimensions → O(np).
  T iterations → O(TnKp) in total.
– Storage:
  store n cluster assignments + K centroids → O(n + Kp);
  store X → O(np).


41

Ridge regression


42

Linear regression

Least-squares fit (equivalent to MLE under the assumption of Gaussian noise):
  minimize ‖y − Xβ‖² over β, with solution β̂ = (XᵀX)⁻¹ Xᵀ y.

Solution uniquely defined when XᵀX is invertible.


43

Ridge regression
● Hoerl & Kennard 1970
● Goodness-of-fit + ridge regularization:
  minimize ‖y − Xβ‖² (empirical risk) + λ‖β‖² (ridge regularization),
  with solution β̂ = (XᵀX + λI)⁻¹ Xᵀ y.
● Solution unique and always exists.
● Limit cases:
  – λ → 0 : OLS (non-regularized) solution (low bias, high variance);
  – λ → ∞ : β = 0 (high bias, low variance).
● Correlated features get similar weights.
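A minimal numpy sketch of this closed-form solution (illustrative; the function and variable names, and the synthetic data, are assumptions, not taken from the slides):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    beta_true = rng.standard_normal(20)
    y = X @ beta_true + 0.1 * rng.standard_normal(1000)
    beta_hat = ridge_fit(X, y, lam=1.0)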


44

Complexity of ridge regression
● Computing XᵀX: O(np²).
● Inverting XᵀX + λI: O(p³).
● When n >> p, computing XᵀX is more expensive than inverting it!


45

Regularization path

[Figure: regression weights and error (MSE) as a function of the regularization strength.]


46

Ridge regression: hyperparameter setting


47

Setting λ

[Figure: prediction error as a function of model complexity; the error on training data keeps decreasing, while the error on new data goes back up — underfitting on the left, overfitting on the right.]


48

Setting λ
● Data splitting strategy: cross-validation:
  – Cut the training set in K equally-sized chunks.
  – K folds: one chunk to test, the (K−1) others for training.

[Diagram: K folds; each chunk is used once for validation and (K−1) times for training.]

  – Cross-validation score: performance averaged over the K folds.
  For a grid of values for λ: λ1, λ2, ..., λM.


49

Setting λ
● Data splitting strategy: cross-validation:
  – Cut the training set in K equally-sized chunks.
  – K folds: one chunk to test, the (K−1) others for training.

[Diagram: K folds; each chunk is used once for validation and (K−1) times for training.]

  – Cross-validation score: performance averaged over the K folds.
  – Choose the λ with the best cross-validation score.
● Multiplies time complexity by KM (K folds × M values of λ) — see the sketch below.
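A minimal numpy sketch of this procedure for ridge regression (the fold construction, scoring with MSE and all names are illustrative assumptions, not the slides' code):

    import numpy as np

    def ridge_fit(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    def cross_val_lambda(X, y, lambdas, K=5):
        """Return the lambda with the lowest average validation MSE over K folds."""
        n = len(X)
        folds = np.array_split(np.random.permutation(n), K)
        scores = []
        for lam in lambdas:                      # M values of lambda...
            fold_mse = []
            for k in range(K):                   # ...times K folds
                valid = folds[k]
                train = np.concatenate([folds[j] for j in range(K) if j != k])
                beta = ridge_fit(X[train], y[train], lam)
                fold_mse.append(np.mean((X[valid] @ beta - y[valid]) ** 2))
            scores.append(np.mean(fold_mse))     # cross-validation score for this lambda
        return lambdas[int(np.argmin(scores))]

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(200)
    best_lam = cross_val_lambda(X, y, lambdas=np.logspace(-3, 3, 7))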


50

Ridge regression: ℓ2-regularized learning


51

ℓ2-regularized learning
● Generalization of ridge regression to any loss:
  minimize over β: (1/n) Σᵢ L(βᵀxᵢ, yᵢ) [empirical risk, with loss L] + λ‖β‖² [ℓ2 regularization].
● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.


52

ℓ2-regularized learning
● Generalization of ridge regression to any loss:
  minimize over β: (1/n) Σᵢ L(βᵀxᵢ, yᵢ) [empirical risk, with loss L] + λ‖β‖² [ℓ2 regularization].
● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● Absolute loss: |y − f(x)|.
● Quadratic loss: (y − f(x))².
● ε-insensitive loss: max(0, |y − f(x)| − ε).
● Huber loss: mix of quadratic (near the target) and linear (far from it).


53

Gradient descent


54

Gradient descent
● Suppose the function J to minimize is differentiable.
  If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.

[Figure: first-order Taylor expansion of J at u: the tangent through (u, J(u)) takes the value J(u) + J'(u)·(v − u) at v, which lies below (v, J(v)) when J is convex.]


55

Gradient descent
● Minimize a differentiable, strictly convex function J by finding where its gradient is 0:
  – Set a₀ randomly.
  – Update: a₁ = a₀ − α ∇J(a₀), and more generally aₖ₊₁ = aₖ − α ∇J(aₖ).
  – Repeat.
  – Stop when ‖∇J(aₖ)‖ < ε.

[Figure: one step on a one-dimensional J: at a₀, J'(a₀) < 0, so the update moves to a₁ > a₀.]
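A minimal numpy sketch of this loop, here applied to a ridge-regression objective (the step size, stopping threshold and names are illustrative assumptions, not the slides' choices):

    import numpy as np

    def gradient_descent(grad, a0, alpha=0.01, eps=1e-6, max_iter=10000):
        """Repeat a <- a - alpha * grad(a) until the gradient norm is below eps."""
        a = a0
        for _ in range(max_iter):
            g = grad(a)
            if np.linalg.norm(g) < eps:
                break
            a = a - alpha * g
        return a

    # Example: gradient of J(b) = ||y - X b||^2 / n + lam * ||b||^2.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 10))
    y = X @ rng.standard_normal(10)
    lam = 0.1
    grad = lambda b: 2 * X.T @ (X @ b - y) / len(y) + 2 * lam * b
    beta_hat = gradient_descent(grad, a0=np.zeros(10))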


56

Classification


57

Logistic regression
● Model y as a linear function of x?

[Figure: image classification example, CAT vs. NON-CAT.]


58

Logistic regression
● Model y as a linear function of x? → Model the log-odds ratio as a linear function of x.


59

Logistic regression
● Model y as a linear function of x? → Model the log-odds ratio as a linear function of x:
  log( p(y=1|x) / (1 − p(y=1|x)) ) = βᵀx.

[Figure: the resulting probability p(y=1|x) follows the logistic (sigmoid) curve.]


60

Ridge logistic regression
● Le Cessie and van Houwelingen, 1992
● Goodness-of-fit + ridge regularization:
  minimize over β: (1/n) Σᵢ log(1 + exp(−yᵢ βᵀxᵢ)) [empirical risk, logistic loss] + λ‖β‖² [ridge regularization], for yᵢ ∈ {−1, +1}.
● Logistic loss: negative conditional likelihood.
● No explicit solution.
● Smooth convex optimization problem that can be solved numerically.


Newton-Raphson iterations
● Minimize J convex, differentiable.
● Gradient descent: aₖ₊₁ = aₖ − α ∇J(aₖ).
● Suppose J is twice differentiable.
  – Second-order Taylor expansion: J(v) ≈ J(u) + ∇J(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²J(u) (v − u).
  – Minimize in v instead of in u.
  – Hence use aₖ₊₁ = aₖ − [∇²J(aₖ)]⁻¹ ∇J(aₖ).


62

Solving the logistic regression
● Can be solved with Newton-Raphson iterations.
● Each step is equivalent to solving a weighted ridge regression.
● → IRLS: Iteratively Reweighted Least Squares.
● Complexity: O(np² + p³) per iteration, multiplied by the number of iterations. (A sketch follows below.)
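A minimal numpy sketch of such Newton-Raphson iterations for ℓ2-regularized logistic regression with labels in {−1, +1} (the names, fixed iteration count and regularization value are illustrative assumptions, not the slides' implementation):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def logreg_newton(X, y, lam=0.1, n_iter=20):
        """Newton-Raphson for l2-regularized logistic regression, y in {-1, +1}."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            m = y * (X @ beta)                            # margins y_i * beta^T x_i
            grad = -(X.T @ (y * sigmoid(-m))) / n + 2 * lam * beta
            w = sigmoid(m) * sigmoid(-m)                  # per-sample weights
            hess = (X.T * w) @ X / n + 2 * lam * np.eye(p)
            beta = beta - np.linalg.solve(hess, grad)     # one weighted-ridge-like solve
        return beta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 5))
    y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.1 * rng.standard_normal(500))
    beta_hat = logreg_newton(X, y)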


63

Large margin classifiers
● Margin:
  – classify x as positive if f(x) > 0 and negative otherwise;
  – ideally, y f(x) > 0.
● Large margin classifier: maximize y f(x), i.e. minimize the empirical risk
  (1/n) Σᵢ φ(yᵢ f(xᵢ))
  for a convex, non-increasing function φ (plus ℓ2 regularization).
● Logistic regression: φ(u) = log(1 + e⁻ᵘ).


64

Large margin classifiers
● Linear Support Vector Machine (SVM): use the hinge loss
  φ(u) = max(0, 1 − u),
  i.e. minimize (1/n) Σᵢ max(0, 1 − yᵢ βᵀxᵢ) + λ‖β‖².

[Figure: the hinge loss is 0 for margins u ≥ 1 and increases linearly below 1.]


65

Linear SVM
● Non-smooth, convex optimization problem; a Quadratic Program.
● Equivalent to a dual problem in n variables α (one per training sample), involving the n × n matrix of pairwise dot products.
● Complexity (training):
  – Storing the n × n matrix of dot products: O(n²).
  – Optimization: O(n³).
● Complexity (prediction):
  – Primal: O(p).
  – Dual: O(np).


66

Kernel methods


67

Motivation


68

Non-linear mapping to a feature space

[Figure: a non-linear map φ : ℝ → ℝ² makes one-dimensional data linearly separable in the feature space.]


69

SVM in the feature space
● Train: solve the SVM problem with φ(xᵢ) in place of xᵢ.
● Predict with the decision function f(x) = Σᵢ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b.


70

Kernel SVM
● Train: replace every dot product ⟨φ(xᵢ), φ(xⱼ)⟩ by the kernel value k(xᵢ, xⱼ).
● Predict with the decision function f(x) = Σᵢ αᵢ yᵢ k(xᵢ, x) + b.


71

Kernel trick
● k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
● For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x') = ⟨φ(x), φ(x')⟩_H.
● Hence you can define mappings implicitly.
● Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels in such a way that they can be applied in the initial space without ever computing the mapping.
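To make the trick concrete, here is a small numerical check (a standard illustration, not taken from the slides) that a polynomial kernel computed in the input space matches the dot product in an explicit feature space:

    import numpy as np

    # For the homogeneous degree-2 polynomial kernel k(x, x') = (x^T x')^2 on R^2,
    # an explicit feature map is phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    def phi(x):
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([1.0, 2.0])
    xp = np.array([3.0, -1.0])
    k_implicit = (x @ xp) ** 2      # kernel evaluation, never leaves the input space
    k_explicit = phi(x) @ phi(xp)   # dot product after the explicit mapping
    assert np.isclose(k_implicit, k_explicit)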


72

Non-linear mapping to a feature space

[Figure: a non-linear map φ : ℝ → ℝ² makes one-dimensional data linearly separable in the feature space.]


73

Polynomial kernels

More generally, for x, x' ∈ ℝᵖ and d ≥ 1,
  k(x, x') = (⟨x, x'⟩ + 1)ᵈ
is an inner product in a feature space of all monomials of degree up to d.


74

Gaussian kernel

k(x, x') = exp( −‖x − x'‖² / (2σ²) ).
The feature space has infinite dimension.


75

Kernel ridge regression
● Ridge regression in input space: β = (XᵀX + λI)⁻¹ Xᵀ y — invert a p × p matrix.
● In a feature space of dimension d: invert a d × d matrix.
● Ridge regression in sample space: α = (K + λI)⁻¹ y with Kᵢⱼ = k(xᵢ, xⱼ) — invert an n × n matrix; predictions are f(x) = Σᵢ αᵢ k(xᵢ, x).
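A minimal numpy sketch of kernel ridge regression with the Gaussian kernel (the kernel bandwidth, regularization value and all names are illustrative assumptions, not the slides' code):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """Kernel matrix K with K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    def krr_fit(X, y, lam=0.1, sigma=1.0):
        K = gaussian_kernel(X, X, sigma)                       # n x n Gram matrix
        return np.linalg.solve(K + lam * np.eye(len(X)), y)    # solve (K + lambda I) alpha = y

    def krr_predict(X_train, alpha, X_new, sigma=1.0):
        return gaussian_kernel(X_new, X_train, sigma) @ alpha  # f(x) = sum_i alpha_i k(x_i, x)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    alpha = krr_fit(X, y)
    y_new = krr_predict(X, alpha, X_new=np.linspace(-3, 3, 50)[:, None])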


76

Complexity of KRR
● Computing K: O(n²p).
● Storing K: O(n²).
● Inverting K + λI: O(n³).
● Computing a prediction for one sample:
  – computing κ (the n kernel values between the new sample and the training samples): O(np);
  – computing the products: O(n).


77

Algorithmic complexity recap


78

Summary

Method              | Memory | Train time | Predict time
PCA                 | O(p²)  | O(np²)     | O(p)
k-means             | O(np)  | O(Knp)     | O(Kp)
Ridge regression    | O(p²)  | O(np²)     | O(p)
Logistic regression | O(np)  | O(np²)     | O(p)
SVM, kernel methods | O(np)  | O(n³)      | O(np)

● Training can take place offline
● … unless data is streaming
● Prediction should be fast!


79

Techniques for large-scale ML
● Use the deep learning tricks.
  → Deep learning (March 26, F. Moutarde).
  → Natural Language Processing (March 27, E. Grave).
● Distribute data & computation on modern architectures.
  → Systems for large-scale ML (March 28, C.-A. Azencott).
● Trade optimization accuracy for speed.


80

Gradient descent: convergence rates


81

Flavours of convexity
● J differentiable is convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u).
● J is L-smooth, or L-Lipschitz gradient, iff
  – it is twice differentiable and, for all u, the eigenvalues of ∇²J(u) are at most L.
  J doesn't vary "sharply".
● J is m-strongly convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u) + (m/2)‖v − u‖².
  J is "not too flat", lower-bounded by a quadratic.


82

Gradient descent convergence rates
● Gradient descent with fixed step size:
  – J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
  – J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)), where κ = L/m is the condition number.
● Newton-Raphson iterations:
  – J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np² + p³) log log(1/ε)).


83

Gradient descent convergence rates
● Gradient descent with fixed step size:
  – J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
  – J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)).
● Newton-Raphson iterations:
  – J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np² + p³) log log(1/ε)).

Usually unknown...


Stochastic gradient descent
● Gradient descent: aₖ₊₁ = aₖ − α ∇J(aₖ), where J is the empirical risk, an average over the n training samples.
● Stochastic gradient descent: aₖ₊₁ = aₖ − α ∇Jₗ(aₖ), the gradient of the loss on a single sample l.
  Sampling with replacement: choose l uniformly at random in {1, 2, …, n}.
  – J convex, L-smooth: converges in O(κ/ε²) iterations → O(κp/ε²).
  – J m-strongly convex, L-smooth: converges in O(κ/ε) iterations → O(κp/ε).
  We got rid of n! (A sketch follows below.)
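A minimal numpy sketch of stochastic gradient descent for ridge regression (the constant step size, iteration count and names are illustrative assumptions, not the slides' choices):

    import numpy as np

    def sgd_ridge(X, y, lam=0.1, alpha=0.01, n_iter=10000, seed=0):
        """SGD on the ridge objective: each step uses the gradient of one sample's loss."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            l = rng.integers(n)                              # sample with replacement
            residual = X[l] @ beta - y[l]
            grad_l = 2 * residual * X[l] + 2 * lam * beta    # gradient on one sample only
            beta -= alpha * grad_l
        return beta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100000, 20))
    beta_true = rng.standard_normal(20)
    y = X @ beta_true + 0.1 * rng.standard_normal(100000)
    beta_hat = sgd_ridge(X, y)   # cost per iteration is O(p), independent of n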


Convex optimization
● A number of ML algorithms can be formulated as convex optimization problems.
● They can be solved numerically thanks to variants of the gradient descent.
● Trading optimization accuracy for speed:
  – Necessary when reaching high optimization accuracy is too resource-consuming.
  – Does not necessarily have a large impact on performance: test error is more important than training error.