LSML19: Introduction to
Large Scale Machine Learning
MINES ParisTech — March 2019
Chloé-Agathe Azencott, Centre for Computational Biology, MINES ParisTech
Practical information
● Course website: http://cazencott.info > Teaching > LSML 19
● Schedule:
  – Morning sessions: 09:30 – 12:30
  – Afternoon sessions: 14:00 – 17:00
● Lectures are here in L.118
● Practicals are in L.117, L.119, L.120
● Grade:
  – 60% Exam (Friday, 14:00)
  – 40% RAMP
Today's goals
● Review what machine learning is
● Understand scalability issues
● Discover how to accelerate gradient descent with stochastic gradient descent
Acknowledgements
Slides inspired by Ala Al-Fuqaha, Ethem Alpaydın, Matthew Blaschko, Léon Bottou, Sanjiv Kumar, Trevor Hastie, Rob Tibshirani and Jean-Philippe Vert
Why machine learning?
Business Insider: 2017 is the year of machine learning
● Improving doctors
● Assisting lawyers
● Making cars drive themselves
Perception
Communication
Chatbot example from www.supportbots.ai
Text courtesy of Eddie Izzard's 1999 show Dressed to Kill.
Reasoning
Diagnosis
Scientific discovery
LHC image: Anna Pantelia/CERN
A common thread: ML
● Using algorithms to build models from example (training) data.
● Statistics + optimization + computer science
Empirical risk minimization
● Ingredients:
  – Data
  – Hypothesis class: the shape of the decision function f
  – Loss function: the cost/error of f on one data point
● Recipe: find, among all functions of the hypothesis class, one that minimizes the loss on the training data (the empirical risk), as written out below.
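In symbols (a standard formulation, not spelled out on the slide): given training pairs (x_i, y_i), i = 1, ..., n, a hypothesis class F, and a loss L,

    \hat{f} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)

The quantity being minimized is the empirical risk of f.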
(Un)supervised learning setting
● Data matrix (design matrix) X: n observations (samples, data points) × p features (variables, descriptors, attributes).
● Outcome (target, label) y: the supervision, over t tasks.
● Binary classification: y ∈ {0, 1}. Multi-class classification: y ∈ {1, ..., K}. Regression: y ∈ ℝ.
● Orders of magnitude:
  – Iris dataset: n=150, p=4, t=1.
  – Cancer drug sensitivity: n=10^3, p=10^6, t=100.
  – ImageNet: n=14·10^6, p=6·10^3, t=22·10^3.
  – Shopping, e-marketing...: n=O(10^6), p=O(10^9), t=O(10^8).
  – Astronomy, GAFAs, web...: n=O(10^9), p=O(10^9), t=O(10^9).
Scaling ML algorithms
● What is large scale?
  – Data does not fit in RAM;
  – Data streams;
  – Algorithms do not run in a reasonable time on a single machine.
● Important considerations:
  – Performance increases with the number of samples.
  – The likelihood of overfitting increases with the number of features.
A brief zoo of ML problems
Unsupervised learning
Data (images, text, measurements, omics data...) → ML algo → Data!
Learn a new representation of the data.
(Diagram: the data is an n × p matrix X.)
Dimensionality reduction
Find a lower-dimensional representation.
(Diagram: the n × p data matrix X, built from images, text, measurements, omics data..., is mapped by the ML algo to an n × m matrix, with m < p.)
Clustering
Group similar data points together.
(Diagram: data → ML algo → groups of similar points.)
Unsupervised learning
● Dimensionality reduction (e.g., PCA)
● Clustering (e.g., k-means)
● Density estimation
● Feature learning
Supervised learning
Make predictions.
(Diagram: data X (n × p) and labels y (length n) → ML algo → predictor, i.e., a decision function.)
Classification
Make discrete predictions: binary classification, multi-class classification.
(Diagram: data and labels → ML algo → predictor.)
Regression
Make continuous predictions.
(Diagram: data and labels → ML algo → predictor.)
Supervised learning
● Regression: ordinary least squares, ridge regression
● Classification: logistic regression, SVM
● Structured output prediction
Main ML paradigms
● Unsupervised learning:
  – Dimensionality reduction;
  – Clustering;
  – Density estimation;
  – Feature learning.
● Supervised learning:
  – Regression;
  – Classification;
  – Structured output prediction.
● Semi-supervised learning.
● Reinforcement learning.
Dimensionality reduction: Principal Components Analysis
Principal Components Analysis
● Objective:
  – Reduce the dimension without losing the variability in the data;
  – Find a low-dimensional space so as to maximize the variance of the data projected onto that space.
● The k-th principal component:
  – is orthogonal to all previous components;
  – captures the largest amount of remaining variance;
  – solution: w is the k-th eigenvector of the covariance matrix XᵀX.
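To make this concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix (function and variable names are mine, not from the course):

    import numpy as np

    def pca(X, n_components):
        """Minimal PCA sketch: project X (n x p) onto its top principal components."""
        X_centered = X - X.mean(axis=0)           # center each feature
        cov = X_centered.T @ X_centered / len(X)  # p x p covariance matrix: O(np^2)
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
        W = eigvecs[:, ::-1][:, :n_components]    # principal axes, by decreasing variance
        return X_centered @ W                     # n x n_components projection

    # Toy usage on iris-sized data (n=150, p=4):
    X = np.random.randn(150, 4)
    X_2d = pca(X, n_components=2)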
PCA example: Population genetics
Novembre et al., 2008
Genetic data of 1387 Europeans
Algorithmic complexity of PCA
● Memory: store the data X (n × p) and the covariance matrix XᵀX (p × p): O(np + p^2).
● Runtime:
  – Computing XᵀX: O(np^2).
  – Computing the first K eigenvectors by Lanczos iterations: O(Kp^2).
  – Computing the covariance matrix is more expensive than computing the first K principal components!
● Example, n=10^9, p=10^8:
  – Computing the covariance matrix: 10^25 FLOPs. Fastest world computer (Nov 2018): 75 petaFLOPS → 2+ years.
  – Storing the covariance matrix: 10^16 B = (10^16 / 2^50) PB ≈ 8 PB.
Clustering: k-means
k-means clustering
● Goal:
  – Find a cluster assignment c: {1, ..., n} → {1, ..., K}
  – that minimizes the intra-cluster variance: Σ_k Σ_{i: c(i)=k} ‖x_i − μ_k‖^2,
  – where the centroids μ_k are the means of the points assigned to cluster k.
● The resulting clusters form a Voronoi tessellation of the input space.
● The problem is NP-hard → iterative algorithm (see the sketch below):
  – Assignment step: fix the centroids, update the assignments.
  – Update step: update the centroids.
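A minimal NumPy sketch of that iterative algorithm (Lloyd's algorithm; names and defaults are mine):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Minimal k-means sketch: alternate assignment and update steps."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random data points
        for _ in range(n_iter):
            # Assignment step: fix the centroids, update the assignments.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)          # nearest centroid (n x K distances)
            # Update step: each centroid becomes the mean of its cluster
            # (empty clusters are not handled in this sketch).
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):  # converged
                break
            centroids = new_centroids
        return labels, centroids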
k-means clustering, step by step:
1. Pick 3 centroids at random.
2. Assign each observation to the nearest centroid.
3. Recompute the centroids.
4. Re-assign each observation to the nearest centroid.
5. Recompute the centroids, and iterate the process until convergence.
k-means complexity
– Assignment step: fix the centroids, update the assignments: compute n × K distances in p dimensions → O(nKp).
– Update step: update the centroids: sum n values in p dimensions → O(np).
– T iterations → O(nKpT) overall.
– Storage: store n cluster assignments + K centroids → O(n + Kp); store X → O(np).
Ridge regression
Linear regression
Least-squares fit (equivalent to the MLE under the assumption of Gaussian noise): min_β ‖y − Xβ‖^2.
Solution β̂ = (XᵀX)⁻¹ Xᵀy, uniquely defined when XᵀX is invertible.
Ridge regression
● Hoerl & Kennard, 1970.
● Goodness-of-fit + ridge regularization: min_β ‖y − Xβ‖^2 + λ‖β‖^2 (empirical risk + ridge regularization).
● The solution is unique and always exists: β̂ = (XᵀX + λI)⁻¹ Xᵀy.
● Limit cases:
  – λ → 0: OLS (non-regularized) solution (low bias, high variance);
  – λ → ∞: β = 0 (high bias, low variance).
● Correlated features get similar weights.
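A minimal NumPy sketch of this closed-form solution, solving the linear system rather than forming the inverse explicitly (names are mine):

    import numpy as np

    def ridge(X, y, lam):
        """Minimal ridge sketch: solve (X^T X + lam I) beta = X^T y."""
        p = X.shape[1]
        # np.linalg.solve is preferred to computing the inverse explicitly.
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Toy usage:
    X = np.random.randn(100, 5)
    y = X @ np.arange(5.0) + 0.1 * np.random.randn(100)
    beta_hat = ridge(X, y, lam=1.0)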
Complexity of ridge regression
● Computing XᵀX + λI: O(np^2).
● Inverting it: O(p^3).
● When n ≫ p, computing XᵀX is more expensive than inverting it!
Regularization path
(Figure: regression weights and error (MSE) as functions of the regularization strength.)
Ridge regression: hyperparameter setting
Setting λ
(Figure: prediction error vs. model complexity. The error on training data keeps decreasing with complexity; the error on new data first decreases (underfitting) then increases again (overfitting).)
Setting λ
● Data splitting strategy: cross-validation, over a grid of values λ_1, λ_2, ..., λ_M:
  – Cut the training set into K equally-sized chunks.
  – K folds: one chunk for validation, the (K−1) others for training.
  – Cross-validation score: performance averaged over the K folds.
  – Choose the λ with the best cross-validation score.
● Multiplies the time complexity by KM. (See the sketch below.)
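With scikit-learn, this grid search can be sketched as follows (λ is called alpha there; the grid and the toy data are hypothetical choices of mine):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = np.random.randn(200, 10), np.random.randn(200)  # toy data
    lambdas = np.logspace(-3, 3, 7)                         # grid lambda_1, ..., lambda_M

    # One K-fold cross-validation score per candidate lambda (K=5 here).
    scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5).mean() for lam in lambdas]
    best_lam = lambdas[int(np.argmax(scores))]              # best score wins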
Ridge regression: ℓ2-regularized learning
ℓ2-regularized learning
● Generalization of ridge regression to any loss: min_β (1/n) Σ_i L(f(x_i), y_i) + λ‖β‖^2 (empirical risk + ℓ2 regularization).
● If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● Losses: quadratic loss; absolute loss; ε-insensitive loss; Huber loss (a mix of quadratic and linear), written out below.
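In standard form (the slides show these as plots; ε and δ are the losses' usual parameters), writing r = y − f(x):

    L_{\text{quadratic}} = r^2
    L_{\text{absolute}} = |r|
    L_{\varepsilon\text{-insensitive}} = \max(0, |r| - \varepsilon)
    L_{\text{Huber}} = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta\,(|r| - \delta/2) & \text{otherwise} \end{cases}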
Gradient descent
Gradient descent
● Suppose the function J to minimize is differentiable. If the loss is convex, then the problem is strictly convex and has a unique global solution, which can be found numerically.
● First-order Taylor expansion of J at u: J(v) ≈ J(u) + J′(u)·(v − u).
(Figure: the points (u, J(u)) and (v, J(v)) on the curve, and (v, J(u) + J′(u)·(v − u)) on the tangent.)
Gradient descent
● Minimize a differentiable, strictly convex function J by finding where its gradient is 0:
  – Set a_0 randomly.
  – Update: a_{k+1} = a_k − α ∇J(a_k).
  – Repeat.
  – Stop when |∇J(a_k)| < ε.
(Figure: one step from a_0 to a_1 on the curve J(a), at a point where J′(a_0) < 0. A sketch of this loop follows.)
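A minimal NumPy sketch of the loop (the step size α, tolerance ε, and iteration cap are hypothetical defaults of mine):

    import numpy as np

    def gradient_descent(grad, a0, alpha=0.1, eps=1e-6, max_iter=10_000):
        """Minimal gradient descent sketch: a <- a - alpha * grad(a)."""
        a = a0
        for _ in range(max_iter):
            g = grad(a)
            if np.linalg.norm(g) < eps:   # stop when |grad J(a)| < eps
                break
            a = a - alpha * g
        return a

    # Toy usage: J(a) = ||a - 3||^2 has gradient 2(a - 3) and minimum at a = (3, 3).
    a_star = gradient_descent(lambda a: 2 * (a - 3.0), a0=np.zeros(2))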
Classification
Logistic regression
● Model y as a linear function of x? Not suitable for binary labels (e.g., CAT vs. NON-CAT).
● Instead, model the log-odds ratio as a linear function of x:
  log[p(y=1|x) / (1 − p(y=1|x))] = β_0 + βᵀx, i.e., p(y=1|x) = 1 / (1 + e^{−(β_0 + βᵀx)}).
Ridge logistic regression
● Le Cessie and van Houwelingen, 1992.
● Goodness-of-fit + ridge regularization: minimize the empirical risk under the logistic loss, plus λ‖β‖^2.
● Logistic loss: the negative conditional log-likelihood, −[y log p(x) + (1 − y) log(1 − p(x))].
● No explicit solution.
● Smooth convex optimization problem that can be solved numerically.
Newton-Raphson iterations
● Minimize J, convex and differentiable.
● Gradient descent: a_{k+1} = a_k − α ∇J(a_k).
● Suppose J is twice differentiable.
  – Second-order Taylor expansion: J(v) ≈ J(u) + ∇J(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²J(u) (v − u).
  – Minimize the expansion in v instead of evaluating at u: v = u − [∇²J(u)]⁻¹ ∇J(u).
  – Hence use a_{k+1} = a_k − [∇²J(a_k)]⁻¹ ∇J(a_k), as sketched below.
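A minimal NumPy sketch of these iterations (the gradient and Hessian are passed in as functions; names and defaults are mine):

    import numpy as np

    def newton(grad, hess, a0, eps=1e-8, max_iter=50):
        """Minimal Newton-Raphson sketch: a <- a - [Hessian]^{-1} gradient."""
        a = a0
        for _ in range(max_iter):
            g = grad(a)
            if np.linalg.norm(g) < eps:
                break
            a = a - np.linalg.solve(hess(a), g)  # solve H d = g rather than inverting H
        return a

    # Toy usage: for a quadratic J(a) = 0.5 a^T Q a - b^T a, one step suffices.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    a_star = newton(lambda a: Q @ a - b, lambda a: Q, a0=np.zeros(2))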
Solving the logistic regression
● Can be solved with Newton-Raphson iterations.
● Each step is equivalent to solving a weighted ridge regression.
● → IRLS: Iteratively Reweighted Least Squares.
● Complexity: O(np^2) per iteration × the number of iterations.
Large margin classifiers
● Margin: yf(x), for labels y ∈ {−1, +1}:
  – classify x as positive if f(x) > 0 and negative otherwise;
  – ideally, yf(x) > 0.
● Large margin classifier: maximize yf(x), i.e., minimize Σ_i φ(y_i f(x_i)) for a convex, non-increasing function φ.
● Logistic regression: φ(u) = log(1 + e^{−u}).
Large margin classifiers
● Linear Support Vector Machine (SVM): φ(u) = max(0, 1 − u), the hinge loss.
(Figure: the hinge loss, zero for u ≥ 1, decreasing linearly for u < 1.)
Linear SVM
● Non-smooth, convex optimization problem; a quadratic program.
● Equivalent to a dual problem in n variables, which involves the samples only through their dot products x_iᵀx_j.
● Complexity (training):
  – Storing the n × n matrix of dot products: O(n^2).
  – Optimization: O(n^3).
● Complexity (prediction):
  – Primal: O(p).
  – Dual: O(np).
Kernel methods
Motivation
Non-linear mapping to a feature space
(Figure: data in ℝ, not linearly separable, becomes linearly separable after a non-linear mapping to ℝ².)
SVM in the feature space
● Train: solve the dual problem, with the dot products φ(x_i)ᵀφ(x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i φ(x_i)ᵀφ(x) + b.
Kernel SVM
● Train: solve the dual problem, with the kernel values k(x_i, x_j) = φ(x_i)ᵀφ(x_j).
● Predict with the decision function f(x) = Σ_i α_i y_i k(x_i, x) + b.
Kernel trick
● k may be quite efficient to compute, even if H is a very high-dimensional or even infinite-dimensional space.
● For any positive semi-definite function k, there exists a feature space H and a feature map φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩_H.
● Hence you can define mappings implicitly.
● Kernel trick: algorithms that only involve the samples through their dot products can be rewritten using kernels, in such a way that they can be applied in the initial space without ever computing the mapping.
Polynomial kernels
k(x, x′) = (xᵀx′ + 1)^d. More generally, for any d ≥ 1, (xᵀx′ + 1)^d is an inner product in a feature space of all monomials of degree up to d.
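For instance, in one dimension with d = 2, the claim can be checked by direct expansion:

    (xx' + 1)^2 = x^2 x'^2 + 2xx' + 1 = \langle (x^2, \sqrt{2}\,x, 1), (x'^2, \sqrt{2}\,x', 1) \rangle

so φ(x) = (x², √2·x, 1) collects all monomials of degree up to 2.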
Gaussian kernel
k(x, x′) = exp(−‖x − x′‖^2 / (2σ^2)).
The corresponding feature space has infinite dimension.
Kernel ridge regression
● Ridge regression in input space: β̂ = (XᵀX + λI)⁻¹ Xᵀy, inverting a p × p matrix.
● In a feature space of dimension d: invert a d × d matrix (possibly huge, or infinite).
● Ridge regression in sample space: α̂ = (K + λI)⁻¹ y, with K the n × n kernel matrix, inverting an n × n matrix.
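A minimal NumPy sketch of kernel ridge regression with a Gaussian kernel (σ, λ, and all names are hypothetical choices of mine):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2 * sigma ** 2))

    def krr_fit(X, y, lam=1.0, sigma=1.0):
        """Solve (K + lam I) alpha = y in sample space."""
        K = gaussian_kernel(X, X, sigma)                     # n x n kernel matrix
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    def krr_predict(X_train, alpha, X_new, sigma=1.0):
        """f(x) = sum_i alpha_i k(x, x_i)."""
        kappa = gaussian_kernel(X_new, X_train, sigma)       # kernels against training set
        return kappa @ alpha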
Complexity of KRR
● Computing K: O(n^2 p).
● Storing K: O(n^2).
● Inverting K + λI: O(n^3).
● Computing a prediction for one sample:
  – computing κ = (k(x, x_1), ..., k(x, x_n)): O(np);
  – computing the product κᵀα: O(n).
Algorithmic complexity recap
Summary

Method               | Memory | Train time | Predict time
PCA                  | O(p^2) | O(np^2)    | O(p)
k-means              | O(np)  | O(Knp)     | O(Kp)
Ridge regression     | O(p^2) | O(np^2)    | O(p)
Logistic regression  | O(np)  | O(np^2)    | O(p)
SVM, kernel methods  | O(np)  | O(n^3)     | O(np)

● Training can take place offline...
● ... unless data is streaming.
● Prediction should be fast!
Techniques for large-scale ML
● Use the deep learning tricks.
  → Deep learning (March 26, F. Moutarde).
  → Natural Language Processing (March 27, E. Grave).
● Distribute data & computation on modern architectures.
  → Systems for large-scale ML (March 28, C.-A. Azencott).
● Trade optimization accuracy for speed.
Gradient descent: convergence rates
Flavours of convexity
● J differentiable is convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u).
● J is L-smooth, or L-Lipschitz gradient, iff it is twice differentiable and for all u, every eigenvalue of ∇²J(u) is at most L.
  J doesn't vary "sharply".
● J is m-strongly convex iff for all u, v: J(v) ≥ J(u) + ∇J(u)ᵀ(v − u) + (m/2)‖v − u‖^2.
  J is "not too flat": lower-bounded by a quadratic.
Gradient descent convergence rates
● Gradient descent with fixed step size:
  – J convex, L-smooth: converges in O(1/ε) iterations → O(np/ε).
  – J m-strongly convex, L-smooth: converges in O(κ log(1/ε)) iterations → O(np κ log(1/ε)), with κ = L/m the condition number (L and m are usually unknown...).
● Newton-Raphson iterations:
  – J convex, L-smooth: converges in O(log log(1/ε)) iterations → O((np^2 + p^3) log log(1/ε)).
Stochastic gradient descent
● Gradient descent: a_{k+1} = a_k − α (1/n) Σ_i ∇J_i(a_k), where J_i is the loss on sample i.
● Stochastic gradient descent: a_{k+1} = a_k − α_k ∇J_l(a_k).
  Sampling with replacement: choose l uniformly at random in {1, 2, ..., n}.
  – J convex, L-smooth: converges in O(κ/ε^2) iterations → O(κp/ε^2).
  – J m-strongly convex, L-smooth: converges in O(κ/ε) iterations → O(κp/ε).
● We got rid of n! (See the sketch below.)
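A minimal NumPy sketch of SGD on ridge regression (the decreasing step size α_k = 1/k is a common choice for strongly convex objectives; names and defaults are mine):

    import numpy as np

    def sgd_ridge(X, y, lam=0.1, n_iter=10_000, seed=0):
        """Minimal SGD sketch for J(beta) = (1/n) sum_i (y_i - x_i beta)^2 + lam ||beta||^2."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        beta = np.zeros(p)
        for k in range(1, n_iter + 1):
            l = rng.integers(n)                  # pick one sample, with replacement
            grad_l = 2 * (X[l] @ beta - y[l]) * X[l] + 2 * lam * beta
            beta -= grad_l / k                   # decreasing step size alpha_k = 1/k
            # Each iteration costs O(p): the per-step cost no longer depends on n.
        return beta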
Convex optimization
● A number of ML algorithms can be formulated as convex optimization problems.
● They can be solved numerically thanks to variants of gradient descent.
● Trading optimization accuracy for speed:
  – Necessary when reaching high accuracy is too resource-consuming.
  – Does not necessarily have a large impact on predictive performance: test error matters more than training error.