Introduction to Machine Learning - TRANSCRIPT
L. Risser CNRS / Institut de Mathématiques de Toulouse
Introduction to Machine Learning
Laurent Risser Institut de Mathématiques de Toulouse
0) From Statistics to Machine Learning
As a starter: From Statistics to Machine Learning
1940-70 : Classic Statistics dealing with tested hypotheses (e.g. are the students in class A significantly taller than the students in class B?). There are n ≈ 30 observations and p < 10 variables.
1970s : The use of computers becomes increasingly popular. Larger volumes of data are explored in Statistics. Expert Systems also start making automatic decisions based on rules injected by human experts (e.g. if blood pressure < threshold and spots on the skin, then take a specific medication).
1980s : Expert Systems are made obsolete by Machine Learning and Neural Networks. Decision rules are automatically defined based on observed data.
1990s : 1st paradigm shift: Observed data are no longer planned but stored in databases and then re-used: from Data Mining to Knowledge Discovery.
2000s : 2nd paradigm shift: The number of variables p becomes increasingly large (typically in omic data, where p >> n). Prediction quality matters more than model interpretability (black-box models). The curse of dimensionality makes regularization important.
2010s : 3rd paradigm shift: The number of observations n is now increasingly large (e-commerce, geo-localisation, …). Databases are structured in clouds and computations run on clusters (big data). Decisions also have to be almost immediate.
Open challenges: fast and robust interpretation of videos (autonomous vehicles); explainability of black-box decision rules (social issues and certifiability); complex data (small data).
Course by P. Besse (INSA Toulouse) on Statistical Learning: http://www.math.univ-toulouse.fr/~besse/enseignement.html
Talk overview
• From statistics to machine learning
• Introductory examples
  • Supervised learning
  • Unsupervised learning
• Classic algorithms in machine learning
  • K-means
  • Decision trees and Random forests
  • SVM
  • Logistic regression
• Overfitting and cross validation
  • Overfitting
  • Cross-validation
• High dimensionality and regularization
  • Modeling a real-life problem
  • Effect of regularization
  • Dimensionality reduction using PCA
• Supervised learning using Neural networks
• Conclusion
1) Introductory examples
Two introductory examples
Supervised learning — Training Data
n = 20 observations, p = 2 variables (problem dimension), label with 2 states
[Figure: scatter plot of the labelled training data, axes Variable 1 and Variable 2]
1.a) Introductory examples — Supervised learning
Example : Variable 1 = Age, Variable 2 = Monthly income, State = Buys a product at Christmas
Supervised learning — Prediction
A new unlabelled observation "?" arrives: what is its most likely state?
[Figure: the training data together with the new point "?"]
Supervised learning (here using a linear model), then prediction on new data: the learnt decision boundary assigns "?" its most likely state.
[Figure: a linear decision boundary separating the two states, with "?" on one side]
To sum up:
1. We have labelled training data
2. Choose a model to classify the data
3. Optimise the model parameters (learning) according to a loss function (e.g. the prediction error)
4. Validate the model on a test set
5. Predict on new observations
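The five steps above can be condensed into a small NumPy sketch (toy data; the cluster centres, the least-squares fit of the linear model and the test points are illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Labelled training data: n = 20 observations, p = 2 variables, labels in {-1, +1}
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(10, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(10, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 10 + [-1] * 10)

# 2./3. A linear model, fitted here by least squares on the labels y ~ w . x + b
A = np.hstack([X, np.ones((20, 1))])          # extra column for the intercept b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:2], coef[2]

# 5. Prediction: most likely state of a new observation
def predict(x):
    return 1 if x @ w + b > 0 else -1

print(predict(np.array([2.1, 1.8])))   # near the positive cluster
print(predict(np.array([-0.2, 0.1])))  # near the negative cluster
```

Step 4 (validation on held-out data) is the subject of the cross-validation section below.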
[Figure: scatter plot of the unlabelled input data, axes Variable 1 and Variable 2]
1.b) Introductory examples — Unsupervised learning
Unsupervised learning — Input Data
n = 20 observations, p = 2 variables (problem dimension), no label
Is it reasonably possible to distinguish several sub-groups of observations?
Unsupervised learning — Distance between the observations
[Figure: for each observation xi, its distances dist(xi, xj) to all the other observations, j ∈ {1, …, i−1, i+1, …, n}]
Distances between the observations: {dist(xi, xj)}, i, j ∈ {1, …, n}
Unsupervised learning — Learning
[Figure: the observations clustered into Group 1, Group 2 and Group 3]
Energy to minimize w.r.t. the labels yi, given the distances between the observations:
f (y1, …, yn, {dist(xi, xj)}, i, j ∈ {1, …, n})
e.g. f(…) = Σ_{i=1}^{n} Σ_{j=1}^{n} 𝕀_{yi == yj} ||xi − xj||₂²
Unsupervised learning — Link with graph clustering
[Figure: the observations seen as a graph, where strong links connect nearby observations; graph clustering recovers Group 1, Group 2 and Group 3]
To sum up:
1. We have training data with no known labels
2. Choose a model to classify the data
3. Optimise the model parameters (learning) based on an inner-energy criterion
4. Predict on new observations
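As a sketch of the inner-energy criterion above, the snippet below builds the pairwise distances and checks that a labelling matching the true sub-groups has a lower energy f than a labelling that mixes them (the two Gaussian sub-groups are illustrative toy data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two clear sub-groups, but no labels are given
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(3.0, 0.3, size=(10, 2))])

# Pairwise squared distances dist(xi, xj)
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

def energy(labels):
    """f = sum over pairs sharing the same label of |xi - xj|^2."""
    same = labels[:, None] == labels[None, :]
    return np.sum(D[same])

good = np.array([0] * 10 + [1] * 10)   # matches the true sub-groups
bad = np.array([0, 1] * 10)            # mixes the sub-groups
print(energy(good) < energy(bad))      # the good labelling has lower energy
```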
2) Classic algorithms
Classic algorithms in Machine Learning
2.a) Classic algorithms - K-means
K-means algorithm
N seeds are randomly drawn (in this example N=4)
For each observation, we consider the nearest seed.
Remark: Euclidean distances are used in this example
Seeds are then re-centered according to their corresponding observations…
… for each observation, we again consider the nearest seed …
… and we re-iterate until convergence.
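The steps above can be condensed into a short NumPy implementation (a sketch: seeds drawn at random among the observations, Euclidean distances, iteration until the seeds stop moving; the three toy clusters are illustrative):

```python
import numpy as np

def kmeans(X, n_seeds, n_iter=100, seed=0):
    """Plain K-means with Euclidean distances, following the steps above."""
    rng = np.random.default_rng(seed)
    # N seeds are randomly drawn among the observations
    centers = X[rng.choice(len(X), size=n_seeds, replace=False)]
    for _ in range(n_iter):
        # For each observation, consider the nearest seed
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Seeds are re-centered on their corresponding observations...
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(n_seeds)])
        # ... and we re-iterate until convergence
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, size=(15, 2)) for c in ([0, 0], [4, 0], [0, 4])])
labels, centers = kmeans(X, n_seeds=3)
```

Note that the result depends on the random initial seeds; running the algorithm several times and keeping the best grouping is common practice.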
Decision trees
x1 , x2 , … , xN are the observations (here: xi is a 2D point coordinate)
y1 , y2 , … , yN are the labels (here: 1 and -1)
The domain is subdivided into sub-domains to minimize the variance in each sub-domain (CART).
2.b) Classic algorithms - decision trees
First split: all observations are divided into the sub-domains D2 (var2 < 3) and D1 (var2 > 3).
Second split: one sub-domain is divided in turn into D4 (var1 < 1) and D3 (var1 > 1).
… and the splitting is repeated recursively on the resulting sub-domains.
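A single CART-style split can be sketched as an exhaustive scan over axis-aligned thresholds, keeping the one that minimizes the within-sub-domain label variance (the toy data below, separable on variable 2 as in the slide, and the exact variance weighting are illustrative):

```python
import numpy as np

def best_split(X, y):
    """One CART step: try every (variable, threshold) pair and keep the split
    minimizing the summed label variance of the two sub-domains."""
    best = (None, None, np.inf)          # (variable index, threshold, score)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = len(left) * left.var() + len(right) * right.var()
            if score < best[2]:
                best = (j, t, score)
    return best

# Labels separate on variable 2 (index 1): low values -> -1, high values -> +1
X = np.array([[0.5, 1.0], [1.5, 2.0], [2.5, 1.5],
              [0.8, 4.0], [1.2, 5.0], [2.0, 4.5]])
y = np.array([-1, -1, -1, 1, 1, 1])
j, t, _ = best_split(X, y)
print(j, t)   # the split is found on variable index 1
```

Growing a full tree simply applies this search recursively inside each sub-domain.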
Random forests
High dimension: p >> 1 (p = 2 in the former example)
Learning:
• Several trees are independently defined.
• Dimensions are randomly drawn.
Prediction:
• The label predicted at a given point is the one predicted by the majority of the trees (bagging).
2.c) Classic algorithms - random forests
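A bare-bones sketch of these learning and prediction steps, using depth-1 "trees" (stumps) for brevity: bootstrap resampling of the observations, one randomly drawn dimension per tree, and a majority vote at prediction time (all data and parameters are illustrative):

```python
import numpy as np

def fit_stump(X, y, rng):
    """One very shallow 'tree': a single split on one randomly drawn dimension."""
    j = rng.integers(X.shape[1])              # the dimension is randomly drawn
    t = np.median(X[:, j])
    left_label = 1 if y[X[:, j] <= t].mean() > 0 else -1
    return j, t, left_label

def forest_predict(stumps, x):
    """Bagging: the predicted label is the one given by the majority of the trees."""
    votes = [ll if x[j] <= t else -ll for j, t, ll in stumps]
    return 1 if sum(votes) > 0 else -1

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 0.3, size=(20, 2)), rng.normal(1, 0.3, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Several trees are independently defined on bootstrap resamples of the observations
stumps = []
for _ in range(25):
    idx = rng.integers(len(X), size=len(X))   # draw n observations with replacement
    stumps.append(fit_stump(X[idx], y[idx], rng))

print(forest_predict(stumps, np.array([0.9, 1.1])))
print(forest_predict(stumps, np.array([-0.9, -1.1])))
```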
Support Vector Machine (SVM)
x1 , x2 , … , xN are the observations (here: xi is a 2D point coordinate)
y1 , y2 , … , yN are the labels (here: 1 and -1)
2.d) Classic algorithms - Support Vector Machine (SVM)
We estimate w and b such that:
yi (w · xi − b) ≥ 1 for all 1 ≤ i ≤ n
[Figure: the separating hyperplane (w · x − b) = 0, its normal vector w, and the margin hyperplanes (w · x − b) = 1 and (w · x − b) = −1]
With these data, full classification with a linear model is now impossible!
We estimate w and b that minimize:

[ (1/n) Σ_{i=1}^{n} max( 0 , 1 − yi (w · xi − b) ) ] + λ ||w||₂²

where each term max( 0 , 1 − yi (w · xi − b) ) is > 0 if yi is not well predicted.
Remark: It is possible (and common) to obtain non-linear separations by replacing the dot products with non-linear functions.
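The soft-margin objective above can be minimized by plain (sub)gradient descent; below is a minimal NumPy sketch on separable toy data (the learning rate, λ, iteration count and data are illustrative choices):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, lr=0.1, n_epochs=200):
    """Minimize (1/n) sum_i max(0, 1 - y_i (w.x_i - b)) + lam ||w||^2
    by subgradient descent on (w, b)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w - b)
        active = margins < 1                  # terms whose max(...) is > 0
        grad_w = -(y[active].reshape(-1, 1) * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = y[active].sum() / n          # d/db of -y_i (w.x_i - b) is +y_i
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.5, size=(15, 2)), rng.normal(2, 0.5, size=(15, 2))])
y = np.array([-1] * 15 + [1] * 15)
w, b = svm_sgd(X, y)
acc = np.mean(np.sign(X @ w - b) == y)
print(acc)   # training accuracy
```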
Support Vector Machine (SVM) — Kernel methods
How can we handle this case using a linear model?
We denote an observation xi = (xi^1, xi^2).
We classify the Φ(xi) = (xi^1, xi^2, (xi^1)²) instead of the xi.
[Figure: in the lifted space (Variable 1, Variable 2, Variable 3 = (Variable 1)²), a linear model can now separate the two classes]
… there exists a huge literature on this topic
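A tiny illustration of the feature map Φ(xi) = (xi^1, xi^2, (xi^1)²): data that no linear rule separates in the plane (the positive class sits on both sides of the negative one) become separable by a hand-picked linear rule on the lifted third coordinate (the bands and the threshold 2.5 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Not linearly separable in (x1, x2): the positive class sits on BOTH sides
x1 = np.concatenate([rng.uniform(-1, 1, 20),    # inner band   -> label -1
                     rng.uniform(2, 3, 10),     # outer bands  -> label +1
                     rng.uniform(-3, -2, 10)])
x2 = rng.uniform(-1, 1, 40)
X = np.column_stack([x1, x2])
y = np.array([-1] * 20 + [1] * 20)

# Feature map from the slide: Phi(x) = (x1, x2, x1^2)
Phi = np.column_stack([X, X[:, 0] ** 2])

# A linear rule in the lifted space separates the classes: x1^2 > 2.5
pred = np.sign(Phi[:, 2] - 2.5)
acc = np.mean(pred == y)
print(acc)   # 1.0: perfect separation after lifting
```

In practice the lifting is done implicitly through a kernel function rather than by building Φ explicitly.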
2.e) Classic algorithms - Logistic regression
Logistic regression
For each observation i ∈ {1, …, n}:
• Explicative variable: xi = (xi^1, xi^2, …, xi^p), where p may be very large (p >> 1)
• Response variable: yi ∈ {−1, 1}

Estimate (w, b), with w = {w1, …, wp}, as the minimizer of the model's negative log-likelihood.

Ideally: yi = 1 when b + Σ_{j=1}^{p} wj xi^j > 0, and yi = −1 when b + Σ_{j=1}^{p} wj xi^j < 0.

Remarks:
• Linear classification, as with a linear SVM, but a different model with strong statistical insights
• Scales particularly well when n and/or p is very high
• Modeling constraints on w is the gold standard when p >> n
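A minimal sketch of the logistic-regression fit, minimizing the negative log-likelihood Σ_i log(1 + exp(−yi (w · xi + b))) by gradient descent (toy separable data; the learning rate and iteration count are arbitrary):

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Gradient descent on sum_i log(1 + exp(-y_i (w.x_i + b))), y_i in {-1, 1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        z = np.clip(y * (X @ w + b), -60, 60)   # clip to avoid exp overflow
        s = -y / (1.0 + np.exp(z))              # per-observation gradient factor
        w -= lr * (X.T @ s) / n
        b -= lr * s.sum() / n
    return w, b

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1.5, 0.5, size=(25, 2)), rng.normal(1.5, 0.5, size=(25, 2))])
y = np.array([-1] * 25 + [1] * 25)
w, b = fit_logistic(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
print(acc)   # training accuracy
```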
3) Over-fitting and cross validation
Overfitting and cross-validation
3.a) Over-fitting and cross validation — overfitting
Overfitting
Training data: the three classifiers below are trained on the same labelled points.
Full decision tree: 100% training accuracy.
Truncated decision tree: 95% training accuracy.
Linear SVM or Logistic Regression: 95% training accuracy.
Which strategy would you trust most to predict the label of new observations?
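The contrast can be reproduced with a model that memorizes the training set: 1-nearest-neighbour (standing in here for the full tree) always reaches 100% training accuracy, but only its accuracy on fresh data tells us whether it generalizes (1D toy data; all settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    """Two overlapping 1D clusters, so some points are inherently ambiguous."""
    X = np.concatenate([rng.normal(-1, 1.0, n // 2), rng.normal(1, 1.0, n // 2)])
    y = np.array([-1] * (n // 2) + [1] * (n // 2))
    return X, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(40)

# "Full tree" behaviour: 1-nearest-neighbour memorizes every training point
def knn1(x):
    return y_train[np.argmin(np.abs(X_train - x))]

train_full = np.mean(np.array([knn1(x) for x in X_train]) == y_train)  # always 1.0
test_full = np.mean(np.array([knn1(x) for x in X_test]) == y_test)
# "Truncated" behaviour: one simple rule (sign of x)
test_simple = np.mean(np.sign(X_test) == y_test)
print(train_full, test_full, test_simple)
```

Perfect training accuracy says nothing by itself; the comparison has to be made on the test data, which is the point of the next sub-section.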
All available data to learn decision rules
3.b) Over-fitting and cross validation — cross-validation
Cross-validation: Fundamental paradigm of Machine Learning to validate trained models
Split the data into training data and test data
[Training data | Test data]
Learn the model parameters on the training data: 93.75% accuracy in this example.
Evaluate the model quality on the test data, with no risk of overfitting: 100% accuracy in this example.
K-folds: K tests to be more robust and additionally evaluate the model stability.
The data are split into K folds (here K = 4); each fold serves once as the test set while the model is learnt on the other folds. Test accuracies in this example: 100%, 100%, 100% and 80%.
95% average accuracy and stable parameters… good stuff.
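A K-fold loop is easy to write by hand; the sketch below applies it to a toy nearest-class-mean classifier (the model, K = 4 and the data are illustrative choices):

```python
import numpy as np

def kfold_scores(X, y, K, fit, predict, seed=0):
    """K-fold cross-validation: K tests, each fold serving once as test data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return scores

# Toy model: classify a point by its nearest class mean
def fit(X, y):
    return X[y == -1].mean(axis=0), X[y == 1].mean(axis=0)

def predict(model, X):
    m_neg, m_pos = model
    d_neg = np.linalg.norm(X - m_neg, axis=1)
    d_pos = np.linalg.norm(X - m_pos, axis=1)
    return np.where(d_pos < d_neg, 1, -1)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)), rng.normal(2, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)
scores = kfold_scores(X, y, K=4, fit=fit, predict=predict)
print(scores, np.mean(scores))
```

Setting K = n in this loop gives the leave-1-out scheme described next.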
Leave-1-out: n tests (recommended when n is small).
Each observation xi is left out in turn; the model is learnt on the remaining observations and tested on xi. In this example the test succeeds for every left-out observation except one.
4) High dimensionality
High dimensionality… regularization, model selection, and/or dimensionality reduction
4.a) High dimensionality — Example of problem
Project context:
• Observations = MRI images of the brain at different acquisition times (ADNI*)
• Labels = patient state (MCI/AD)
• Can Alzheimer's disease be predicted from the morphological evolution of the hippocampus?
[Baseline] [Baseline + 12 months]
Initial data: • [Baseline] : n = 103 patients are MCI • [Baseline + 12 months] : 84 patients are MCI / 19 patients are AD
Hippocampus
* http://adni.bmap.ucla.edu/
For each of the n = 103 observations (patients):
• xi : evolution marker on the template (p = 20000 points)
• yi : state, AD or MCI
Questions:
• Is it possible to discriminate MCI and AD patients based on the shape evolution?
• How to learn the most discriminant markers?
[Figure: the [Baseline] and [Baseline + 12 months] images of the 103 subjects, split into the MCI and AD groups, with a registration computed for each subject]
Treatment 1: Estimate deformations between [Baseline] and [Baseline + 12 months] [Ourselin et al. Im Vis Comp., 2001], [Vialard et al. IJCV, 2012]
Treatment 2 : Transport the evolution markers onto a template/average shape
where:
X ∈ R^(n×p) : matrix of the n = 103 observations in dimension p = 20000
y ∈ {−1, 1}^n : state ( AD = −1 / MCI = 1 )
(w , b) ∈ R^p × R : parameters to estimate
y = F( X w + b ) (by definition), where y is a (# subjects × 1) vector, X is a (# subjects × # points) matrix, and w is a (# points × 1) vector.
Logistic regression: a predictive model that defines the probability of the yi depending on the xi :
Log-likelihood optimization:
Regularization parameter (mandatory since p > n)
n = 1 and p = 1 : 2x = 3 → OK
n = 1 and p = 2 : 2x1 + 3x2 = 3 → KO (under-determined)
n = 2 and p = 2 : 2x1 + 3x2 = 3 ; 3x1 + 1x2 = 1 → OK
n = 2 and p = 4 : 2x1 + 3x2 + 1x3 − x4 = 1 ; 5x1 − x2 + 2x3 + x4 = 1 → KO (under-determined)
4.b) High dimensionality — Effect of regularization
Optimization of w using: Lewis & Overton, Nonsmooth optimization via quasi-Newton methods. Math. Programming 2012
Tested regularization models (standard forms; the exact weightings follow the cited implementation):
(1) Ridge : λ ||w||₂²
(2) LASSO : λ ||w||₁
(3) Elastic net : λ1 ||w||₁ + λ2 ||w||₂²
(4) Sobolev semi-norm : λ ||∇w||₂²
(5) Total Variation : λ ||∇w||₁
(6) Fused LASSO : λ1 ||w||₁ + λ2 ||∇w||₁
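The six penalties can be written in a few lines for a 1D weight vector, taking the gradient as finite differences (a sketch; the mixing parameter a and the weightings are illustrative). Piecewise-constant weights are cheap for Total Variation, while a smooth ramp is cheap for the Sobolev semi-norm:

```python
import numpy as np

def ridge(w):      return np.sum(w ** 2)                               # (1)
def lasso(w):      return np.sum(np.abs(w))                            # (2)
def elastic_net(w, a=0.5):  return a * lasso(w) + (1 - a) * ridge(w)   # (3)
def sobolev(w):    return np.sum(np.diff(w) ** 2)                      # (4)
def total_var(w):  return np.sum(np.abs(np.diff(w)))                   # (5)
def fused_lasso(w, a=0.5):  return a * lasso(w) + (1 - a) * total_var(w)  # (6)

piecewise = np.array([0., 0., 2., 2., 2., 0., 0.])   # sharp transitions
ramp = np.linspace(0., 2., 7)                        # smooth increase

# TV only charges for the total jump: 4.0 for the plateau, 2.0 for the ramp;
# the Sobolev semi-norm strongly penalizes the sharp jumps (8.0 vs ~0.67)
print(total_var(piecewise), total_var(ramp))
print(sobolev(piecewise), sobolev(ramp))
```

This is why TV and fused LASSO allow the clear spatial transitions mentioned in the results below, while the Sobolev semi-norm favours smooth weight maps.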
4.b) High dimensionality — Effect of regularization (and model selection)
(1) Ridge (4) Sobolev semi-norm
(2) LASSO (5) Total Variation
(3) Elastic net (6) Fused LASSO
Representation of w for three values of λ on a slice of the hippocampus:
• Blue and red: strong local influence
• Green: little or no local influence
4.b) High dimensionality — Effect of regularization
Results obtained using a cross-validation method (here leave-10%-out):
• Spec + Sens = 2 : good prediction in 100% of the cases
• Spec + Sens = 1 : same predictive power as flipping a coin (heads or tails)
• Spec + Sens = 0 : good prediction in 0% of the cases
The best results are obtained using a regularization pertinent with regard to the data:
• The spatial distribution is taken into account
• Clear transitions are allowed
[Fiot J.B. et al., NeuroImage: Clinical, 2012]
4.c) High dimensionality — Dimensionality reduction using SVD
National records (in seconds) of p = 9 athletic events for n = 26 countries:
Country 100m 200m 400m 800m 1500m 5000m 10000m SemiMarathon Marathon
Australie 9.93 20.06 44.38 104.40 211.96 775.76 1649.73 3602 7671
Belgique 10.02 20.19 44.78 103.86 214.13 769.71 1612.30 3605 7640
Brésil 10.00 19.89 44.29 101.77 213.25 799.43 1648.12 3573 7565
RoyaumeUni 9.87 19.87 44.36 101.73 209.67 780.41 1638.14 3609 7633
Canada 9.84 20.17 44.44 103.68 211.71 793.96 1656.01 3650 7809
Chine 10.17 20.54 45.25 106.44 216.49 805.14 1670.00 3635 7695
Croatie 10.25 20.76 45.64 104.07 213.30 817.76 1704.32 3827 8225
Ethiopie 10.50 21.08 45.89 106.08 211.13 757.35 1577.53 3535 7439
France 9.99 20.16 44.46 103.15 208.98 778.83 1642.78 3658 7596
Allemagne 10.06 20.20 44.33 103.65 211.58 774.70 1641.53 3634 7727
Inde 10.30 20.73 45.48 105.77 218.00 809.70 1682.89 3672 7920
Iran 10.29 21.11 46.37 104.74 218.80 833.40 1762.65 4103 8903
Italie 10.01 19.72 45.19 103.17 212.78 785.59 1636.50 3620 7642
Jamaïque 9.58 19.19 44.49 105.21 219.19 813.10 1712.44 3816 8199
Japon 10.00 20.03 44.78 106.18 217.42 793.20 1655.09 3625 7576
Kenya 10.26 20.43 44.18 102.01 206.34 759.74 1587.85 3513 7467
Lituanie 10.33 20.88 45.73 106.64 220.90 797.90 1651.50 3851 7955
NouvelleZélande 10.11 20.42 46.09 104.30 212.17 790.19 1661.95 3732 7815
Portugal 9.86 20.01 46.11 104.91 210.07 782.86 1632.47 3665 7596
Russie 10.10 20.23 44.60 102.47 212.28 791.99 1673.12 3675 7747
AfriqueduSud 10.06 20.11 44.59 102.69 213.56 794.16 1649.94 3678 7593
Espagne 10.14 20.59 44.96 103.83 208.95 782.54 1634.44 3592 7562
Suède 10.18 20.30 44.56 105.54 216.49 797.59 1675.74 3655 7838
Suisse 10.16 20.41 44.99 102.55 211.75 787.54 1673.16 3686 7643
Ukraine 10.07 20.00 45.11 105.08 210.33 790.78 1679.80 3711 7635
USA 9.69 19.32 43.18 102.60 209.30 776.27 1633.98 3583 7538
How to establish a general ranking between these countries?
One option: a weighted sum of the scores, then a ranking of these sums.
Weighted sum of the scores is equivalent to a matrix x vector multiplication:
Vector containing the scores = M . w
One can also look for a vector of norm 1 that maximizes the variability between the scores
Optimal vector = 1st singular vector (v1) of M, given by the SVD
Variability level = 1st singular value (λ1), given by the SVD
Vector of scores with the highest variability = M . v1
One can now search for the vector of norm 1, orthogonal to v1, that maximizes the variability
Optimal vector = 2nd eigenvector (v2) of the SVD
Variability level = 2nd eigenvalue (λ2) of the SVD
… and so on. Can be calculated analytically.
Matrix M
(same table of national records as above)
Black: Projection of the data on PC2 and PC3
Red: Influence of the variables in PC2 and PC3
4.c) High dimensionality — Dimensionality reduction using SVD
A fantastic tool to visualize and interpret high-dimensional data …
Scree plot of the eigenvalues
4.c) High dimensionality — Dimensionality reduction using SVD
… and a powerful tool to reduce the problem dimensionality before training an M.L. model
Captured variability (y-axis of the scree plot)
            proj.PC1 proj.PC2 proj.PC3
Australie      …        …        …
Belgique       …        …        …
Brésil         …        …        …
RoyaumeUni     …        …        …
Canada         …        …        …
Chine          …        …        …
Croatie        …        …        …
Ethiopie       …        …        …
France         …        …        …
Allemagne      …        …        …
…
(first rows of the 9-variable national records table, as above)
Projection of the data from a 9D space to a 3D space preserves here 90% of their variability!
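The dimensionality reduction itself is a simple matrix product, and the captured variability is read from the eigenvalue spectrum. A minimal sketch with NumPy, again on synthetic data (on the real standardized records table the 3D projection captures about 90%; on random data the ratio is lower):

```python
import numpy as np

# Synthetic stand-in for the standardized 26x9 records table (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(26, 9))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the 9 variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals = s ** 2                             # eigenvalues shown in the scree plot

proj = X @ Vt[:3].T                          # projection from 9D to 3D (PC1..PC3 scores)
captured = eigvals[:3].sum() / eigvals.sum() # captured variability ratio

print(proj.shape)                            # (26, 3)
```

The reduced 26×3 table can then replace the original 26×9 one as input of a learning algorithm.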
5) Supervised learning using Neural Networks
Supervised learning using Neural Networks
https://pythonprogramming.net/neural-networks-machine-learning-tutorial/
Deep learning …
(figure labels: Known, Learned, Predicted)
• Very efficient in important applications (signal, images).
• Computationally heavy learning phase, but quick predictions.
• Very large number of parameters to learn.
• Requires large databases of annotated data (or wise network designs).
5) Supervised learning using Neural Networks
Prediction with the classifier (black-box), from the user's point of view:
I = 200×200 RGB image of a dog or a cat
The classifier returns two predictions h1(I) and h2(I):
• If dog: h1(I) == 0
• If cat: h1(I) == 1
• If nice: h2(I) == 0
• If aggressive: h2(I) == 1
5) Supervised learning using Neural Networks — User point of view
Example predictions of the black-box classifier on three input images:
• (h1(I), h2(I)) == (1, 0) → a nice cat
• (h1(I), h2(I)) == (1, 1) → an aggressive cat
• (h1(I), h2(I)) == (0, 0) → a nice dog
Training phase: parameter optimization to get the best predictions on average
Input training data: many images of dogs and cats
Output training data: labels of each image
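The training set therefore pairs each image with its label vector. A toy sketch of how such arrays could be assembled (shapes, names, and values are illustrative stand-ins, not the course's data):

```python
import numpy as np

# Toy stand-ins for the training set (illustrative only)
n_images = 4
rng = np.random.default_rng(2)

# Input training data: n_images flattened 200x200 RGB images
X_train = rng.random((n_images, 200 * 200 * 3))

# Output training data: one (h1, h2) label pair per image
# h1: 0 = dog, 1 = cat ; h2: 0 = nice, 1 = aggressive
Y_train = np.array([[1, 0],   # nice cat
                    [1, 1],   # aggressive cat
                    [0, 0],   # nice dog
                    [0, 1]])  # aggressive dog

print(X_train.shape, Y_train.shape)   # (4, 120000) (4, 2)
```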
5) Supervised learning using Neural Networks — User point of view
Classifier (black-box)
The xi are typically the intensities of an RGB image I in each of its channels.
Predicted labels
5) Supervised learning using Neural Networks — Into the black-box
Input layer Hidden layers Output layer
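Inside the black box, each layer applies an affine map followed by a non-linearity. A bare NumPy sketch of the forward pass, where the layer sizes and the ReLU/sigmoid choices are illustrative assumptions rather than the course's exact network:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes: 9 inputs, two hidden layers, 2 outputs (h1, h2)
rng = np.random.default_rng(3)
sizes = [9, 16, 8, 2]
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(W @ h + b)                 # hidden layer: affine map + non-linearity
    return sigmoid(Ws[-1] @ h + bs[-1])     # output layer squashed into [0, 1]

y = forward(rng.normal(size=9))
print(y.shape)   # (2,)
```

Training then consists of adjusting the weights Ws and biases bs so that these outputs match the known labels.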
5) Supervised learning using Neural Networks — Into the black-box
Layer 1 Layer 2 Layer 3 Layer L
Minimize the expectation of the prediction error (in practice, approximated by its average over the K training observations):
5) Supervised learning using Neural Networks — Into the black-box
Predicted labels
Known labels
Stochastic gradient descent, where the gradient is:
• calculated on a subsample of the K observations at each iteration (batch)
• calculated analytically if l = L-1
• back-propagated if l < L-1
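The slide's update formula did not survive extraction; a standard way to write the mini-batch SGD step it describes (the notation below is a common convention, not necessarily the slide's) is:

```latex
\theta^{(t+1)} \;=\; \theta^{(t)} \;-\; \eta \,\nabla_{\theta}\,
\frac{1}{|B_t|} \sum_{k \in B_t} \ell\!\left(f_{\theta}(x_k),\, y_k\right),
\qquad B_t \subset \{1, \dots, K\}
```

where η is the learning rate, B_t is the batch sampled at iteration t, f_θ the network, and ℓ the prediction error on one observation.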
5) Supervised learning using Neural Networks — Into the black-box
Optimization of the expectation (average) of:
In practice:
• Various types of layers
• Various types of architectures
• Various strategies to perform the stochastic gradient descent
Important hidden properties of N.N.:
• Prediction and training can be straightforwardly parallelized on GPUs
• Nvidia cuDNN library massively used by Keras, TensorFlow, Theano, PyTorch, …
That’s all for now
THANK YOU !!!