
Page 1

Introduction to Machine Learning and Deep Learning

CNRS Formation Entreprise

Stéphane Gaïffas

Page 2

Who are we?

Stéphane Gaïffas

Professor

LPSM, Université de Paris

DMA, École normale supérieure

http://stephanegaiffas.github.io

[email protected]

Page 3

Who are we?

Page 4

Webpage of the course

https://stephanegaiffas.github.io/teaching/formation_cnrs

Tools

Practical sessions using Python 3 / Jupyter notebooks, with scikit-learn, and TensorFlow with Keras for deep learning

Page 5

How to get all the tools? The simplest way is to use Docker

Follow the steps described on the webpage: https://stephanegaiffas.github.io/teaching/formation_cnrs

Page 6

Agenda: Day 1

Morning Course

Introduction

Binary classification, Naive Bayes

Logistic regression

Standard metrics and recipes: overfitting, cross-validation

Regularization (Ridge, Lasso)

Afternoon practical session

Text classification

Scoring

Page 7

Agenda: Day 2

Morning Course

Decision trees, CART

Random Forests

Boosting, Gradient Boosting

Afternoon practical session

Prediction of load curves

Continuous labels / time series

Page 8

Agenda: Day 3

Morning Course

Introduction to Deep Learning

Feed-forward neural networks

Convolutional neural networks

Stochastic gradient descent, back-propagation

Regularization: dropout, penalization, early stopping

Afternoon practical session

Image classification: MNIST and NotMNIST

Comparison of different architectures

Text classification again

Page 9

Roundtable

Page 10

Teasers: data science or statistics?

Page 11

Teasers: an application in marketing (Real Time Bidding)

A customer visits a webpage with his browser: a complex process of content selection and delivery begins.

An advertiser might want to display an ad on the webpage the user is visiting. The webpage belongs to a publisher.

The publisher sells ad space to advertisers who want to reach customers

In some cases, an auction starts: RTB (Real Time Bidding)

Page 12

Teasers: an application in marketing (Real Time Bidding)

Advertisers have 10 ms (!) to give a price: they need to assess quickly how willing they are to display the ad to this customer

Machine learning is used here to predict the probability of a click on the ad. Time constraint: few model parameters, to answer quickly

Feature selection / dimension reduction is crucial here

The full process takes < 100 ms

Page 13

Teasers: an application in marketing (Real Time Bidding)

Some figures:

10 million predictions of click probability per second

answers within 10 ms

stores 20 terabytes of data daily

Aim

Based on past data, you want to find users that will click on some ads

This problem can be formulated as a binary classification problem

Page 14

Classification = supervised learning with a binary label

Setting

You have past/historical data, containing data about individuals $i = 1, \dots, n$

You have a feature vector $x_i \in \mathbb{R}^d$ for each individual $i$

For each $i$, you know if he/she clicked ($y_i = 1$) or not ($y_i = -1$)

We call $y_i \in \{-1, 1\}$ the label of $i$

$(x_i, y_i)$ are i.i.d. realizations of $(X, Y)$

Aim

Given a feature vector $x$ (with no corresponding label), predict a label $y \in \{-1, 1\}$. Use the data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ to construct a classifier

Page 15

Many ways to separate points!

Page 16

Today: model-based classification

Naive Bayes

Logistic regression

Penalization

Cross-validation

Page 17

Probabilistic / statistical approach

Model the distribution $P(y \mid x)$ of $y$ knowing, or conditionally on, $x$ (denoted $y \mid x$)

Construct estimators $\hat P(1 \mid x)$ and $\hat P(-1 \mid x)$ of

$$P(1 \mid x) \quad \text{and} \quad P(-1 \mid x) = 1 - P(1 \mid x)$$

Given $x$, classify using

$$\hat y = \begin{cases} 1 & \text{if } \hat P(1 \mid x) \ge t \\ -1 & \text{otherwise} \end{cases}$$

for some threshold $t \in (0, 1)$

Page 18

Bayes formula. We know that

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} = \frac{P(x \mid y)\,P(y)}{\sum_{y' \in \{-1, 1\}} P(x \mid y')\,P(y')}$$

If we know the distribution of $x \mid y$ and the distribution of $y$, we know the distribution of $y \mid x$

Bayes classifier. Classify using Bayes' formula, given that:

We model $P(x \mid y)$

We are able to estimate $P(x \mid y)$ based on data

Page 19

Maximum a posteriori. Classify using the discriminant functions

$$\delta(y \mid x) = \log P(x \mid y) + \log P(y)$$

for $y = 1, -1$, and decide (largest, beyond a threshold, etc.)

Remark.

Different models for the distribution of $x \mid y$ lead to different classifiers

The simplest one is Naive Bayes

Page 20

Naive Bayes. A crude model for $P(x \mid y)$: assume that the features (coordinates) $x_j$ of $x$ are independent conditionally on $y$:

$$P(x \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$

and model the univariate distribution $x_j \mid y$, for instance

$$P(x_j \mid y) = \mathrm{Normal}(\mu_{j,y}, \sigma^2_{j,y}),$$

with parameters $\mu_{j,y}$ and $\sigma^2_{j,y}$ easily estimated by averages

If a feature is discrete, use a Bernoulli or multinomial distribution

Leads to a classifier which is very easy to compute

Requires only the computation of some averages
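To make the "only averages" point concrete, here is a minimal sketch (not from the slides) using scikit-learn's GaussianNB on synthetic data; the dataset, its sizes and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative synthetic data: n samples, d continuous features, binary labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian Naive Bayes: fits one mean and one variance per (feature, class),
# i.e. only averages are computed, as stated on the slide
clf = GaussianNB()
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Estimated class-conditional means, shape:", clf.theta_.shape)  # (n_classes, d)
```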

Page 21

Important notation. If $a, b \in \mathbb{R}^d$ are two vectors (of the same dimension $d$), then

$$a^\top b = \langle a, b \rangle = \sum_{j=1}^{d} a_j b_j$$

is the inner product between $a$ and $b$

Remark. The "simplest" (non-constant) function $\mathbb{R}^d \to \mathbb{R}$ is a linear function

$$x \mapsto x^\top w + b$$

with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$

For $d = 1$ it is simply a straight line

Page 22

Logistic regression

By far the most widely used classification algorithm

We want to explain the label $y$ based on $x$: we want to "regress" $y$ on $x$

It models the conditional distribution $y \mid x$. For $y \in \{-1, 1\}$, we consider the model

$$P(1 \mid x) = \sigma(x^\top w + b)$$

where $w \in \mathbb{R}^d$ is a vector of model weights and $b \in \mathbb{R}$ is the intercept, and where $\sigma$ is the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
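As a quick illustration, here is a minimal NumPy sketch of this model (not from the slides); `w_hat` and `b_hat` stand for some already-fitted weights and intercept, chosen arbitrarily here.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) = sigmoid(x^T w + b) for each row x of X."""
    return sigmoid(X @ w + b)

# Illustrative values only
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # 5 samples, d = 3 features
w_hat, b_hat = np.array([0.5, -1.0, 2.0]), 0.1  # hypothetical fitted parameters
print(predict_proba(X, w_hat, b_hat))           # probabilities in (0, 1)
```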

Page 23

The sigmoid choice really is a choice. It is a modelling choice.

It is a way to map $\mathbb{R} \to [0, 1]$ (we want to model a probability)

We could also consider

$$P(1 \mid x) = F(x^\top w + b)$$

for any distribution function $F$. Another popular choice is the Gaussian distribution

$$F(z) = P(N(0, 1) \le z),$$

which leads to another loss, called the probit

Page 24

However, the sigmoid choice has the following nice interpretation: an easy computation leads to

$$\log\left( \frac{P(1 \mid x)}{P(-1 \mid x)} \right) = x^\top w + b$$

This quantity is called the log-odds ratio

Note that $P(1 \mid x) \ge P(-1 \mid x)$ iff $x^\top w + b \ge 0$.

This is a linear classification rule

Linear with respect to the considered features $x$

But you choose the features: feature engineering (more on that later)

Page 25

Estimation of w and b

We have a model for $y \mid x$. The data $(x_i, y_i)$ are assumed i.i.d. with the same distribution as $(x, y)$

Compute estimators $\hat w$ and $\hat b$ by maximum likelihood estimation

Or equivalently, minimize the minus log-likelihood

More generally, when a model is used

Goodness-of-fit = $-$ log-likelihood

The log is used mainly since averages are easier to study (and to compute) than products

Page 26

For logistic regression, a computation leads to the formula

$$-\mathrm{loglikelihood}(w, b) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)$$

So, in order to fit the logistic regression, we need to find

$$(\hat w, \hat b) \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)$$

(we just added the $1/n$ factor so that it is an average)

Page 27

Compute $\hat w$ and $\hat b$ as follows:

$$(\hat w, \hat b) \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)$$

If we introduce the logistic loss function

$$\ell(y, y') = \log(1 + e^{-y y'})$$

then

$$(\hat w, \hat b) \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w + b)$$
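A direct NumPy transcription of this objective, as a sketch under the slides' conventions (labels in $\{-1, 1\}$); the synthetic data is illustrative only.

```python
import numpy as np

def logistic_loss(y, z):
    """Logistic loss l(y, y') = log(1 + exp(-y * y'))."""
    return np.log1p(np.exp(-y * z))

def objective(w, b, X, y):
    """Average logistic loss (1/n) * sum_i l(y_i, x_i^T w + b), with y_i in {-1, 1}."""
    return logistic_loss(y, X @ w + b).mean()

# Illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100))
print(objective(np.zeros(4), 0.0, X, y))  # equals log(2) at w = 0, b = 0
```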

Page 28

A goodness-of-fit

$$(\hat w, \hat b) \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, x_i^\top w + b)$$

is natural: it is an average of losses, one for each sample point

Note that

$\ell(y, y') = \log(1 + e^{-y y'})$ for logistic regression

$\ell(y, y') = \frac{1}{2}(y - y')^2$ for least-squares linear regression

These problems are convex and smooth

In most cases, training a model (such as logistic regression) means finding the $w$ and $b$ that minimize some function

Finding a minimizer requires an optimization algorithm (also called a solver)

Page 29

Training = Optimization

Training the model means minimizing some function to find the model weights $w$ and $b$

Usually done in machine learning using first-order information, i.e. the gradient (derivative) of the function, $\nabla f(w)$

The simplest algorithm is gradient descent, which iteratively performs

$$w^{t+1} \leftarrow w^t - \eta_t \nabla f(w^t)$$

where $t$ is the iteration counter of the algorithm and $(\eta_t)$ is a sequence of step sizes, also called learning rates
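Below is a sketch of plain gradient descent applied to the average logistic loss above; the constant step size `eta` and the number of iterations are arbitrary illustrative choices, not recommendations from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.5, n_iter=200):
    """Gradient descent on f(w, b) = (1/n) sum_i log(1 + exp(-y_i (x_i^T w + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)        # y_i (x_i^T w + b)
        coeffs = -y * sigmoid(-margins)  # derivative of loss_i w.r.t. (x_i^T w + b)
        grad_w = X.T @ coeffs / n        # gradient with respect to w
        grad_b = coeffs.mean()           # gradient with respect to b
        w -= eta * grad_w                # w^{t+1} = w^t - eta * grad
        b -= eta * grad_b
    return w, b

# Illustrative use on synthetic data with labels in {-1, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sign(X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500))
w_hat, b_hat = gradient_descent(X, y)
print("Train accuracy:", np.mean(np.sign(X @ w_hat + b_hat) == y))
```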

Page 30

Training = Optimization

Many solvers out there!

Page 31

Other classical loss functions for binary classification

Hinge loss (SVM): $\ell(y, y') = (1 - y y')_+$

Quadratic hinge loss (SVM): $\ell(y, y') = \frac{1}{2}(1 - y y')_+^2$

Huber loss: $\ell(y, y') = -4 y y' \, \mathbf{1}_{y y' < -1} + (1 - y y')_+^2 \, \mathbf{1}_{y y' \ge -1}$

These losses can be understood as convex approximations of the 0/1 loss $\ell(y, y') = \mathbf{1}_{y y' \le 0}$
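For reference, a NumPy sketch of these losses as functions of the label `y` and the prediction `y'` (called `z` below), matching the formulas above; this is illustrative code, not from the slides.

```python
import numpy as np

def zero_one_loss(y, z):
    return (y * z <= 0).astype(float)            # 1_{y y' <= 0}

def hinge_loss(y, z):
    return np.maximum(1.0 - y * z, 0.0)          # (1 - y y')_+

def quadratic_hinge_loss(y, z):
    return 0.5 * np.maximum(1.0 - y * z, 0.0) ** 2

def huber_loss(y, z):
    m = y * z
    return np.where(m < -1.0, -4.0 * m, np.maximum(1.0 - m, 0.0) ** 2)

# Hinge, quadratic hinge and Huber are convex surrogates of the 0/1 loss,
# all expressed through the margin m = y * y'
m = np.linspace(-3, 3, 7)
print(hinge_loss(1.0, m))
```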

Page 32

Classifiers on toy datasets

[Figure: input data and the decision boundaries of Naive Bayes and logistic regression on toy datasets, with test accuracies displayed on each panel]

Page 33

Standard error metrics in classification

Precision, Recall, F-Score, AUC

For each sample i we have

an actual label $y_i$

a predicted label $\hat y_i$

We can construct the confusion matrix

with

$$\mathrm{TP} = \sum_{i=1}^{n} \mathbf{1}_{y_i = 1,\, \hat y_i = 1} \qquad \mathrm{TN} = \sum_{i=1}^{n} \mathbf{1}_{y_i = -1,\, \hat y_i = -1}$$

$$\mathrm{FN} = \sum_{i=1}^{n} \mathbf{1}_{y_i = 1,\, \hat y_i = -1} \qquad \mathrm{FP} = \sum_{i=1}^{n} \mathbf{1}_{y_i = -1,\, \hat y_i = 1}$$

with yes = 1 and no = −1

Page 34

Standard error metrics in classification

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\#(\text{predicted P})} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\#(\text{real P})} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

$$\text{F-Score} = 2\, \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Some vocabulary

Recall = Sensitivity

False Discovery Rate: $\mathrm{FDR} = 1 - \mathrm{Precision}$
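These counts and metrics are also available directly in scikit-learn; here is a minimal sketch with made-up labels, using yes = 1 and no = -1 as on the slides.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score)

y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])   # actual labels
y_pred = np.array([1, -1, -1, 1, 1, -1, 1, -1])   # predicted labels

# Rows = actual, columns = predicted, with the label order [-1, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("F-score:  ", f1_score(y_true, y_pred))
```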

Page 35

ROC Curve (Receiver Operating Characteristic)

Based on the estimated probabilities $\hat p_{i,1} = \hat P(1 \mid x_i)$. Each point $A_t$ of the curve has coordinates $(\mathrm{FPR}_t, \mathrm{TPR}_t)$, where $\mathrm{FPR}_t$ and $\mathrm{TPR}_t$ are the FPR and TPR of the confusion matrix obtained with the classification rule

$$\hat y_i = \begin{cases} 1 & \text{if } \hat p_{i,1} \ge t \\ -1 & \text{otherwise} \end{cases}$$

for a threshold $t$ varying in $[0, 1]$

The AUC score is the Area Under the ROC Curve
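A sketch of how the ROC curve and the AUC are obtained in scikit-learn from estimated probabilities; the dataset and the classifier below are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # estimated P(1 | x_i)

# One (FPR_t, TPR_t) point per threshold t on the estimated probabilities
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))
```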

Page 36

Page 37

Penalization to avoid overfitting

Computing

$$\hat w, \hat b \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle x_i, w \rangle + b)$$

generally leads to a bad classifier. Minimize instead

$$\hat w, \hat b \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle x_i, w \rangle + b) + \frac{1}{C}\, \mathrm{pen}(w) \right\}$$

where

pen is a penalization function; it forbids $w$ to be "too complex"

$C > 0$ is a tuning or smoothing parameter that balances goodness-of-fit and penalization

Page 38

Penalization to avoid overfitting

In the problem

$$\hat w, \hat b \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle x_i, w \rangle + b) + \frac{1}{C}\, \mathrm{pen}(w) \right\},$$

a well-chosen $C > 0$ allows us to avoid overfitting
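Continuing the earlier NumPy sketch of the logistic objective, adding a penalization weighted by $1/C$ changes a single line; the ridge penalty used here is just one possible choice of $\mathrm{pen}(w)$, and all data below are illustrative.

```python
import numpy as np

def logistic_loss(y, z):
    return np.log1p(np.exp(-y * z))

def penalized_objective(w, b, X, y, C=1.0):
    """(1/n) sum_i l(y_i, <x_i, w> + b) + (1/C) * pen(w), with pen(w) = ||w||_2^2 / 2."""
    goodness_of_fit = logistic_loss(y, X @ w + b).mean()
    pen = 0.5 * np.dot(w, w)
    return goodness_of_fit + pen / C

# Small C = strong penalization, large C = weak penalization
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=200))
w = rng.normal(size=5)
print(penalized_objective(w, 0.0, X, y, C=0.1),
      penalized_objective(w, 0.0, X, y, C=10.0))
```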

Page 39

Overfitting is what you want to avoid

Page 40

Which penalization? The ridge penalization considers

$$\mathrm{pen}(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} \sum_{j=1}^{d} w_j^2$$

It penalizes the "magnitude" of $w$

This is the most widely used penalization

It’s nice and easy

It helps with correlated features

It actually helps training! (it makes the problem even more convex...)

Page 41

There is another desirable property for $w$

If $w_j = 0$, then feature $j$ has no impact on the prediction:

$$\hat y = \mathrm{sign}(\langle x, \hat w \rangle + \hat b)$$

If we have many features ($d$ is large), it would be nice if $\hat w$ contained zeros, and many of them

It means that only a few features are statistically relevant

It means that only a few features are useful to predict the label

It leads to a simpler model, with a "reduced" dimension

How to do it?

Page 42

Tempting to use

$$\hat w, \hat b \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle x_i, w \rangle + b) + \frac{1}{C}\, \|w\|_0 \right\},$$

where $\|w\|_0 = \#\{j \in \{1, \dots, d\} : w_j \ne 0\}$.

To solve this, we would need to explore all possible supports of $w$. Too long! (NP-hard)

Page 43

A convex proxy of $\|\cdot\|_0$: the $\ell_1$-norm

$$\|w\|_1 = \sum_{j=1}^{d} |w_j|$$

A computation gives that

$$\operatorname*{argmin}_{z' \in \mathbb{R}} \left\{ \frac{1}{2}(z' - z)^2 + \lambda |z'| \right\} = \mathrm{sign}(z)\,(|z| - \lambda)_+$$

for $\lambda > 0$ and $z \in \mathbb{R}$. Example with $\lambda = 1$:
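The closed form above is the soft-thresholding operator; this short NumPy sketch evaluates it for the example with $\lambda = 1$ (the figure from the original slide is not reproduced here).

```python
import numpy as np

def soft_threshold(z, lam):
    """argmin over z' of (1/2)(z' - z)^2 + lam * |z'|  =  sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.linspace(-3, 3, 7)
print(soft_threshold(z, lam=1.0))
# [-2. -1. -0.  0.  0.  1.  2.]  -- values in [-1, 1] are set exactly to zero
```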

Page 44

Particular instances of the problem

$$\hat w, \hat b \in \operatorname*{argmin}_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle x_i, w \rangle + b) + \frac{1}{C}\, \mathrm{pen}(w) \right\},$$

For $\ell(y, y') = \frac{1}{2}(y - y')^2$ and $\mathrm{pen}(w) = \frac{1}{2}\|w\|_2^2$, the problem is called ridge regression

For $\ell(y, y') = \frac{1}{2}(y - y')^2$ and $\mathrm{pen}(w) = \|w\|_1$, the problem is called the Lasso (Least Absolute Shrinkage and Selection Operator)

For $\ell(y, y') = \log(1 + e^{-y y'})$ and $\mathrm{pen}(w) = \|w\|_1$, the problem is called $\ell_1$-penalized logistic regression

Many combinations are possible...

Page 45

The combinations

(linear regression or logistic regression) + (ridge or $\ell_1$)

are the most widely used

Another penalization is

$$\mathrm{pen}(w) = \frac{1}{2}\|w\|_2^2 + \alpha \|w\|_1,$$

called elastic-net, which benefits from the advantages of both the ridge and $\ell_1$ penalizations (where $\alpha \ge 0$ balances the two)
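All of these combinations exist as scikit-learn estimators; here is a minimal sketch. Note that scikit-learn parametrizes the trade-off with `alpha` for the regression models and with `C` for logistic regression, with scalings that differ slightly from the $1/C$ convention of the slides, so the numerical values below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.datasets import make_regression, make_classification

# Least squares + ridge, l1 (Lasso), or both (elastic-net)
X_reg, y_reg = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)
ridge = Ridge(alpha=1.0).fit(X_reg, y_reg)
lasso = Lasso(alpha=0.5).fit(X_reg, y_reg)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_reg, y_reg)
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0), "out of", lasso.coef_.size)

# Logistic loss + l1 penalization (l1-penalized logistic regression)
X_clf, y_clf = make_classification(n_samples=200, n_features=20, random_state=0)
logreg_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X_clf, y_clf)
print("non-zero logistic coefficients:", np.sum(logreg_l1.coef_ != 0))
```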

Page 46

Cross-validation

Generalization is the goal of supervised learning

A trained classifier has to be generalizable. It must be able to work on data other than the training dataset

Generalizable means "works without overfitting"

This can be achieved using cross-validation

There is no machine learning without cross-validation at some point!

In the case of penalization, we need to choose a penalization parameter $C$ that leads to a model that "generalizes" well

Page 47

V-Fold cross-validation

The most standard cross-validation technique

Take $V = 5$ or $V = 10$. Pick a random partition $I_1, \dots, I_V$ of $\{1, \dots, n\}$, where $|I_v| \approx n / V$ for any $v = 1, \dots, V$

Page 48

Consider a set

$$\mathcal{C} = \{C_1, \dots, C_K\}$$

of possible values for $C$. For each $v = 1, \dots, V$:

Put $I_{v,\mathrm{train}} = \cup_{v' \ne v} I_{v'}$ and $I_{v,\mathrm{test}} = I_v$

For each $C \in \mathcal{C}$, find

$$\hat w_{v,C} \in \operatorname*{argmin}_{w} \left\{ \frac{1}{|I_{v,\mathrm{train}}|} \sum_{i \in I_{v,\mathrm{train}}} \ell(y_i, \langle x_i, w \rangle) + \frac{1}{C}\, \mathrm{pen}(w) \right\}$$

Take

$$\hat C \in \operatorname*{argmin}_{C \in \mathcal{C}} \; \sum_{v=1}^{V} \sum_{i \in I_{v,\mathrm{test}}} \ell(y_i, \langle x_i, \hat w_{v,C} \rangle)$$

Remark: depending on the problem, we might use a different loss (or score) for choosing $C$
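A sketch of this procedure using scikit-learn's KFold and a ridge-penalized LogisticRegression; the grid of C values and the use of accuracy as the selection score are illustrative choices, not prescriptions from the slides. In practice, GridSearchCV wraps exactly this kind of loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

C_grid = [0.01, 0.1, 1.0, 10.0, 100.0]                 # candidate values C_1, ..., C_K
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # V = 5 random folds I_1, ..., I_V

scores = {C: [] for C in C_grid}
for train_idx, test_idx in kf.split(X):                # I_{v,train}, I_{v,test}
    for C in C_grid:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X[train_idx], y[train_idx])
        scores[C].append(clf.score(X[test_idx], y[test_idx]))

# Pick the C with the best averaged cross-validation score
C_best = max(C_grid, key=lambda C: np.mean(scores[C]))
print({C: round(np.mean(s), 3) for C, s in scores.items()}, "-> best C:", C_best)
```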

Page 49

Training error:

$$C \mapsto \sum_{v=1}^{V} \sum_{i \in I_{v,\mathrm{train}}} \ell(y_i, \langle x_i, \hat w_{v,C} \rangle)$$

Testing, validation or cross-validation error:

$$C \mapsto \sum_{v=1}^{V} \sum_{i \in I_{v,\mathrm{test}}} \ell(y_i, \langle x_i, \hat w_{v,C} \rangle)$$

Page 50

This afternoon: practical session

Text classification

Scoring

Page 51

Thank you!