TRANSCRIPT
Introduction to Machine Learning
Gilles Gasso
INSA Rouen - ASI Department, LITIS Lab
September 5, 2019
Gilles Gasso Introduction to Machine Learning 1 / 32
Machine Learning: introduction
Machine Learning ≡ data-based programming
Ability of computers to learn how to perform tasks (classification, detection, translation, ...) without being explicitly programmed
Study of algorithms that improve their performance at some task based on experience
The rise
Data
Big Data: continuous increase in the amount of data generated
Twitter: 50M tweets/day (= 7 terabytes)
Facebook: 10 terabytes/day
Youtube: 50 hours of video uploaded/minute
2.9 million e-mails/second

Computing Power
Moore's law
Massively distributed computing

The value-adding process
Interest: from product to customers.
Data Mining ≡ discovering patterns in large data sets
Historical perspective
Today's AI is Deep Learning (a Machine Learning technique)
https://blog.alore.io/machine-learning-and-artificial-intelligence/
Applications: sentiment analysis
Classify stored reviews according to users’ sentiment
Score(x) > 0: positive sentiment; Score(x) < 0: negative sentiment
Product recommendation
[Figure: the purchase history matrix of customers is factorized into user features and product features]
https://katbailey.github.io/images/matrix_factorization.png
[Figure: a purchased item and the resulting recommended products]
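The factorization idea can be sketched with a truncated SVD standing in for learned user/product factors. The rating matrix, its values, and the chosen rank are assumptions made for this illustration, not data from the slide.

```python
import numpy as np

# Hypothetical user-item rating matrix (0 = not rated); purely illustrative.
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

# Rank-2 truncated SVD as a simple stand-in for learned user/product factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # low-rank score matrix

# Recommend, for user 0, the unrated item with the highest predicted score.
unrated = np.where(R[0] == 0)[0]
best = int(unrated[np.argmax(R_hat[0, unrated])])
print(best)
```

In practice the factors would be learned by minimizing a loss over the observed entries only; the SVD shortcut here treats missing entries as zeros, which is acceptable only as a sketch.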
Image classification
Labeled training images Deep classification architecture
https://www.researchgate.net/profile/Y_Nikitin/publication/270163511/figure/download/fig5/AS:295194831409153@1447391340221/MPCNN-architecture-using-alternating-convolutional-and-max-pooling-layers-13.png
Input images and predicted category
https://adriancolyer.files.wordpress.com/2016/04/imagenet-fig4l.png?w=656
Medical diagnosis
Malignant melanoma detection: 130,000 images, including over 2,000 cases of cancer; error rate 28% (humans: 34%)
Digital Mammography DREAM Challenge: 640,000 mammograms (1,209 participants); false-positive rate decreased by 5%
Heart rhythm analysis: 500,000 ECGs; accuracy 92.6% (humans: 80.0%), sensitivity 97%
Object detection
https://www.mdpi.com/applsci/applsci-09-01128/article_deploy/html/images/applsci-09-01128-g004.pn
https://arxiv.org/pdf/1506.02640.pdf
Audio captioning
https://ars.els-cdn.com/content/image/1-s2.0-S1574954115000151-gr1.jpg
Chatbot
https://miro.medium.com/max/2058/1*LF5T9fsr4w2EqyFJkb-gng.png
Implementing a Machine Learning project
Real World: industries, insurance, Internet, social networks, medical, ...
Data Gathering: sensors, transactions, web-clicks, logs, mobility, open data, docs, ...
Data Strengthening: cleaning, organizing, aggregation, management of errors, ...
Data Engineering: exploration, representation, display, Machine Learning, Data Mining, ...
Business: interpretation, patterns of interest, strategic decisions, valuation, ...
Chain of the data engineering process
Data → Pre-processing → Train the model → Evaluate the model → Model

1 Understand and specify project goals
2 Pre-process/visualize/analyze the data
3 Which ML problem is it?
4 Design a solving approach
5 Evaluate its performance
6 Go to 2) if needed
Course Goal: Study the steps from 2 to 5
The data
Information (past experience) comes as examples described by attributes. Assume the data set consists of N samples.

Attributes
An attribute is a property or characteristic of the phenomenon being observed. It is also called a feature or a variable.

Sample
An entity characterizing an object; it is made up of attributes. Synonyms: instance, point, vector (usually in R^d)
Data: visualization

Points x \ Features   citric acid   residual sugar   chlorides   sulfur dioxide
 1                    0.00          1.9              0.076       11
 2                    0.00          2.6              0.098       25
 3                    0.04          2.3              0.092       15
 4                    0.56          1.9              0.075       17
 5                    0.00          1.9              0.076       11
 6                    0.00          1.8              0.075       13
 7                    0.06          1.6              0.069       15
 8                    0.02          2.0              0.073        9
 9                    0.36          2.8              0.071       17
10                    0.08          1.8              0.097       15

Each point x ∈ R^4.
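The table can be loaded as an N × d array to make the sample/attribute vocabulary concrete; the "mean of points" computed below is the quantity plotted in the original figure.

```python
import numpy as np

# The ten wine samples from the table above (citric acid, residual sugar,
# chlorides, sulfur dioxide); each row is one point x in R^4.
X = np.array([
    [0.00, 1.9, 0.076, 11],
    [0.00, 2.6, 0.098, 25],
    [0.04, 2.3, 0.092, 15],
    [0.56, 1.9, 0.075, 17],
    [0.00, 1.9, 0.076, 11],
    [0.00, 1.8, 0.075, 13],
    [0.06, 1.6, 0.069, 15],
    [0.02, 2.0, 0.073,  9],
    [0.36, 2.8, 0.071, 17],
    [0.08, 1.8, 0.097, 15],
])

print(X.shape)                 # N = 10 samples, d = 4 attributes
mean_point = X.mean(axis=0)    # the "mean of points" shown in the figure
print(mean_point)
```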
[Figure: 2D scatter of Variable 2 (Residual Sugar) vs. Variable 3 (Chlorides), and a 3D scatter adding Variable 4 (Sulfur); the points and their mean are shown]
Data types
Sensors → quantitative and qualitative variables (ordinal, nominal)
Text → strings
Speech → time series
Images → 2D data
Videos → 2D data + time
Networks → graphs
Streams → logs, coupons, ...
Labels → expected output prediction
Approaches of Machine Learning
Supervised learning
Principle
Given a set of N training examples {(xi, yi) ∈ X × Y, i = 1, · · · , N}, we want to estimate a prediction function y = f(x).
The supervision comes from the label knowledge
Examples: image classification, object detection, stock price prediction, ...
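A minimal sketch of the principle, assuming a synthetic dataset (pairs with y ≈ 2x + 1) and a hypothesis class restricted to affine functions:

```python
import numpy as np

# Toy supervised data: x_i in R, y_i = 2*x_i + 1 plus a little noise
# (the generating process is an assumption made for this sketch).
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2 * x + 1 + 0.01 * rng.normal(size=50)

# Hypothesis class H = affine functions f(x) = w*x + b, fitted by least squares.
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(w, b)   # close to the true parameters (2, 1)
```

The "supervision" is exactly the availability of the labels y; without them the least-squares fit above would have nothing to minimize.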
Unsupervised learning
Principle
Only the inputs {xi ∈ X, i = 1, · · · , N} are available. We aim to describe how the data is organized and to extract homogeneous subsets from it.
Examples: customer segmentation, image segmentation, data visualization, categorization of similar documents, ...
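As an illustration, a bare-bones k-means (Lloyd's algorithm) recovering two homogeneous subsets from unlabeled points; the synthetic data and the crude initialization are assumptions of this sketch:

```python
import numpy as np

# Two well-separated groups of unlabeled 2D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),     # cluster around (0, 0)
               rng.normal(5, 0.3, (30, 2))])    # cluster around (5, 5)

k = 2
centers = X[[0, -1]].copy()          # crude init: one point from each end
for _ in range(10):
    # assign each point to its nearest center, then recompute the centers
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    centers = np.array([X[labels == j].mean(0) for j in range(k)])

print(centers)   # close to the two group means
```

No label ever enters the loop: the structure (two clusters) is inferred from the inputs alone, which is what distinguishes this from the supervised setting above.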
[Figure: 2D embedding of handwritten digit images; each point is drawn as its class label 0-9, and the ten classes form distinct clusters]
Supervised learning : concept
Let X and Y be two sets. Assume p(X, Y) is the joint probability distribution of (X, Y) ∈ X × Y.

Goal: find a prediction function f : X → Y which correctly estimates the output y corresponding to an input x.

f belongs to a space H called the hypothesis class. Example of H: the set of polynomial functions.
Supervised learning: principle
Loss function L(Y, f(X))
evaluates how "close" the prediction f(x) is to the true label y
it penalizes errors: L(y, f(x)) = 0 if y = f(x), and L(y, f(x)) ≥ 0 if y ≠ f(x)
True risk (expected prediction error)

R(f) = E_{(X,Y)}[L(Y, f(X))] = ∫_{X×Y} L(y, f(x)) p(x, y) dx dy

Objective
Identify the prediction function which minimizes the true risk, i.e.

f⋆ = arg min_{f∈H} R(f)
Loss functions
Quadratic loss: L(Y, f(X)) = (Y − f(X))²
ℓ1 loss (absolute deviation): L(Y, f(X)) = |Y − f(X)|
0-1 loss: L(y, f(x)) = 0 if y = f(x), 1 otherwise
Hinge loss: L(y, f(x)) = max(0, 1 − y f(x))
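The four losses, written as vectorized functions (labels in {−1, +1} are assumed for the hinge loss):

```python
import numpy as np

# The four losses above, vectorized over samples.
def quadratic(y, fx):  return (y - fx) ** 2
def l1(y, fx):         return np.abs(y - fx)
def zero_one(y, fx):   return (y != fx).astype(float)
def hinge(y, fx):      return np.maximum(0.0, 1.0 - y * fx)   # y in {-1, +1}

y  = np.array([ 1., -1.,  1.])
fx = np.array([ 0.8, 0.3, -2.0])
print(hinge(y, fx))   # [0.2, 1.3, 3.0]
```

Note the asymmetry of roles: the quadratic and ℓ1 losses compare values (regression), while the 0-1 and hinge losses compare decisions (classification), the hinge loss also rewarding a confident margin y·f(x) ≥ 1.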
Some supervised learning problems
Regression
We talk about regression when Y is a subset of R^d.
Usual related loss function: the quadratic loss (y − f(x))²
[Figure: Support Vector Machine regression fit of y against x]
Some supervised learning problems
Classification
The output space Y is an unordered discrete set.
Binary classification: card(Y) = 2. Example: Y = {−1, 1}. Loss functions: 0-1 loss, hinge loss.
Multiclass classification: card(Y) > 2. Example: Y = {1, 2, · · · , K}
From true risk to empirical risk
Minimizing the true risk is not doable in almost all practical applications:
The joint distribution p(X, Y) is unknown!
Only a finite training set {(xi, yi) ∈ X × Y}, i = 1, · · · , N is available
Empirical risk

Remp(f) = (1/N) ∑_{i=1}^{N} L(yi, f(xi))

Empirical risk minimization

f̂ = arg min_{f∈H} Remp(f)
Overfitting
Overfitting
The empirical risk alone is not appropriate for model selection: if H is large enough, Remp(f) → 0 while the generalization error (true risk) remains high.
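The phenomenon can be reproduced numerically: fitting polynomials of growing degree to a few noisy points (synthetic data, assumed for this sketch), the training error keeps dropping while the error on fresh test data typically does not:

```python
import numpy as np

# Fit polynomials of increasing degree to a few noisy training points and
# compare the empirical (training) error with the error on fresh test data.
rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * x)
x_tr = rng.uniform(0, 3, 15);  y_tr = f_true(x_tr) + 0.2 * rng.normal(size=15)
x_te = rng.uniform(0, 3, 200); y_te = f_true(x_te) + 0.2 * rng.normal(size=200)

errs_tr, errs_te = {}, {}
for deg in (1, 3, 12):
    coeffs = np.polyfit(x_tr, y_tr, deg)
    errs_tr[deg] = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    errs_te[deg] = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(deg, errs_tr[deg], errs_te[deg])
```

The degree plays the role of model complexity: the training curve is monotonically non-increasing in the degree, as the figure's "training set" curve shows, while the test error reflects the true risk.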
[Figure: prediction error versus model complexity (from low to high); the training-set error keeps decreasing while the test-set error eventually increases]
Illustration of overfitting
[Figure: empirical risk as a function of decreasing values of k; training (App), validation (Val), and test (Test) error curves]
https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture6_overfitting.pdf
Model selection
Find in H the best function f that, learned from the training set, will generalize well (low true risk).

Example: we look for a polynomial function fα of degree α minimizing the empirical risk Remp(fα) = ∑_{i=1}^{N} (yi − fα(xi))².

Goal:
1 propose a model estimation method in order to choose (approximately) the best model belonging to H
2 once the model is selected, estimate its generalization error
Model selection: basic approach

Case 1: N is really big (large-scale DN)
[Figure: available data randomly split into training {X_train, Y_train}, validation {X_val, Y_val}, and test {X_test, Y_test} sets]
1 Randomly split DN = Dtrain ∪ Dval ∪ Dtest
2 For each α, train fα on Dtrain
3 Evaluate its performance on Dval: Rval = (1/Nval) ∑_{i∈Dval} L(yi, fα(xi))
4 Select the model with the best performance on Dval
5 Test the selected model on Dtest
Note: Dtest is used only once!
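A sketch of the five steps on synthetic data, with the polynomial degree playing the role of α; the data-generating process and the candidate degrees are assumptions of this illustration:

```python
import numpy as np

# Synthetic data standing in for DN; the generating process is an assumption.
rng = np.random.default_rng(1)
N = 3000
x = rng.uniform(0, 3, N)
y = np.sin(2 * x) + 0.2 * rng.normal(size=N)

# 1) random split DN = Dtrain ∪ Dval ∪ Dtest
idx = rng.permutation(N)
tr, va, te = idx[:2000], idx[2000:2500], idx[2500:]

def mse(i, coeffs):
    return float(np.mean((np.polyval(coeffs, x[i]) - y[i]) ** 2))

# 2) for each alpha (here: polynomial degree), train on Dtrain
models = {d: np.polyfit(x[tr], y[tr], d) for d in (1, 2, 3, 5, 9)}

# 3-4) evaluate each candidate on Dval and keep the best one
best = min(models, key=lambda d: mse(va, models[d]))

# 5) estimate the generalization error on Dtest, touched exactly once
print(best, mse(te, models[best]))
```

Keeping Dtest out of the selection loop is what makes its error an honest estimate: as soon as it influences any choice, it stops measuring generalization.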
Model selection : Cross-validation
Case 2 : Small or medium scale DN
Estimate the generalization error by re-sampling.
Principle
1 Split DN into K sets of equal size.
2 For each k = 1, · · · , K, train a model using the K − 1 remaining sets and evaluate it on the k-th part.
3 Average the K error estimates to obtain the cross-validation error.
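The three steps can be sketched as follows, assuming synthetic data and a fixed model family (degree-3 polynomial):

```python
import numpy as np

# K-fold cross-validation error for one model family (degree-3 polynomial).
rng = np.random.default_rng(0)
N, K = 200, 5
x = rng.uniform(0, 3, N)
y = np.sin(2 * x) + 0.2 * rng.normal(size=N)

folds = np.array_split(rng.permutation(N), K)   # K parts of (almost) equal size
errs = []
for k in range(K):
    val = folds[k]                                          # held-out part
    trn = np.concatenate([folds[j] for j in range(K) if j != k])
    coeffs = np.polyfit(x[trn], y[trn], 3)
    errs.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))

cv_error = float(np.mean(errs))   # average of the K fold errors
print(cv_error)
```

Every sample serves once for validation and K − 1 times for training, which is what makes cross-validation attractive when DN is too small to sacrifice a large held-out set.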
[Figure: K-fold cross-validation; in each of the K rounds, one block serves as validation/test and the remaining blocks are used for training]
Conclusions
To successfully carry out an automatic data processing project:
Clearly identify and spell out the needs
Create or obtain data representative of the problem
Identify the context of learning
Analyze and reduce data size
Choose an algorithm and/or a space of hypotheses
Choose a model by applying the algorithm to pre-processed data
Validate the performance of the method
In the end ...
Find your way. . .