
Page 1:

MACHINE LEARNING – 2013

MACHINE LEARNING

Introduction

Lecturer: Prof. Aude Billard ([email protected])

Assistants: Dr. Basilio Noris, Nicolas Sommer

([email protected]; [email protected])

Page 2:

Practicalities

Weeks alternate between:

• Lectures: 9h15-11h00 + Exercises: 11h15-13h00

(in room MEB331)

• Practicals 9h15-13h00 (in room GRC02)

Page 3:

Class Timetable

http://lasa.epfl.ch/teaching/lectures/ML_Phd/index.php

Page 4:

Practicalities

Website of the class:

http://lasa.epfl.ch/teaching/lectures/ML_Phd

Lecture Notes

Machine Learning Techniques

Available at the Librairie Polytechnique

Course covers selected chapters of the lecture notes, see website.

Page 5:

Grading

50% of the grade based on personal work. Choice between:

1. Mini-project implementing an algorithm and evaluating its performance and sensitivity to parameter choices (to be done individually).

OR

2. A literature survey on a topic chosen from a list provided in class (can be done in teams of two).

~25-30 hours of personal work, i.e. about one week of work.

50% based on the final oral exam:

20 minutes of preparation

20 minutes answering at the blackboard

(closed book, but you may bring one double-sided A4 page of personal notes)

Page 6:

Prerequisites

Linear Algebra, Probabilities and Statistics

A basic background in ML is an advantage (otherwise, catch up with the lecture notes).

Lecture notes are on the website http://lasa.epfl.ch/teaching/lectures/ML_Phd/

Page 7:

Syllabus

Compulsory reading of background chapters before class!

Page 8:

Today’s class format

• Examples of ML applications

• Taxonomy and basic concepts of ML

• Brief recap of basic maths for the class

• Overview of practicals

Page 9:

What is Machine Learning to you?

What do you think it is used for?

Why are you taking this class?!

Page 10:

Machine Learning, a definition

Machine Learning is the field of scientific study that concentrates on

induction algorithms and on other algorithms that can be said to “learn.” — Machine Learning Journal, Kluwer Academic

Machine Learning is an area of artificial intelligence involving

developing techniques to allow computers to “learn”. More specifically,

machine learning is a method for creating computer programs by the

analysis of data sets, rather than the intuition of engineers. Machine

learning overlaps heavily with statistics, since both fields study the

analysis of data. Webster Dictionary

Machine learning is a branch of statistics and computer science, which

studies algorithms and architectures that learn from data sets.

WordIQ

Page 11:

What is Machine Learning?

Machine Learning encompasses a large set of algorithms that aim at inferring information that is hidden in the data.

A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, "On separation of semitransparent dynamic images from static background", Proc. Intl. Conf. on Independent Component Analysis

and Blind Signal Separation, pp. 934-940, 2006.

Page 12:

What is Machine Learning?

Recognizing human speech.

Here is the waveform produced when uttering the word “alright”.

The strength of ML algorithms is that they apply to arbitrary sets of data: they can recognize patterns in data from various sources.

Page 13:

What is Machine Learning?

[Figure: a piano note (left); the same note played by an oboe (right)]

Page 14:

What is Machine Learning?

What is sometimes impossible for humans to see is easy for ML to pick up.

Demo Eyes-No-Gaze

Demo Eyes-With-Gazes

Page 15:

What is Machine Learning?

ML algorithms make inferences by analyzing a set of signals or datapoints.

Demo PCA

[Figure: Support Vector Regression mapping eye-region features (wrinkles, eyelids and eyelashes) to gaze estimates]

Noris et al, 2011, Computer Vision and Image Understanding.

Page 16:

What is Machine Learning?

Conversely, things that seem evident to humans may require more than one ML tool, plus some intuition for encoding the data.

There is an ambiguity: the two sets of images differ in both orientation and color. Orientation is spurious information arising from a poor choice of training data; color is the feature we are trying to teach the algorithm.

Page 17:

What is Machine Learning?

A good training set must provide enough information for the algorithm to make proper inferences. Here, one must provide images of the two pens in the same set of orientations.


Page 18:

Learning versus Memorization

Learning implies generalizing.

Generalizing consists of extracting key features from the data, matching

those across data (to find resemblances) and storing a generalized

representation of the data features that accounts best (according to a

given metric) for all the small differences across data. Classification and

clustering techniques are examples of methods that generalize by

categorizing the data.

Generalizing is the opposite of memorizing, and one often must find a tradeoff between over-generalizing (losing information about the data) and over-fitting (keeping more information than required).

Generalization is particularly important for reducing the influence of noise introduced by variability in the data.

Page 19:

Taxonomy in ML

• Supervised learning – the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.

• Reinforcement learning – the algorithm learns a policy or model of the transitions across a discrete set of input-output states (Markovian world) in order to maximize a reward value (external reinforcement).

• Unsupervised learning – the algorithm learns a model that best represents a set of inputs without any feedback (no desired output, no external reinforcement).

• Learning to learn – the algorithm learns its own inductive bias based on previous experience.

Page 20:

Examples of ML Applications

Page 21:

Structure Discovery

Raw Data

Trying to find some structure in the data…

Page 22:

Structure Discovery: example

Methods for spectral analysis, such as linear/kernel PCA, CCA and ICA, aim at finding hidden structure in the data.

[Figure: linear PCA vs. kernel PCA projections and reconstructions]

Projection of handwritten digits: kernel PCA projections capture some of the texture better and are less sensitive to noise than linear PCA, which boosts reconstruction and recognition of the digits (Mika et al., NIPS 2000).
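The computation behind such projections can be sketched in a few lines. Below is a minimal linear-PCA example on synthetic data (not the digits from the slide); kernel PCA is analogous, but solves the eigenproblem on a centered kernel matrix instead of the covariance matrix.

```python
import numpy as np

# Minimal linear PCA sketch on a synthetic anisotropic cloud.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)          # center the data
C = Xc.T @ Xc / len(Xc)          # sample covariance matrix
vals, vecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]   # sort components by decreasing variance
components = vecs[:, order]

Y = Xc @ components[:, :2]       # project onto the top-2 principal axes
print(Y.shape)
```

The projection keeps the two directions of largest variance; kernel PCA would replace `C` by the centered Gram matrix of a chosen kernel.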

Page 23:

Structure Discovery: example

Methods for spectral analysis, such as linear/kernel PCA, CCA and ICA, aim at finding hidden structure in the data.

Person identification task:

Top row: Query image and 10 candidates in the gallery set.

Bottom row: projections of the query image onto the pre-learned (through kernel PCA)

appearance manifold of the 10 candidates.

Yang et al, Person Reidentification by Kernel PCA Based Appearance Learning,

Canadian Conf. on Computer and Robot Vision (2011)

Page 24:

Structure Discovery

Spectral analysis proceeds by either projecting or lifting the data into a lower- or higher-dimensional space, respectively.

In each projection, groups of datapoints appear more similar than in the original space.

Looking at each projection separately allows one to determine which features each group of datapoints shares.

[Figure: mapping x ↦ F(x) from input space to feature space, with projections F1(x), … in feature space]

This can be used in different ways:
- To discard outliers, by keeping only the datapoints that have most features in common.
- To group datapoints according to shared features.
- To rank features according to how frequently they appear.

Page 25:

Structure Discovery

In this class, we will briefly review some of the key algorithms for spectral analysis, including:

- Kernel PCA, whose non-linear projections find wide application across a variety of domains;
- Kernel CCA (Canonical Correlation Analysis), a generalization of kernel PCA to comparisons across domains, e.g. combining visual and auditory information;
- Kernel ICA, which attempts to solve more complex blind source separation problems using non-linear projections.

Page 26:

Clustering

Clustering encompasses a large set of methods that try to find groups of patterns that are similar in some way.

Hierarchical clustering builds a tree-like structure by pairing datapoints according to increasing levels of similarity.
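The pairing idea above can be sketched with a toy agglomerative (single-linkage) procedure on 1-D points: repeatedly merge the two closest clusters until the desired number of clusters remains, which implicitly traces out the tree.

```python
# Toy agglomerative (single-linkage) clustering sketch on 1-D points.
def single_linkage(points, k):
    clusters = [[p] for p in points]        # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)      # merge the closest pair of clusters
    return clusters

print(single_linkage([1.0, 1.2, 1.1, 8.0, 8.3, 15.0], 3))
```

Recording the merge order (rather than stopping at k clusters) would yield the full dendrogram.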

Page 27:

Clustering: example

Hierarchical clustering can be used with arbitrary sets of data.

Example:

Hierarchical clustering to discover similar temporal patterns of crimes across districts in India.

Chandra et al, “A Multivariate Time

Series Clustering Approach for Crime

Trends Prediction”, IEEE SMC 2008.

Page 28:

Clustering: example

Clustering is used in computer vision for pre-processing and post-processing

of images

Multispectral medical image segmentation. Left: MRI image from one channel; right: classification from 9-cluster semi-supervised learning. Clusters should identify patterns such as cerebro-spinal fluid, white matter, striated muscle, and tumor (Lundervold et al., 1996).

Page 29:

Clustering: example

Clustering assumes that groups of points are somewhat similar according to the same metric of similarity.

All current clustering techniques fail at clustering the above seven groups of points.

Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters

Page 30:

Clustering

Different techniques or heuristics can be developed to help the

algorithm determine the right boundaries across clusters:

Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters

Page 31:

Clustering

We will point out some emerging and useful research directions that tackle key issues in designing clustering algorithms:

• semi-supervised clustering,

• ensemble clustering,

• simultaneous feature selection during data clustering,

• large scale data clustering.

In this class, we will briefly review some algorithms for spectral

clustering, starting with K-means and moving to advanced methods

such as Kernel K-means.
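K-means (Lloyd's algorithm) alternates between assigning points to their nearest center and recomputing centers as cluster means; kernel K-means follows the same alternation but measures distances in a kernel-induced feature space. A minimal sketch on toy data:

```python
import numpy as np

# Minimal K-means (Lloyd's algorithm) sketch with a fixed iteration budget.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(iters):
        # assign each point to its nearest center
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # recompute each center as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0., 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans(X, 2)
print(labels)
```

A production version would also handle empty clusters and stop on convergence rather than after a fixed number of iterations.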

Page 32:

Classification

Classification is a supervised clustering process.

Classification is usually multi-class; Given a set of known classes, the

algorithm learns to extract combinations of data features that best predict

the true class of the data.

[Figure: original data (left); after 4-class classification using SVM (right)]
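Training an SVM requires a quadratic-programming solver, so as a much simpler stand-in this sketch uses a perceptron (not an SVM) to show the supervised idea: learn a linear decision boundary between two separable classes from labeled examples.

```python
# Perceptron sketch: learn weights w and bias b for a linear boundary.
def perceptron(data, labels, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(data, labels):          # labels y in {-1, +1}
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified point
                w[0] += y * x1                         # nudge boundary toward it
                w[1] += y * x2
                b += y
    return w, b

data = [(0, 0), (1, 0), (0, 1), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = perceptron(data, labels)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in data]
print(preds)
```

Unlike an SVM, the perceptron returns some separating boundary, not the max-margin one; the multi-class case is typically handled by one-vs-rest combinations of such binary classifiers.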

Page 33:

Classification: example

Classification of finance data to assess solvability using Support

Vector Machine (SVM).

Swiderski et al., “Multistage classification by using logistic regression and neural networks for assessment of financial condition of company”, Decision Support Systems, 2012.

[Figure: 5 classes of insolvency risk — excellent, good, satisfactory, passable, poor]

Page 34:

Classification: issues

A recurrent problem when applying classification to real-life problems is that classes are often very unbalanced.

This can drastically affect classification performance, as classes with many datapoints have more influence on the error measure during training.

In this class, you will get the chance to practice this on real datasets during the computer-based practical sessions, and to discuss what one must do when datasets are unbalanced across classes.

Data from Swiderski et al. have more positive

examples than negative examples

Page 35:

Regression

Regression is a supervised machine learning technique.

Non-linear regression techniques, such as Support Vector

Regression and Gaussian Process Regression, model the non-linear

relationships across the data.

Predict $y$ given input $x$ through a non-linear function $f$: $y = f(x)$.

Estimate the $\hat{f}$ that best predicts the set of training points $\{(x_i, y_i)\}$, $i = 1, \dots, M$.

[Figure: training points $(x_1, y_1), \dots, (x_4, y_4)$ and the fitted curve $y = f(x)$]
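As a minimal illustration of estimating f from training pairs, here is Nadaraya-Watson kernel regression — not SVR or GPR, but the same setting: predict y for a new x from training points (x_i, y_i).

```python
import math

# Nadaraya-Watson kernel regression: predict y at x_query as a
# kernel-weighted average of the training outputs.
def kernel_regress(x_query, xs, ys, bandwidth=0.5):
    weights = [math.exp(-((x_query - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.sin(x) for x in xs]     # noiseless samples of f(x) = sin(x)
print(kernel_regress(1.0, xs, ys))
```

The bandwidth plays the role of a hyperparameter controlling smoothness, much like the kernel width in SVR or GPR.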

Page 36:

Regression: example

SVR for predicting the cumulative log return over a period of 2500 days. The study contrasted two methods for automatically determining the optimal features (i.e. moving averages).

Wang & Zhu, “Financial market forecasting using a two-step kernel learning method for the support vector regression”, Annals of Operations Research, 2010.

They found that short-term (daily and weekly) trends had a bigger impact than long-term (monthly and quarterly) trends in predicting the next-day return.

Page 37:

Regression: example

SVR for predicting the optimal position and orientation of the golf

club to hit the ball in a golf experiment.

Kronander, Khansari and Billard, JTSC award, IEEE Int. Conf. on Int. and Rob. Systems 2011.

The predictions of two methods (Gaussian Process Regression and Gaussian Mixture Regression) are contrasted in terms of precision and generalization.

[Figure: GPR vs. GMR predictions]

Page 38:

Regression

Machine learning techniques for non-linear regression are model-free: they estimate both the function and its parameters (through a density-based estimate of the data distribution).

In this class, we will:

- Compare three of the major non-linear regression techniques;
- Show their similarities (same mathematical framework);
- Discuss their differences (parameter estimation, objective function);
- Determine which technique is best suited when.

Page 39:

Machine Learning in Practice

The choice of dataset for training the algorithm is crucial and strongly biases performance.

Page 40:

Regression minimizing the Mean Square Error: fit $Y = aX + b$ by minimizing

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y(x_i) - \hat{y}(x_i)\right)^2$$

[Figure: estimating $y$ from sampling the datapoints]
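The least-squares fit of Y = aX + b and its MSE can be computed directly; a sketch on synthetic noisy samples of a line:

```python
import numpy as np

# Fit Y = aX + b by minimizing the mean square error over sampled points.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=50)   # noisy line, true a=2, b=1

A = np.column_stack([x, np.ones_like(x)])            # design matrix [x, 1]
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)       # least-squares solution

mse = np.mean((y - (a * x + b)) ** 2)
print(a, b, mse)
```

With a different sample of datapoints, the estimated (a, b) would change slightly — which is exactly why the next slide argues the choice of training set matters.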

Page 41:

Regression minimizing the Mean Square Error: fit $Y = aX + b$ by minimizing

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y(x_i) - \hat{y}(x_i)\right)^2$$

Estimating $y$ from sampling the datapoints: different samples yield different fits.

The choice of training data (training set) is crucial → Crossvalidation

Page 42:

ML in Practice: Training and Evaluation

The best practice for assessing the validity of a machine learning algorithm is to measure its performance on training, validation and testing sets. These sets are built by partitioning the data set at hand.

[Diagram: the data set partitioned into Training Set | Validation Set | Testing Set, with crossvalidation]

Training and validation sets are used to

determine the sensitivity of the learning to the

choice of hyperparameters (i.e. parameters not

learned during training). Values for the

hyperparameters are set through a grid search.

Once the optimal hyperparameters have been

picked, the model is trained with complete

training + validation set and tested on the testing

set.

In practice, one often uses solely training and

testing sets and performs crossvalidation directly

on these.
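The protocol above — grid search on the validation set, refit on training + validation, a single evaluation on the test set — can be sketched with a polynomial-degree hyperparameter on a hypothetical toy problem (not one of the course methods):

```python
import numpy as np

# Train/validation/test sketch: choose a hyperparameter (polynomial degree)
# on the validation set, refit on train+validation, report test error once.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 120)
y = x ** 3 - x + rng.normal(scale=0.05, size=120)    # noisy cubic

idx = rng.permutation(120)
train, val, test = idx[:60], idx[60:90], idx[90:]    # 50/25/25 split

def fit_err(deg, fit_idx, eval_idx):
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], deg)
    return np.mean((np.polyval(coeffs, x[eval_idx]) - y[eval_idx]) ** 2)

# grid search over the hyperparameter using the validation set
best_deg = min(range(1, 8), key=lambda d: fit_err(d, train, val))
test_mse = fit_err(best_deg, np.concatenate([train, val]), test)
print(best_deg, test_mse)
```

The test set is touched exactly once, so `test_mse` is an unbiased estimate of generalization error for the selected model.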


Page 43:

Choice of training / testing ratio

Avoid overfitting: train the classifier with a small sample of all datapoints and test it with the remaining datapoints.

A typical choice of training/testing ratio is 2/3 training, 1/3 testing. The smaller the ratio, the more robust the classification.

N-fold crossvalidation

Repeats the procedure N times, randomly picking points from the dataset to create the training set.

A typical choice is 10-fold crossvalidation, although this should depend on the number of datapoints you have!
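A minimal sketch of building the N folds (in practice one shuffles the indices first, matching the random picking described above; shuffling is omitted here so the fold structure is easy to see):

```python
# Minimal N-fold crossvalidation sketch: split indices into N folds,
# train on N-1 folds, test on the held-out fold, average the scores.
def kfold_indices(n_points, n_folds):
    folds = []
    fold_size = n_points // n_folds
    for f in range(n_folds):
        test = list(range(f * fold_size, (f + 1) * fold_size))
        train = [i for i in range(n_points) if i not in test]
        folds.append((train, test))
    return folds

folds = kfold_indices(100, 10)
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every point appears in exactly one test fold, so averaging the N test scores uses all the data for evaluation without ever testing on a training point.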


Page 44:

Performance measures in ML

[Figure: learning curve — performance vs. time]

How long can it take before an acceptable level of performance is achieved? What would be an optimal learning curve? When is “good enough” achieved?

Progress in a machine’s performance must be measurable and significant. A machine must eventually reach a minimal level of performance (“good enough”) within an acceptable time frame.

Page 45:

Performance measures in ML

These vary and depend entirely on the algorithm and the function

you wish to optimize.

Performance measures for supervised learning algorithms are well defined and relate directly to a distance measure between the desired output and the estimated one.

In classification, people tend to report performance as the percentage of items correctly classified. This can be very misleading if instances of each class are not well balanced, or if classification performance varies a lot across classes. The old-fashioned measures of mean, median and standard deviation remain very reliable measures of performance.
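A toy illustration (with hypothetical labels) of why raw accuracy misleads on unbalanced classes: a degenerate classifier that always predicts the majority class scores 90% accuracy, but only 50% balanced accuracy (the mean of per-class recalls).

```python
# Hypothetical unbalanced dataset: 90 positives, 10 negatives.
truth = [1] * 90 + [0] * 10
preds = [1] * 100                 # degenerate "always predict positive" model

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)

recalls = []
for c in (0, 1):
    idx = [i for i, t in enumerate(truth) if t == c]
    recalls.append(sum(preds[i] == c for i in idx) / len(idx))
balanced = sum(recalls) / len(recalls)

print(accuracy, balanced)   # 0.9 0.5
```

Reporting per-class recalls (or their mean) exposes that the minority class is never predicted correctly.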

Page 46:

Performance measures in ML

Performance often depends on choosing parameters well, e.g. the threshold in classification (here, in naïve Bayes classification).

Bayes rule for binary classification: $x$ has class label $+1$ if $P(y = 1 \mid x) \geq$ threshold; else $x$ has class label $-1$.

[Figure: $P(y = 1 \mid x)$ plotted against $x$, with the decision threshold]
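The effect of the threshold can be seen by sweeping it over toy (hypothetical) posterior values P(y=1|x): each choice trades false positives against false negatives.

```python
# Toy posteriors P(y=1|x) and hypothetical true labels in {-1, +1}.
posteriors = [0.1, 0.4, 0.45, 0.6, 0.8, 0.9]
truth = [-1, -1, 1, 1, -1, 1]

def predict(threshold):
    # label +1 when the posterior reaches the threshold, else -1
    return [1 if p >= threshold else -1 for p in posteriors]

for t in (0.3, 0.5, 0.7):
    preds = predict(t)
    acc = sum(p == y for p, y in zip(preds, truth)) / len(truth)
    print(t, acc)
```

Choosing the threshold on a validation set (rather than defaulting to 0.5) is one instance of the hyperparameter tuning discussed earlier.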

Page 47:

Performance measures in ML: Ground truth

1/ Comparing the performance of a novel method to existing ones or to trivial baselines is crucial. Try to have the “ground truth”; this often means hand-coded solutions (assuming humans outperform the machine).

2/ Using the testing set, even in a way that seems reasonable, is always dangerous: it is extremely hard to predict how much it artificially improves the estimated performance.

Page 48:

Some Machine Learning Resources

http://www.machinelearning.org/index.html

• http://www.pascal-network.org/ Network of excellence on Pattern Recognition, Statistical Modelling

and Computational Learning (summer schools and workshops)

Databases:

• http://expdb.cs.kuleuven.be/expdb/index.php
• http://archive.ics.uci.edu/ml/

Journals:

• Machine Learning Journal, Kluwer Publisher

• IEEE Transactions on Signal processing

• IEEE Transactions on Pattern Analysis

• IEEE Transactions on Pattern Recognition

• The Journal of Machine Learning Research

Conferences:

• ICML: int. conf. on machine learning

• NIPS: Neural Information Processing Systems conference – on-line repository of all research papers, www.nips.org

Page 49:

Topics for Literature survey and Mini-Projects

Topics for the literature survey will include:

- Survey of clustering methods applied to finances

- Survey of classification methods applied to biometric data

The exact list of topics for lit. survey and mini-project

will be posted by March 8

Topics for the mini-project will involve implementing one of the following:

- Clustering techniques (DBSCAN, FLAME, KMEANS++)

- Regression (Gradient Boosting, Locally Weighted Regression)

The exact list of topics for lit. survey and mini-project

will be posted in the second week of March

Page 50:

Overview of Practicals

Page 51:

Brief recap of basic maths for the class

Page 52:

Math Background Needed for this Class

Probability, Statistics: covariance, pdf, ….

Linear Algebra: formal notation, matrix inversion, …

Derivatives: partial derivatives, gradient, Jacobian, …

Optimization: normalized/weighted MSE,

Lagrange multipliers, …

Page 53:

Discrete Probabilities

Consider two variables x and y taking discrete values over the intervals [x_1, …, x_M] and [y_1, …, y_N] respectively.

$P(x = x_i)$: the probability that the variable $x$ takes value $x_i$, with $0 \leq P(x = x_i) \leq 1$, $i = 1, \dots, M$, and $\sum_{i=1}^{M} P(x = x_i) = 1$.

Idem for $P(y = y_j)$, $j = 1, \dots, N$.

Page 54:

Discrete Probabilities

The joint probability that the two events A (variable x takes value x_i) and B (variable y takes value y_j) occur is expressed as:

$$P(A \cap B) = P(x = x_i, y = y_j)$$

$P(A \mid B)$ is the conditional probability that event A will take place if event B already took place:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Page 55:

Discrete Probabilities

The so-called marginal probability that variable x will take value x_i is given by:

$$P_x(x_i) := \sum_{j=1}^{N} P(x = x_i, y = y_j)$$
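The joint, conditional, marginal and Bayes identities can be checked on a toy joint table with hypothetical probabilities:

```python
# Toy joint distribution P(x, y) over two values of x and two of y.
P = {('x1', 'y1'): 0.2, ('x1', 'y2'): 0.1,
     ('x2', 'y1'): 0.4, ('x2', 'y2'): 0.3}

P_x1 = sum(p for (x, _), p in P.items() if x == 'x1')   # marginal P(x=x1)
P_y1 = sum(p for (_, y), p in P.items() if y == 'y1')   # marginal P(y=y1)

cond = P[('x1', 'y1')] / P_y1                           # P(x1 | y1) by definition
bayes = (P[('x1', 'y1')] / P_x1) * P_x1 / P_y1          # P(y1|x1) P(x1) / P(y1)

print(P_x1, cond, bayes)
```

The last line verifies Bayes' theorem numerically: both routes give the same value for P(x1 | y1).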

Page 56:

Probability Distributions, Density Functions

p(x), a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of variable x:

$$p(x) \geq 0 \;\; \forall x, \qquad \int p(x)\, dx = 1$$

The pdf is not bounded by 1: it can grow unbounded, depending on the value taken by x.

[Figure: an example pdf p(x) over x]

Page 57:

Probability Distributions, Density Functions

The probability that the variable x takes a value in the subinterval [a, b] is given by:

    P(a ≤ x ≤ b) := D(a ≤ x ≤ b) = ∫_a^b p(x) dx

The cumulative distribution function (or simply distribution function) of x is:

    D(x) = ∫_{-∞}^{x} p(x') dx'

p(x) dx ~ probability of x to fall within an infinitesimal interval [x, x + dx]

    p(x) = d/dx D(x)

[Figure: the distribution function D(x) and the density p(x) plotted against x]
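These two relations can be verified numerically for the standard normal distribution, whose cumulative distribution has a closed form via the error function. This is a sketch for illustration, not part of the lecture material:

```python
import math

def p(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def D(x):
    # standard normal cumulative distribution, via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# P(a <= x <= b) = D(b) - D(a); for [-1, 1] this is ≈ 0.6827 (one std dev)
a, b = -1.0, 1.0
prob = D(b) - D(a)

# p(x) = d/dx D(x): check with a central finite difference at x = 0.5
h = 1e-6
deriv = (D(0.5 + h) - D(0.5 - h)) / (2 * h)
print(prob, deriv, p(0.5))
```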


Parametric PDF

The Gaussian function is entirely determined by its mean and variance.

For this reason, it is often referred to as a parametric distribution.

For pdfs other than the Gaussian, the variance still represents a notion of dispersion
around the expected value.


Expectation

The expectation of the probability P(x) (in the discrete case) and of the pdf p(x) (in the
continuous case), also called the expected value or mean, is the average value weighted
by P(x) or p(x):

    When x takes discrete values:   E[x] = Σ_i x_i P(x_i)
    For continuous distributions:   E[x] = ∫ x p(x) dx
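Both forms can be evaluated directly; the sketch below (with a made-up discrete distribution and an arbitrary Gaussian) is for illustration only:

```python
import math

# Discrete: E[x] = sum_i x_i P(x_i) for a hypothetical 3-valued variable
values = [0.0, 1.0, 2.0]
probs  = [0.2, 0.5, 0.3]
E_discrete = sum(x * p for x, p in zip(values, probs))  # = 1.1

# Continuous: E[x] = ∫ x p(x) dx, approximated by a Riemann sum
# for a Gaussian with mean mu = 2 and sigma = 0.5
mu, sigma = 2.0, 0.5

def p(x):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

dx = 1e-3
E_continuous = sum((mu - 5 + k * dx) * p(mu - 5 + k * dx) * dx for k in range(10000))
print(E_discrete, E_continuous)  # ≈ 1.1 and ≈ 2.0 (the mean comes back out)
```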


Variance

σ², the variance of a distribution, measures the amount of spread of the
distribution around its mean:

    Var(x) = σ² = E[(x − E[x])²] = E[x²] − (E[x])²

σ is the standard deviation of x.
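The equality of the two variance expressions can be checked on the same hypothetical discrete distribution used above (values made up for illustration):

```python
# Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2, checked term by term
values = [0.0, 1.0, 2.0]
probs  = [0.2, 0.5, 0.3]

mean    = sum(x * p for x, p in zip(values, probs))                  # E[x]
var_def = sum((x - mean) ** 2 * p for x, p in zip(values, probs))    # definition
var_alt = sum(x * x * p for x, p in zip(values, probs)) - mean ** 2  # shortcut form
std     = var_def ** 0.5                                             # standard deviation
print(var_def, var_alt)  # both 0.49: the two forms agree
```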


Mean and variance in PDF

For pdfs other than the Gaussian distribution, the variance represents a
notion of dispersion around the expected value.

[Figure: three Gaussian distributions, and the distribution resulting from
superposing them; the expectation is marked and the standard deviation is 1.38]


Probability Distributions, Density Functions

The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:

    p(x) = 1 / sqrt(2πσ²) · exp( −(x − μ)² / (2σ²) ),    μ: mean, σ²: variance

The Gaussian function is entirely determined by its mean and variance.
For this reason, it is often referred to as a parametric distribution.
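A minimal implementation of this formula, with arbitrary example parameters, shows the two defining properties one expects: the density peaks at the mean and is symmetric around it. This is a sketch, not course code:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma = 1.5, 2.0
# peak at the mean: 1/(sigma*sqrt(2*pi)) ≈ 0.1995 for sigma = 2
print(gaussian_pdf(mu, mu, sigma))
# symmetry: equal density one unit to either side of the mean
print(gaussian_pdf(mu - 1, mu, sigma) == gaussian_pdf(mu + 1, mu, sigma))  # True
```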




Marginal, Likelihood

Consider two random variables x and y with joint distribution p(x, y); the marginal
probability of x is then:

    p_x(x) = ∫ p(x, y) dy

Consider that the pdf of x, y is parametrized by μ, σ, s.t. one can compute the
conditional p(x, y | μ, σ). Then, the likelihood function (short: likelihood) of the
model parameters μ, σ is given by:

    L(μ, σ | x, y) := p(x, y | μ, σ)
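For i.i.d. observations the likelihood is the product of per-point densities; in practice one works with its logarithm for numerical stability. The sketch below uses a made-up dataset and Gaussian model purely for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

data = [1.8, 2.1, 2.4, 1.9, 2.3]  # hypothetical i.i.d. observations

def log_likelihood(mu, sigma):
    # log L(mu, sigma | data) = sum_i log p(x_i | mu, sigma)
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in data)

# parameters close to the data should be more likely than distant ones
print(log_likelihood(2.1, 0.3) > log_likelihood(5.0, 0.3))  # True
```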


Maximum Likelihood

Machine learning techniques often assume that the form of the distribution function is
known and that only its parameters must be optimized to best fit a set of observed
datapoints. They then proceed to determine these parameters through maximum
likelihood optimization.

The principle of maximum likelihood consists of finding the optimal parameters of a
given distribution by maximizing the likelihood function of these parameters, or
equivalently by maximizing the probability of the data given the model and its
parameters, e.g.:

    max_{μ,σ} L(μ, σ | x) = max_{μ,σ} p(x | μ, σ)

    ∂/∂μ p(x | μ, σ) = 0   and   ∂/∂σ p(x | μ, σ) = 0

If p is the Gaussian function, then the above has an analytical solution (assuming
that one has enough observations of x to draw from).
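For the Gaussian, the analytical solution is the sample mean and the (1/N) sample variance. The sketch below computes these closed-form estimates on a hypothetical dataset and checks that perturbing either parameter lowers the log-likelihood:

```python
import math

data = [1.8, 2.1, 2.4, 1.9, 2.3]  # hypothetical observations

# closed-form maximum-likelihood estimates for a Gaussian
mu_ml  = sum(data) / len(data)
var_ml = sum((x - mu_ml) ** 2 for x in data) / len(data)  # note: 1/N, not 1/(N-1)

def log_likelihood(mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

# moving either parameter away from the ML estimate lowers the likelihood
best = log_likelihood(mu_ml, var_ml)
print(best >= log_likelihood(mu_ml + 0.1, var_ml))  # True
print(best >= log_likelihood(mu_ml, var_ml + 0.1))  # True
```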


ML in Practice: Caveats on Statistical Measures

A large number of algorithms we will see in class require knowing the mean and
covariance of the probability distribution function of the data.

In practice, the class means and covariances are not known. They can, however, be
estimated from the training set. Either the maximum likelihood estimate or the
maximum a posteriori estimate may be used in place of the exact value.

Several of the algorithms to estimate these assume that the underlying distribution
follows a normal distribution. This is usually not true. Thus, one should keep in mind
that, although the estimates of the covariance may be considered optimal, this does
not mean that the resulting computation obtained by substituting these values is
optimal, even if the assumption of normally distributed classes is correct.


ML in Practice: Caveats on Statistical Measures

Another complication that you will often encounter when dealing with algorithms
that require computing the inverse of the covariance matrix of the data is that, with
real data, the dimensionality of each sample may exceed the number of samples. In
this case, the covariance estimates do not have full rank, i.e. they cannot be inverted.

There are a number of ways to deal with this. One is to use the pseudoinverse of
the covariance matrix. Another is to use singular value decomposition (SVD).
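The rank deficiency is easy to exhibit: with fewer samples than dimensions, the sample covariance is singular. The stdlib-only sketch below (made-up data, no pseudoinverse implementation) computes the covariance of two samples in 3-D and shows its determinant is zero, so no ordinary inverse exists:

```python
# Two samples in 3 dimensions -> the sample covariance has rank <= 1.
samples = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5]]
d, n = 3, len(samples)

# sample mean and (1/N) sample covariance, computed by hand
mean = [sum(s[k] for s in samples) / n for k in range(d)]
cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
        for j in range(d)] for i in range(d)]

def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(abs(det3(cov)) < 1e-12)  # True: singular covariance, not invertible
```

In practice one would hand such a matrix to a pseudoinverse routine (e.g. an SVD-based one) rather than attempt a direct inversion.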


Recall that Sections 2.1-2.2 of the Lecture Notes

must be read before coming to class next week