MACHINE LEARNING – 2013
MACHINE LEARNING
Introduction
Lecturer: Prof. Aude Billard ([email protected])
Assistants: Dr. Basilio Noris, Nicolas Sommer
Practicalities
Lectures and practicals alternate:
• Lectures: 9h15-11h00 + Exercises: 11h15-13h00 (in room MEB331)
• Practicals: 9h15-13h00 (in room GRC02)
Class Timetable
http://lasa.epfl.ch/teaching/lectures/ML_Phd/index.php
Practicalities
Website of the class:
http://lasa.epfl.ch/teaching/lectures/ML_Phd
Lecture notes: Machine Learning Techniques, available at the Librairie Polytechnique.
The course covers selected chapters of the lecture notes; see the website.
Grading
50% of the grade based on personal work. Choice between:
1. A mini-project implementing an algorithm and evaluating its performance and sensitivity to parameter choices (to be done individually).
OR
2. A literature survey on a topic chosen from a list provided in class (can be done in teams of two).
~25-30 hours of personal work, i.e. count on one week of work.
50% based on final Oral Exam
20 minutes of preparation
20 minutes of answers at the blackboard
(closed book, but you are allowed to bring one recto-verso A4 page with personal notes)
Prerequisites
Linear Algebra, Probabilities and Statistics
Basics in ML can be an advantage (otherwise catch up with the lecture notes, available on the website http://lasa.epfl.ch/teaching/lectures/ML_Phd/).
Syllabus
Compulsory reading of background chapters before class!
Today’s class format
• Examples of ML applications
• Taxonomy and basic concepts of ML
• Brief recap of basic maths for the class
• Overview of practicals
What is Machine Learning to you?
What do you think it is used for?
Why are you taking this class?!
Machine Learning, a definition
Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to “learn.” (Machine Learning Journal, Kluwer Academic)
Machine Learning is an area of artificial intelligence involving developing techniques to allow computers to “learn”. More specifically, machine learning is a method for creating computer programs by the analysis of data sets, rather than the intuition of engineers. Machine learning overlaps heavily with statistics, since both fields study the analysis of data. (Webster Dictionary)
Machine learning is a branch of statistics and computer science, which studies algorithms and architectures that learn from data sets. (WordIQ)
What is Machine Learning?
Machine Learning encompasses a large set of algorithms that aim at inferring hidden information from data.
A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, "On separation of semitransparent dynamic images from static background", Proc. Intl. Conf. on Independent Component Analysis
and Blind Signal Separation, pp. 934-940, 2006.
What is Machine Learning?
Recognizing human speech.
Here is the waveform produced when uttering the word “all right”.
The strength of ML algorithms is that they can be applied to arbitrary sets of data: they can recognize patterns in data coming from various sources.
What is Machine Learning?
(Figure: a piano note, and the same note played by an oboe.)
What is Machine Learning?
What is sometimes impossible for humans to see is easy for ML to pick up.
Demo: Eyes-No-Gaze
Demo: Eyes-With-Gaze
What is Machine Learning?
ML algorithms make inferences by analyzing a set of signals or datapoints.
Demo: PCA
(Figure: wrinkles, eyelids and eyelashes used as inputs to Support Vector Regression.)
Noris et al., 2011, Computer Vision and Image Understanding.
What is Machine Learning?
Conversely, things that seem evident to humans may require more than one ML tool and also some intuition for encoding the data.
There is an ambiguity: the two sets of images are distinguishable by both orientation and color. Orientation is spurious information coming from a poor choice of training data; color is the feature we are trying to teach the algorithm.
What is Machine Learning?
A good training set must provide enough information for the algorithm to do proper inference. Here, one must provide images of the two pens in the same set of orientations.
Learning versus Memorization
Learning implies generalizing.
Generalizing consists of extracting key features from the data, matching
those across data (to find resemblances) and storing a generalized
representation of the data features that accounts best (according to a
given metric) for all the small differences across data. Classification and
clustering techniques are examples of methods that generalize by
categorizing the data.
Generalizing is the opposite of memorizing, and one often has to find a tradeoff between over-generalizing, hence losing information about the data, and overfitting, i.e. keeping more information than required.
Generalization is particularly important in order to reduce the influence of noise introduced by the variability of the data.
Taxonomy in ML
• Supervised learning – where the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.
• Reinforcement learning – where the algorithm learns a policy or model of the set of transitions across a discrete set of input-output states (Markovian world) in order to maximize a reward value (external reinforcement).
• Unsupervised learning – where the algorithm learns a model that best represents a set of inputs without any feedback (no desired output, no external reinforcement).
• Learning to learn – where the algorithm learns its own inductive bias based on previous experience.
Examples of ML Applications
Structure Discovery
Raw data: trying to find some structure in the data...
Structure Discovery: example
Methods for spectral analysis, such as linear/kernel PCA - CCA - ICA
aim at finding hidden structure in the data.
(Figure: linear PCA vs. kernel PCA projections and reconstructions.)
Projection of handwritten digits: kernel PCA projections extract some of the texture better and are less sensitive to noise than linear PCA, which boosts reconstruction and recognition of digits (Mika et al., NIPS 2000).
Structure Discovery: example
Methods for spectral analysis, such as linear/kernel PCA - CCA - ICA
aim at finding hidden structure in the data.
Person identification task:
Top row: Query image and 10 candidates in the gallery set.
Bottom row: projections of the query image onto the pre-learned (through kernel PCA)
appearance manifold of the 10 candidates.
Yang et al, Person Reidentification by Kernel PCA Based Appearance Learning,
Canadian Conf. on Computer and Robot Vision (2011)
Structure Discovery
Spectral analysis proceeds by either projecting or lifting the data into a lower-, respectively higher-, dimensional space.
In each projection, groups of datapoints appear more similar than in the original space. Looking at each projection separately allows one to determine which feature each group of datapoints shares.
This can be used in different ways:
- To discard outliers by selecting only the datapoints that have most features in common.
- To group datapoints according to shared features.
- To rank features according to how frequently they appear.
(Figure: data x mapped to feature space F(x) and its projections F1(x), ...)
Structure Discovery
In this class, we will briefly review some of the key novel algorithms for spectral analysis, including:
- Kernel PCA, whose non-linear projections find wide application across a variety of domains;
- Kernel CCA (Canonical Correlation Analysis): a generalization of kernel PCA to comparisons across domains, e.g. combining visual and auditory information;
- Kernel ICA, which attempts to solve more complex blind source separation problems using non-linear projections.
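To make the kernel PCA idea concrete, here is a minimal sketch, assuming scikit-learn and a toy two-circles dataset (the dataset and the gamma value are illustrative assumptions, not from the lecture):

```python
# Minimal sketch: linear PCA vs. kernel PCA on toy data (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X)                    # linear projection
ker = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# The two rings stay entangled under linear PCA, but along the first
# kernel-PCA component they become linearly separable.
print(lin[:3])
print(ker[:3])
```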
Clustering
Clustering encompasses a large set of methods that try to find groups of patterns that are similar in some way.
Hierarchical clustering builds a tree-like structure by pairing datapoints according to increasing levels of similarity.
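A minimal sketch of this pairing process, assuming SciPy and a made-up two-blob dataset:

```python
# Minimal sketch: hierarchical (agglomerative) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # one group around (0, 0)
               rng.normal(3, 0.3, (20, 2))])  # another around (3, 3)

Z = linkage(X, method="ward")                    # tree built by merging closest groups
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```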
Clustering: example
Hierarchical clustering can be used with arbitrary sets of data.
Example: hierarchical clustering to discover similar temporal patterns of crime across districts in India.
Chandra et al., “A Multivariate Time Series Clustering Approach for Crime Trends Prediction”, IEEE SMC 2008.
Clustering: example
Clustering is used in computer vision for pre-processing and post-processing of images.
Multispectral medical image segmentation (left: MRI image from one channel; right: classification from 9-cluster semi-supervised learning). Clusters should identify patterns such as cerebro-spinal fluid, white matter, striated muscle, tumor (Lundervolt et al., 1996).
Clustering: example
Clustering assumes groups of points are somewhat similar according to the same metric of similarity.
All current clustering techniques fail at clustering the above seven groups of points.
Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters
Clustering
Different techniques or heuristics can be developed to help the
algorithm determine the right boundaries across clusters:
Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters
Clustering
We will point out some of the emerging and useful research directions to
tackle key issues in designing clustering algorithms:
• semi-supervised clustering,
• ensemble clustering,
• simultaneous feature selection during data clustering,
• large scale data clustering.
In this class, we will briefly review some algorithms for spectral
clustering, starting with K-means and moving to advanced methods
such as Kernel K-means.
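As a baseline for the methods above, a minimal K-means sketch, assuming scikit-learn and an illustrative blob dataset:

```python
# Minimal sketch: K-means clustering (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignment of the first 10 points
```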
Classification
Classification is a supervised clustering process.
Classification is usually multi-class: given a set of known classes, the algorithm learns to extract combinations of data features that best predict the true class of the data.
(Figure: original data, and the result after 4-class classification using an SVM.)
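A minimal sketch of supervised multi-class classification with an SVM, assuming scikit-learn and synthetic 4-class data (not the data in the figure):

```python
# Minimal sketch: 4-class classification with an SVM (scikit-learn assumed).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=400, centers=4, random_state=0)   # 4 known classes
clf = SVC(kernel="rbf", C=1.0).fit(X, y)   # learns boundaries between the classes
print(clf.predict(X[:5]))                  # predicted class labels
print(y[:5])                               # true class labels
```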
Classification: example
Classification of finance data to assess solvability using Support
Vector Machine (SVM).
Swiderski et al., “Multistage classification by using logistic regression and neural networks for assessment of financial condition of company”, Decision Support Systems, 2012.
(Figure: 5 classes of insolvency risk: excellent, good, satisfactory, passable, poor.)
Classification: issues
A recurrent problem when applying classification to real-life problems is that classes are often very unbalanced.
This can drastically affect classification performance, as classes with many datapoints have more influence on the error measure during training.
In this class, you will get the chance to practice this using real datasets during the computer-based practical session, and to discuss what one must do in the case of unbalanced classes.
(Figure: the data from Swiderski et al. contain more positive examples than negative examples.)
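One common remedy, sketched here under the assumption of scikit-learn and synthetic data, is to reweight each class's contribution to the training error:

```python
# Minimal sketch: reweighting classes so the rare class is not drowned out.
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.svm import SVC

# 95% of the points in one class, 5% in the other.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

for name, clf in [("plain", SVC()), ("balanced", SVC(class_weight="balanced"))]:
    clf.fit(X, y)
    print(name, balanced_accuracy_score(y, clf.predict(X)))
```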
Regression
Regression is a supervised machine learning technique.
Non-linear regression techniques, such as Support Vector
Regression and Gaussian Process Regression, model the non-linear
relationships across the data.
Predict output $y$ given input $x$ through a non-linear function $f$: $y = f(x)$.
Estimate the $\hat{f}$ that best predicts the set of training points $\{(x_i, y_i)\}_{i=1,\dots,M}$.
(Figure: training points $(x_1, y_1), \dots, (x_4, y_4)$ in the $x$-$y$ plane.)
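A minimal sketch of this setting with Support Vector Regression, assuming scikit-learn and a made-up noisy sine dataset:

```python
# Minimal sketch: estimating a non-linear f from training pairs {x_i, y_i}.
import numpy as np
from sklearn.svm import SVR

x = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * np.random.default_rng(0).normal(size=100)

f_hat = SVR(kernel="rbf", C=10.0).fit(x, y)   # estimated regression function
print(f_hat.predict([[np.pi / 2]]))           # should be close to sin(pi/2) = 1
```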
Regression: example
SVR for predicting cumulative log return over a period of 2500 days. Contrasts two methods to automatically determine the optimal features (i.e. moving averages).
Found that short-term (daily and weekly) trends had a bigger impact than long-term (monthly and quarterly) trends in predicting the next-day return.
Wang & Zhu, Financial market forecasting using a two-step kernel learning method for the support vector regression, Annals of Op. Research, 2010.
Regression: example
SVR for predicting the optimal position and orientation of the golf
club to hit the ball in a golf experiment.
Kronander, Khansari and Billard, JTSC award, IEEE Int. Conf. on Int. and Rob. Systems 2011.
Contrasts the predictions of two methods (Gaussian Process Regression and Gaussian Mixture Regression) in terms of precision and generalization.
(Figure: GPR vs. GMR predictions.)
Regression
Machine learning techniques for non-linear regression are model-free.
They estimate both the function and its parameters (through a density-based estimate of the data distribution).
In this class, we will:
- Compare three of the major non-linear regression techniques;
- Show similarities (same mathematical framework);
- Discuss differences (parameter estimation, objective function);
- Determine which technique is best suited when.
Machine Learning in Practice
The choice of dataset for training the algorithm is crucial and strongly biases performance.
Estimating $\hat{y}$ from sampling the datapoints.
Fit $Y = aX + b$ by regression minimizing the Mean Square Error:
$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y(x_i) - \hat{y}(x_i) \right)^2$
(Figure: sampled datapoints and the fitted line in the $x$-$y$ plane.)
(Figure: the same regression $Y = aX + b$ minimizing the MSE, estimated from a different sampling of the datapoints, yields a different estimate $\hat{y}$.)
The choice of training data (the training set) is crucial; hence crossvalidation.
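A minimal sketch of the fit itself, assuming NumPy and synthetic noisy line data; np.polyfit with degree 1 returns the closed-form minimizer of this MSE:

```python
# Minimal sketch: fitting y = a*x + b by minimizing the mean-square error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)   # noisy samples of a line

a, b = np.polyfit(x, y, deg=1)             # closed-form MSE minimizer
mse = np.mean((y - (a * x + b)) ** 2)
print(a, b, mse)                           # a close to 2, b close to 1
```

Resampling x and y with a different seed and refitting shows how much the estimated line moves with the training set.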
ML in Practice: Training and Evaluation
Best practice to assess the validity of a Machine Learning algorithm is to
measure its performance against the training, validation and testing sets.
These sets are built from partitioning the data set at hand.
(Figure: the data set partitioned into a training set, a validation set and a testing set; crossvalidation loops over the first two.)
Training and validation sets are used to determine the sensitivity of the learning to the choice of hyperparameters (i.e. parameters not learned during training). Values for the hyperparameters are set through a grid search.
Once the optimal hyperparameters have been picked, the model is trained on the complete training + validation set and tested on the testing set.
In practice, one often uses solely training and testing sets and performs crossvalidation directly on these.
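A minimal sketch of this workflow, assuming scikit-learn; the grid values and the dataset are illustrative:

```python
# Minimal sketch: grid search with crossvalidation, then a final test score.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}        # hyperparameter grid
search = GridSearchCV(SVC(), grid, cv=5).fit(X_tr, y_tr)   # CV inside the training part

print(search.best_params_)        # picked using training + validation folds only
print(search.score(X_te, y_te))   # final score on the untouched testing set
```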
ML in Practice: Training and Evaluation
Choice of training / testing ratio
Avoid overfitting: train the classifier with a small sample of all datapoints and test it with the remaining datapoints.
A typical choice of training/testing set ratio is 2/3 training, 1/3 testing. The smaller the ratio, the more robust the classification.
N-fold crossvalidation: repeats the procedure N times by randomly picking points from the dataset to create the training set.
A typical choice is 10-fold crossvalidation, although this should depend on the number of datapoints you have!
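A minimal N-fold crossvalidation sketch, assuming scikit-learn's KFold and synthetic data:

```python
# Minimal sketch: 10-fold crossvalidation by hand.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])          # train on 9 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out fold
print(sum(scores) / len(scores))                         # average over the 10 folds
```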
Performance measures in ML
(Figure: learning curve, performance vs. time.)
How long can it take before an acceptable level of performance is achieved? What would be an optimal learning curve? When is “good enough” achieved?
Progress in a machine’s performance must be measurable and must be significant. A machine must eventually reach a minimal level of performance (“good enough”) within an acceptable time frame.
Performance measures in ML
These vary and depend entirely on the algorithm and the function you wish to optimize.
Performance measures for supervised learning algorithms are well defined and relate directly to a distance measure between the desired output and the estimated one.
In classification, people tend to compute performance as the % of items correctly classified. This can be very misleading if instances of each class are not well balanced and if there is a lot of variation in classification across classes. The old-fashioned measures of mean, median and std remain very reliable measures of performance.
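A minimal sketch of why the global % correct misleads, assuming NumPy/scikit-learn and a made-up 95/5 class split:

```python
# Minimal sketch: global accuracy vs. per-class accuracy on unbalanced labels.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0] * 95 + [1] * 5)   # unbalanced ground truth
y_pred = np.zeros(100, dtype=int)       # a classifier that always answers 0

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
per_class = cm.diagonal() / cm.sum(axis=1)    # accuracy (recall) of each class
print((y_true == y_pred).mean(), per_class)   # 95% global, yet 0% on class 1
```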
Performance measures in ML
Performance often depends on choosing parameters well, e.g. the threshold in classification, e.g. in naive Bayes classification.
Bayes rule for binary classification: $x$ has class label $+1$ if $P(y = 1 \mid x) \geq \theta$ (a chosen threshold), else $x$ has class label $-1$.
(Figure: $P(y = 1 \mid x)$ plotted against $x$, with the decision threshold.)
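A minimal sketch of such a threshold, assuming scikit-learn's GaussianNB and synthetic data; the 0.7 value is an arbitrary illustration:

```python
# Minimal sketch: thresholding P(y=1|x) in naive Bayes classification.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=0)
nb = GaussianNB().fit(X, y)

threshold = 0.7                          # moving it trades off the two error types
p1 = nb.predict_proba(X)[:, 1]           # P(y = 1 | x) for every point
labels = (p1 >= threshold).astype(int)   # label +1 iff P(y=1|x) >= threshold
print(labels[:10])
```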
Performance measures in ML: Ground truth
1/ Comparing the performance of a novel method to existing ones or trivial baselines is crucial. Try to have the “ground truth”. This often means hand-coded solutions (assuming humans outperform the machine).
2/ Reusing the testing set, even in a way that seems reasonable, is always dangerous. It is extremely hard to predict how much it artificially improves the estimated performance.
Some Machine Learning Resources
• http://www.machinelearning.org/index.html
• http://www.pascal-network.org/ Network of excellence on Pattern Recognition, Statistical Modelling and Computational Learning (summer schools and workshops)
Databases:
• http://expdb.cs.kuleuven.be/expdb/index.php
• http://archive.ics.uci.edu/ml/
Journals:
• Machine Learning Journal, Kluwer Publisher
• IEEE Transactions on Signal Processing
• IEEE Transactions on Pattern Analysis
• IEEE Transactions on Pattern Recognition
• The Journal of Machine Learning Research
Conferences:
• ICML: Int. Conf. on Machine Learning
• NIPS: Neural Information Processing Systems conference, with an on-line repository of all research papers, www.nips.org
Topics for Literature Survey and Mini-Projects
Topics for the survey will entail:
- Survey of clustering methods applied to finance
- Survey of classification methods applied to biometric data
Topics for the mini-project will entail implementing one of these:
- Clustering techniques (DBSCAN, FLAME, KMEANS++)
- Regression (Gradient Boosting, Locally Weighted Regression)
The exact list of topics for the lit. survey and mini-project will be posted by March 8.
Overview of Practicals
Brief recap of basic maths for the class
Math Background Needed for this Class
Probability, Statistics: covariance, pdf, ….
Linear Algebra: formal notation, matrix inversion, …
Derivatives: partial derivatives, gradient, Jacobian, …
Optimization: normalized/weighted MSE, Lagrange multipliers, …
Discrete Probabilities
Consider two variables $x$ and $y$ taking discrete values over the intervals $[x_1, \dots, x_M]$ and $[y_1, \dots, y_N]$ respectively.
$P(x = x_i)$: the probability that the variable $x$ takes value $x_i$, with
$0 \leq P(x = x_i) \leq 1, \quad i = 1, \dots, M, \qquad \text{and} \qquad \sum_{i=1}^{M} P(x = x_i) = 1.$
Idem for $P(y = y_j)$, $j = 1, \dots, N$.
Discrete Probabilities
The joint probability that the two events A (variable $x$ takes value $x_i$) and B (variable $y$ takes value $y_j$) occur is expressed as:
$P(A, B) = P(A \cap B) = P(x = x_i, y = y_j)$
$P(A \mid B)$ is the conditional probability that event A will take place if event B already took place:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Bayes' theorem:
$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$
Discrete Probabilities
The so-called marginal probability that variable $x$ takes value $x_i$ is given by:
$P_x(x = x_i) := \sum_{j=1}^{N} P(x = x_i, y = y_j)$
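A minimal numerical check of these definitions, assuming NumPy; the 2x2 joint table is made up for illustration:

```python
# Minimal sketch: joint, marginal, conditional and Bayes' theorem on a table.
import numpy as np

P = np.array([[0.10, 0.20],   # P(x=x_i, y=y_j); rows index x, columns index y
              [0.30, 0.40]])
assert np.isclose(P.sum(), 1.0)     # probabilities sum to 1

P_x = P.sum(axis=1)                 # marginal: P(x=x_i) = sum_j P(x_i, y_j)
P_y = P.sum(axis=0)                 # marginal: P(y=y_j)
P_x_given_y0 = P[:, 0] / P_y[0]     # conditional: P(x | y=y_0)

# Bayes: P(y=y_0 | x=x_0) = P(x=x_0 | y=y_0) * P(y=y_0) / P(x=x_0)
via_bayes = (P[0, 0] / P_y[0]) * P_y[0] / P_x[0]
direct = P[0, 0] / P_x[0]
print(P_x, P_x_given_y0, via_bayes, direct)   # the last two values agree
```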
Probability Distributions, Density Functions
$p(x)$, a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of variable $x$:
$p(x) \geq 0 \;\; \forall x, \qquad \int p(x) \, dx = 1$
The pdf is not bounded by 1. It can grow unbounded, depending on the value taken by $x$.
(Figure: $p(x)$ plotted against $x$.)
Probability Distributions, Density Functions
The probability that the variable $x$ takes a value in the subinterval $[a, b]$ is given by:
$P(a \leq x \leq b) = D(a \leq x \leq b) := \int_a^b p(x) \, dx$
$p(x)\,dx$ ~ probability of $x$ falling within an infinitesimal interval $[x, x + dx]$.
The cumulative distribution function (or simply distribution function) of $x$ is:
$D(x) = \int_{-\infty}^{x} p(x') \, dx', \qquad p(x) = \frac{d}{dx} D(x)$
(Figure: $p(x)$ and $D(x)$ plotted against $x$.)
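A minimal numerical check of both claims (a density can exceed 1 yet integrate to 1, and the CDF is the running integral of the pdf), assuming NumPy and a narrow Gaussian with sigma = 0.1:

```python
# Minimal sketch: a pdf above 1, unit total mass, and the CDF as its integral.
import numpy as np

mu, sigma = 0.0, 0.1
x = np.linspace(-1, 1, 20001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

print(p.max())             # ~3.99: the density exceeds 1
print((p * dx).sum())      # ~1.0: the total probability is still 1
D = np.cumsum(p) * dx      # crude cumulative distribution D(x)
print(D[-1])               # ~1.0; differentiating D recovers p
```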
Parametric PDF
The Gaussian function is entirely determined by its mean and variance.
For this reason, it is often referred to as a parametric distribution.
For pdfs other than the Gaussian, the variance represents a notion of dispersion around the expected value.
Expectation
The expectation of the probability $P(x)$ (in the discrete case) or of the pdf $p(x)$ (in the continuous case), also called the expected value or mean, is the average value weighted by $P(x)$ or $p(x)$:
When $x$ takes discrete values: $E[x] = \sum_i x_i \, P(x_i)$
For continuous distributions: $E[x] = \int x \, p(x) \, dx$
Variance
$\sigma^2$, the variance of a distribution, measures the amount of spread of the distribution around its mean:
$\mathrm{Var}(x) = \sigma^2 = E\left[(x - E[x])^2\right] = E[x^2] - (E[x])^2$
$\sigma$ is the standard deviation of $x$.
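A minimal check that the two variance expressions agree, assuming NumPy and Gaussian samples with mean 2 and standard deviation 3:

```python
# Minimal sketch: Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2.
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=100_000)

mean = x.mean()                      # E[x], close to 2
var1 = ((x - mean) ** 2).mean()      # E[(x - E[x])^2]
var2 = (x ** 2).mean() - mean ** 2   # E[x^2] - (E[x])^2
print(mean, var1, var2)              # the two variances agree, close to 9
print(np.sqrt(var1))                 # standard deviation, close to 3
```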
Mean and variance in PDF
For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.
(Figure: three Gaussian distributions, and the resulting distribution when superposing the three, with its expectation marked and std = 1.38.)
Probability Distributions, Density Functions
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \qquad \mu: \text{mean}, \;\; \sigma^2: \text{variance}$
The Gaussian function is entirely determined by its mean and variance.
For this reason, it is often referred to as a parametric distribution.
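A minimal sketch writing the formula out and checking it against SciPy (the values of mu and sigma are arbitrary):

```python
# Minimal sketch: the Normal pdf formula vs. scipy.stats.norm.pdf.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.array([-1.0, 0.0, 1.0, 3.0])

p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))   # True
```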
Marginal, Likelihood
Consider two random variables $x$ and $y$ with joint distribution $p(x, y)$; then the marginal probability of $x$ given $y$ is:
$p_x(x) = \int p(x, y) \, dy$
Consider that the pdf of $x, y$ is parametrized, s.t. one can compute the conditional $p(x, y \mid \theta_x, \theta_y)$.
Then, the likelihood function (short: likelihood) of the model parameters $\theta_x, \theta_y$ is given by:
$L(\theta_x, \theta_y \mid x, y) := p(x, y \mid \theta_x, \theta_y)$
Maximum Likelihood
Machine learning techniques often assume that the form of the distribution function is known and that only its parameters must be optimized to best fit a set of observed datapoints. One then proceeds to determine these parameters through maximum likelihood optimization.
The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, equivalently by maximizing the probability of the data given the model and its parameters, e.g.:
$\max_{\mu, \sigma} L(\mu, \sigma \mid x) = \max_{\mu, \sigma} p(x \mid \mu, \sigma)$
$\frac{\partial}{\partial \mu} p(x \mid \mu, \sigma) = 0 \quad \text{and} \quad \frac{\partial}{\partial \sigma} p(x \mid \mu, \sigma) = 0$
If $p$ is the Gaussian function, then the above has an analytical solution (assuming that one has enough observations of $x$ to draw from).
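A minimal sketch of that analytical solution, assuming NumPy/SciPy: for a Gaussian, the maximum likelihood estimates are the sample mean and the (biased) sample variance:

```python
# Minimal sketch: maximum likelihood estimates of a Gaussian's mu and sigma.
import numpy as np
from scipy.stats import norm

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = x.mean()                                # argmax of the likelihood in mu
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())  # argmax in sigma (biased form)
print(mu_hat, sigma_hat)                         # close to 5 and 2
print(norm.fit(x))                               # scipy's MLE returns the same pair
```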
ML in Practice: Caveats on Statistical Measures
A large number of algorithms we will see in class require knowing the
mean and covariance of the probability distribution function of the data.
In practice, the class means and covariances are not known. They can,
however, be estimated from the training set. Either the maximum
likelihood estimate or the maximum a posteriori estimate may be used in
place of the exact value.
Several of the algorithms to estimate these assume that the underlying
distribution follows a normal distribution. This is usually not true. Thus,
one should keep in mind that, although the estimates of the covariance
may be considered optimal, this does not mean that the resulting
computation obtained by substituting these values is optimal, even if the
assumption of normally distributed classes is correct.
Another complication that you will often encounter when dealing with algorithms that require computing the inverse of the covariance matrix of the data is that, with real data, the dimensionality of each sample (the number of observations per sample) often exceeds the number of samples. In this case, the covariance estimates do not have full rank, i.e. they cannot be inverted.
There are a number of ways to deal with this. One is to use the pseudoinverse of the covariance matrix. Another is to proceed through singular value decomposition (SVD).
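A minimal sketch of the rank problem and the pseudoinverse fix, assuming NumPy and made-up data with 5 samples of dimension 20:

```python
# Minimal sketch: a rank-deficient covariance and its pseudoinverse.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))      # 5 samples, 20 dimensions each
C = np.cov(X, rowvar=False)       # 20x20 covariance estimate

print(np.linalg.matrix_rank(C))   # at most 4 < 20: C cannot be inverted
C_pinv = np.linalg.pinv(C)        # pseudoinverse, computed via SVD
print(C_pinv.shape)               # usable in place of the exact inverse
```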
Recall that Sections 2.1-2.2 of the Lecture Notes
must be read before coming to class next week