Lecture 1: Introduction to Machine Learning
Isabelle Guyon
What is Machine Learning?

[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; at prediction time, a query is sent to the trained machine, which returns an answer.]
What for?
• Classification
• Time series prediction
• Regression
• Clustering
Applications
[Chart: application domains positioned by number of training examples and number of inputs, each axis running from 10 to 10^5: Bioinformatics, Ecology, OCR/HWR, Market Analysis, Text Categorization, Machine Vision, System diagnosis.]
Banking / Telecom / Retail
• Identify:
  – Prospective customers
  – Dissatisfied customers
  – Good customers
  – Bad payers
• Obtain:
  – More effective advertising
  – Less credit risk
  – Less fraud
  – Decreased churn rate
Biomedical / Biometrics
• Medicine:
  – Screening
  – Diagnosis and prognosis
  – Drug discovery
• Security:
  – Face recognition
  – Signature / fingerprint / iris verification
  – DNA fingerprinting
Computer / Internet
• Computer interfaces:
  – Troubleshooting wizards
  – Handwriting and speech
  – Brain waves
• Internet:
  – Hit ranking
  – Spam filtering
  – Text categorization
  – Text translation
  – Recommendation
Conventions
[Notation: data matrix X = {x_ij}, with m rows and n columns; x_i = the i-th example (a row of X); y = {y_i} = the vector of target values; w = the weight vector.]
Learning problem

Colon cancer data, Alon et al., 1999.

Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.

Data matrix X:
• m lines = patterns (data points, examples): samples, patients, documents, images, …
• n columns = features (attributes, input variables): genes, proteins, words, pixels, …
Some Learning Machines
• Linear models
• Kernel methods
• Neural networks
• Decision trees
Linear Models
• f(x) = w⋅x + b = Σ_{j=1:n} w_j x_j + b
  Linearity in the parameters, NOT in the input components.
• f(x) = w⋅Φ(x) + b = Σ_j w_j φ_j(x) + b   (Perceptron)
• f(x) = Σ_{i=1:m} α_i k(x_i, x) + b   (Kernel method)
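To make the three functional forms concrete, here is a minimal NumPy sketch (not from the lecture); the basis function `phi` and the kernel `k` passed in are illustrative placeholders, not anything prescribed by the slides.

```python
import numpy as np

# A sketch of the three functional forms above. The basis function `phi` and the
# kernel `k` passed in below are illustrative placeholders, not taken from the lecture.

def f_linear(x, w, b):
    """f(x) = w . x + b : linear in the inputs and in the parameters."""
    return np.dot(w, x) + b

def f_basis(x, w, b, phi):
    """f(x) = w . Phi(x) + b : linear in w, possibly non-linear in x (Perceptron form)."""
    return np.dot(w, phi(x)) + b

def f_kernel(x, X_train, alpha, b, k):
    """f(x) = sum_i alpha_i k(x_i, x) + b : kernel-method form."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b

rng = np.random.default_rng(0)
x, w = rng.normal(size=3), rng.normal(size=3)
print(f_linear(x, w, b=0.1))
print(f_basis(x, w, 0.1, phi=lambda v: v ** 2))        # quadratic basis functions
X_train, alpha = rng.normal(size=(4, 3)), rng.normal(size=4)
print(f_kernel(x, X_train, alpha, 0.0, k=np.dot))      # linear kernel as an example
```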
Artificial Neurons
f(x) = w⋅x + b

[Diagram: inputs x_1, x_2, …, x_n and a constant input 1 are multiplied by the weights w_1, w_2, …, w_n and the bias b (the synapses), summed into the cell potential f(x), and passed through an activation function; dendrites carry in the activation of other neurons, the axon carries the output.]

McCulloch and Pitts, 1943
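A hypothetical McCulloch-Pitts-style unit can be sketched in a few lines: the cell potential is w⋅x + b, and a step activation decides whether the unit fires. The logical-AND example and its hand-picked weights below are purely illustrative, not from the slides.

```python
import numpy as np

# Hypothetical McCulloch-Pitts-style threshold unit: the cell potential is w . x + b
# and a step activation decides whether the unit "fires".

def neuron_output(x, w, b):
    potential = np.dot(w, x) + b        # cell potential f(x) = w . x + b
    return 1 if potential > 0 else 0    # step activation function

w, b = np.array([1.0, 1.0]), -1.5       # fires only when both inputs are on (logical AND)
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, neuron_output(np.array(x, dtype=float), w, b))
```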
Linear Decision Boundary
[3-D plot: data points in feature space (x_1, x_2, x_3), with axes from -0.5 to 0.5, separated by a hyperplane (the linear decision boundary).]
Perceptron
Rosenblatt, 1957
f(x) = w⋅Φ(x) + b

[Diagram: inputs x_1, x_2, …, x_n (plus a constant 1) are mapped through basis functions φ_1(x), φ_2(x), …, φ_N(x), whose outputs are weighted by w_1, w_2, …, w_N and the bias b and summed to give f(x).]
Non-Linear Decision Boundary

[3-D plot: data points in feature space (x_1, x_2, x_3), here the genes Hs.128749, Hs.234680, Hs.7780, with axes from -0.5 to 0.5, separated by a non-linear decision boundary.]
Kernel Method
Potential functions, Aizerman et al., 1964

f(x) = Σ_i α_i k(x_i, x) + b

[Diagram: inputs x_1, x_2, …, x_n (plus a constant 1) feed the kernel units k(x_1, x), k(x_2, x), …, k(x_m, x), whose outputs are weighted by α_1, α_2, …, α_m and the bias b and summed to give f(x).]

k(⋅, ⋅) is a similarity measure or “kernel”.
What is a Kernel?

A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s)⋅Φ(t)

But we do not need to know the representation Φ.

Examples:
• k(s, t) = exp(-||s - t||²/σ²)   Gaussian kernel
• k(s, t) = (s⋅t)^q   Polynomial kernel
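A short sketch of these two example kernels; the width sigma and the degree q are illustrative parameter choices, and the Gaussian normalization is an assumption since the slide's formula is partly garbled in this transcript.

```python
import numpy as np

# The two example kernels; sigma and q are illustrative parameter choices.

def gaussian_kernel(s, t, sigma=1.0):
    """k(s, t) = exp(-||s - t||^2 / sigma^2)"""
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    """k(s, t) = (s . t)^q"""
    return np.dot(s, t) ** q

s, t = np.array([1.0, 0.5]), np.array([0.2, 1.0])
print(gaussian_kernel(s, t), polynomial_kernel(s, t))
```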
Hebb’s Rule
w_j ← w_j + y_i x_ij

[Diagram: a synapse with weight w_j connects the activation x_j of another neuron, via its axon, to a dendrite of the neuron computing y.]

Link to “Naïve Bayes”.
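A minimal sketch of Hebb's rule applied across a data matrix X (m × n) with labels y in {-1, +1}; the toy data below are made up for illustration.

```python
import numpy as np

# Hebb's rule across a data matrix X (m x n) with labels y in {-1, +1}:
# the pass over example i adds y_i * x_ij to weight w_j.

def hebb_train(X, y):
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        w += yi * xi              # w_j <- w_j + y_i * x_ij, for all j at once
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
print(hebb_train(X, y))           # -> [2. 0.]
```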
Kernel “Trick” (for Hebb’s rule)
• Hebb’s rule for the Perceptron:
  w = Σ_i y_i Φ(x_i)
  f(x) = w⋅Φ(x) = Σ_i y_i Φ(x_i)⋅Φ(x)
• Define a dot product: k(x_i, x) = Φ(x_i)⋅Φ(x)
  f(x) = Σ_i y_i k(x_i, x)
Kernel “Trick” (general)
• Dual form: f(x) = Σ_i α_i k(x_i, x), with k(x_i, x) = Φ(x_i)⋅Φ(x)
• Primal form: f(x) = w⋅Φ(x), with w = Σ_i α_i Φ(x_i)
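A quick numerical check of this primal/dual equivalence, assuming the degree-2 polynomial kernel and its explicit feature map φ(x) = (x_1², x_2², √2·x_1x_2) for 2-D inputs (an illustrative choice, not from the slides):

```python
import numpy as np

# Numerical check of the primal/dual identity, assuming the degree-2 polynomial kernel
# k(s, t) = (s . t)^2 and its explicit feature map for 2-D inputs,
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), so that phi(s) . phi(t) = (s . t)^2.

def k(s, t):
    return np.dot(s, t) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # training points x_i
alpha = rng.normal(size=5)           # arbitrary coefficients alpha_i
x = rng.normal(size=2)               # a test point

w = sum(a * phi(xi) for a, xi in zip(alpha, X))         # w = sum_i alpha_i phi(x_i)
f_primal = np.dot(w, phi(x))                            # f(x) = w . phi(x)
f_dual = sum(a * k(xi, x) for a, xi in zip(alpha, X))   # f(x) = sum_i alpha_i k(x_i, x)
print(np.isclose(f_primal, f_dual))                     # -> True
```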
Simple Kernel Methods

Primal form: f(x) = w⋅Φ(x), with w = Σ_i α_i Φ(x_i)
Dual form: f(x) = Σ_i α_i k(x_i, x), with k(x_i, x) = Φ(x_i)⋅Φ(x)

• Perceptron algorithm (Rosenblatt, 1958): w ← w + y_i Φ(x_i) if y_i f(x_i) < 0
  Dual (potential function algorithm, Aizerman et al., 1964): α_i ← α_i + y_i if y_i f(x_i) < 0
• Minover, optimum margin (Krauth-Mézard, 1987): w ← w + y_i Φ(x_i) for the example minimizing y_i f(x_i)
  Dual minover: α_i ← α_i + y_i for the example minimizing y_i f(x_i)
  (ancestor of SVM, 1992; similar to kernel Adatron, 1998, and SMO, 1999)
• LMS regression: w ← w + (y_i - f(x_i)) Φ(x_i)
  Dual LMS: α_i ← α_i + (y_i - f(x_i))
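Here is a minimal sketch of the perceptron update in its primal form and the potential-function update in its dual form, on made-up, linearly separable toy data; Φ is taken as the identity and the bias b is omitted, so this only illustrates the update rules, not the original algorithms in full.

```python
import numpy as np

# Perceptron update (primal) and potential-function update (dual) on toy data.
# Phi is the identity and the bias b is omitted; this is only an illustration.

def perceptron_primal(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:          # misclassified: y_i f(x_i) <= 0
                w += yi * xi                     # w <- w + y_i Phi(x_i)
    return w

def perceptron_dual(X, y, k=np.dot, epochs=10):
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            f_xi = sum(a * k(xj, xi) for a, xj in zip(alpha, X))
            if yi * f_xi <= 0:                   # misclassified
                alpha[i] += yi                   # alpha_i <- alpha_i + y_i
    return alpha

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_primal(X, y)
alpha = perceptron_dual(X, y)
print(np.sign(X @ w))                                                            # primal predictions
print(np.sign([sum(a * np.dot(xj, x) for a, xj in zip(alpha, X)) for x in X]))   # dual predictions
```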
Multi-Layer Perceptron
Back-propagation, Rumelhart et al., 1986

[Diagram: inputs x_j feed a layer of “hidden units” (internal “latent” variables), whose outputs are combined to produce the network output.]
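A hypothetical forward pass of a one-hidden-layer network, to make the role of the hidden units concrete; the tanh activation and the layer sizes are arbitrary choices, and back-propagation (the training step) is not shown.

```python
import numpy as np

# Hypothetical forward pass of a one-hidden-layer network: the hidden units play the
# role of the internal "latent" variables. Sizes and activation are arbitrary choices.

def mlp_forward(x, W1, b1, w2, b2):
    h = np.tanh(W1 @ x + b1)        # hidden units: learned, non-linear features of x
    return np.dot(w2, h) + b2       # output: a linear combination of the hidden units

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                   # n = 4 inputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)     # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0
print(mlp_forward(x, W1, b1, w2, b2))
```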
Chessboard Problem
Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.

[Diagram: starting from all the data, successive splits are chosen on the features f_1 and f_2, each split picked to reduce entropy.]
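A sketch of the splitting criterion described above: pick the feature (and threshold) whose split most reduces the entropy of the labels. Binary labels and a brute-force search over observed thresholds are simplifying assumptions, not details from the slides.

```python
import numpy as np

# Splitting criterion sketch: pick the (feature, threshold) pair whose split most
# reduces the entropy of the class labels (binary labels assumed).

def entropy(y):
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0                          # a pure node has zero entropy
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split(X, y):
    best = (None, None, -1.0)               # (feature, threshold, entropy reduction)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - children    # entropy reduction of this split
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[0.1, 1.0], [0.2, 0.9], [0.8, 1.1], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))                     # splits on feature 0
```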
Iris Data (Fisher, 1936)
[Figure: decision boundaries of four classifiers on the Iris data: a linear discriminant, a tree classifier, a Gaussian mixture, and a kernel method (SVM), separating the classes setosa, versicolor, and virginica.]
Figure from Norbert Jankowski and Krzysztof Grabczewski
Fit / Robustness Tradeoff

[Figure: two decision boundaries in the (x_1, x_2) plane: one fits the training data tightly, the other is smoother but more robust.]
Performance evaluation

[Figure: two decision functions in the (x_1, x_2) plane, each showing the decision boundary f(x) = 0 and the regions f(x) > 0 and f(x) < 0.]
Performance evaluation

[Figure: the same two decision functions, now showing the level set f(x) = -1 and the regions f(x) > -1 and f(x) < -1.]
Performance evaluation

[Figure: the same two decision functions, now showing the level set f(x) = 1 and the regions f(x) > 1 and f(x) < 1.]
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.

[Plot: ROC curve. Horizontal axis: 1 - negative class success rate (false alarm rate, 1 - specificity), from 0 to 100%. Vertical axis: positive class success rate (hit rate, sensitivity), from 0 to 100%. Shown: the actual ROC, the random ROC, and the ideal ROC curve.]
ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.

0 ≤ AUC ≤ 1

[Plot: same axes. Horizontal: 1 - negative class success rate (false alarm rate, 1 - specificity). Vertical: positive class success rate (hit rate, sensitivity). Shown: the actual ROC, the ideal ROC curve (AUC = 1), and the random ROC (AUC = 0.5).]
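A sketch of how an ROC curve and its AUC can be computed from scores f(x): sweep a threshold over the scores and record, at each threshold, the hit rate and the false alarm rate. The toy scores and labels below are made up.

```python
import numpy as np

# Tracing an ROC curve from scores f(x): sweep a threshold over the scores and record,
# at each threshold, the hit rate (sensitivity) and false alarm rate (1 - specificity).

def roc_curve(scores, labels):
    points = [(0.0, 0.0)]
    for thresh in np.sort(scores)[::-1]:
        pred = scores >= thresh
        false_alarm = np.mean(pred[labels == 0])     # 1 - negative class success rate
        hit_rate = np.mean(pred[labels == 1])        # positive class success rate
        points.append((false_alarm, hit_rate))
    return points

def auc(points):
    """Area under the ROC curve (trapezoidal rule): 0.5 for random scores, 1 if ideal."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
pts = roc_curve(scores, labels)
print(pts)
print(auc(pts))                                      # -> 0.888...
```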
What is a Risk Functional?
A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
Examples:
• Classification:
  – Error rate: (1/m) Σ_{i=1:m} 1(F(x_i) ≠ y_i)
  – 1 - AUC
• Regression:
  – Mean square error: (1/m) Σ_{i=1:m} (f(x_i) - y_i)²
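The two example risk functionals written as empirical estimates over m examples, as a minimal sketch with made-up predictions:

```python
import numpy as np

# Empirical versions of the two example risk functionals, with made-up predictions.

def error_rate(F_x, y):
    """(1/m) sum_i 1(F(x_i) != y_i), for a classifier F."""
    return np.mean(F_x != y)

def mean_square_error(f_x, y):
    """(1/m) sum_i (f(x_i) - y_i)^2, for a regressor f."""
    return np.mean((f_x - y) ** 2)

print(error_rate(np.array([1, 1, 1, -1]), np.array([1, -1, 1, 1])))              # -> 0.5
print(mean_square_error(np.array([0.9, 0.2, 1.1]), np.array([1.0, 0.0, 1.0])))   # -> 0.02
```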
How to train?
• Define a risk functional R[f(x, w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)

[Plot: the risk R[f(x, w)] as a function of w over the parameter space, with its minimum at w*.]

(… to be continued in the next lecture)
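A minimal sketch of this recipe, assuming a linear model f(x) = w⋅x and the mean-square-error risk minimized by plain gradient descent; the learning rate and number of steps are arbitrary choices.

```python
import numpy as np

# "Define a risk functional and optimize it w.r.t. w", assuming a linear model
# f(x) = w . x and the mean-square-error risk, minimized by plain gradient descent.

def risk(w, X, y):
    return np.mean((X @ w - y) ** 2)            # R[f(x, w)] = MSE over the training data

def train(X, y, lr=0.1, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE risk w.r.t. w
        w -= lr * grad                          # move towards the minimizer w*
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)
w_star = train(X, y)
print(w_star, risk(w_star, X, y))               # w_star ends up close to w_true
```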
Summary

• With linear threshold units (“neurons”) we can build:
  – Linear discriminants (including Naïve Bayes)
  – Kernel methods
  – Neural networks
  – Decision trees
• The architectural hyper-parameters may include:
  – The choice of basis functions (features)
  – The kernel
  – The number of units
• Learning means fitting:
  – Parameters (weights)
  – Hyper-parameters
  – Be aware of the fit vs. robustness tradeoff
Want to Learn More?
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, the best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book