introduction to statistical learning theory · introduction to statistical learning theory julia...
TRANSCRIPT
![Page 1: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/1.jpg)
Introduction to Statistical Learning Theory
Julia Kempe & David S. Rosenberg
NYU CDS
January 29, 2019
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 1 / 29
![Page 2: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/2.jpg)
Decision Theory: High Level View
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 2 / 29
![Page 3: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/3.jpg)
What types of problems are we solving?
In data science problems, we generally need to:
Make a decisionTake an actionProduce some output
Have some evaluation criterion
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 3 / 29
![Page 4: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/4.jpg)
Actions
De�nition
An action is the generic term for what is produced by our system.
Examples of Actions
Produce a 0/1 classi�cation [classical ML]
Reject hypothesis that θ= 0 [classical Statistics]
Written English text [image captioning, speech recognition, machine translation ]
What's an action for predicting where a storm will be in 3 hours?
What's an action for a self-driving car?
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 4 / 29
![Page 5: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/5.jpg)
Evaluation Criterion
Decision theory is about �nding �optimal� actions, under various de�nitions of optimality.
Examples of Evaluation Criteria
Is classi�cation correct?
Does text transcription exactly match the spoken words?
Should we give partial credit? How?
How far is the storm from the prediction location? [for point prediction]
How likely is the storm's location under the prediction? [for density prediction]
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 5 / 29
![Page 6: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/6.jpg)
Real Life: Formalizing a Business Problem
First two steps to formalizing a problem:1 De�ne the action space (i.e. the set of possible actions)2 Specify the evaluation criterion.
Formalization may evolve gradually, as you understand the problem better
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 6 / 29
![Page 7: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/7.jpg)
Inputs
Most problems have an extra piece, going by various names:
Inputs [ML]
Covariates [Statistics]
Examples of Inputs
A picture
A storm's historical location and other weather data
A search query
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 7 / 29
![Page 8: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/8.jpg)
�Outcomes� or �Output� or �Label�
Inputs often paired with outputs or outcomes or labels
Examples of outcomes/outputs/labels
Whether or not the picture actually contains an animal
The storm's location one hour after query
Which, if any, of suggested the URLs were selected
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 8 / 29
![Page 9: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/9.jpg)
Typical Sequence of Events
Many problem domains can be formalized as follows:
1 Observe input x .
2 Take action a.
3 Observe outcome y .
4 Evaluate action in relation to the outcome (via a loss function`(a,y))
Note
Outcome y is often independent of action a
But this is not always the case:
search result rankingautomated driving
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 9 / 29
![Page 10: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/10.jpg)
Formalization: The Spaces
The Three Spaces:
Input space: X
Action space: A
Outcome space: Y
Concept check:
What are the spaces for linear regression?
What are the spaces for logistic regression?
What are the spaces for a support vector machine?
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 10 / 29
![Page 11: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/11.jpg)
Some Formalization
The Spaces
X: input space Y: outcome space A: action space
Prediction Function (or �decision function�)
A prediction function (or decision function) gets input x ∈X and produces an action a ∈A :
f : X → A
x 7→ f (x)
Loss Function
A loss function evaluates an action in the context of the outcome y .
` : A×Y → R
(a,y) 7→ `(a,y)
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 11 / 29
![Page 12: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/12.jpg)
Evaluating a Prediction Function
Loss function ` evaluates a single action
How to evaluate the prediction function as a whole?
We will use the standard statistical learning theory framework.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 12 / 29
![Page 13: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/13.jpg)
Statistical Learning Theory
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 13 / 29
![Page 14: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/14.jpg)
A Simplifying Assumption
Assume action has no e�ect on the output
includes all traditional prediction problemswhat about stock market prediction?what about stock market investing?
What about fancier problems where this does not hold?
often can be reformulated or �reduced� to problems where it does holdsee literature on reinforcement learning
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 14 / 29
![Page 15: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/15.jpg)
Setup for Statistical Learning Theory
Assume there is a data generating distribution PX×Y.
All input/output pairs (x ,y) are generated i.i.d. from PX×Y.
Want prediction function f (x) that generally �does well on average�:
`(f (x),y) is usually small, in some sense
How can we formalize this?
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 15 / 29
![Page 16: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/16.jpg)
The Risk Functional
De�nition
The risk of a prediction function f : X→A is
R(f ) = E`(f (x),y).
In words, it's the expected loss of f on a new exampe (x ,y) drawn randomly from PX×Y.
Risk function cannot be computed
Since we don't know PX×Y, we cannot compute the expectation.But we can estimate it...
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 16 / 29
![Page 17: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/17.jpg)
The Bayes Prediction Function
De�nition
A Bayes prediction function f ∗ : X→A is a function that achieves the minimal risk amongall possible functions:
f ∗ ∈ argminf
R(f ),
where the minimum is taken over all functions from X to A.
The risk of a Bayes prediction function is called the Bayes risk.
A Bayes prediction function is often called the �target function�, since it's the bestprediction function we can possibly produce.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 17 / 29
![Page 18: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/18.jpg)
Example: Least Squares Regression
Spaces: A= Y= R
Square loss:`(a,y) = (a− y)2
Risk:
R(f ) = E[(f (x)− y)2
](homework =⇒ ) = E
[(f (x)−E[y |x ])2
]+E
[(y −E[y |x ])2
]So Bayes prediction function is
f ∗(x) = E[y |x ]
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 18 / 29
![Page 19: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/19.jpg)
Example 2: Multiclass Classi�cation
Spaces: A= Y= {1, . . . ,k}
0-1 loss:
`(a,y) = 1(a 6= y) :=
{1 if a 6= y
0 otherwise.
Risk:
R(f ) = E [1(f (x) 6= y)] = 0 ·P(f (x) = y)+1 ·P(f (x) 6= y)
= P(f (x) 6= y) ,
which is just the misclassi�cation error rate.
Bayes prediction function is just the assignment to the most likely class:
f ∗(x) ∈ argmax16c6k
P(y = c | x)
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 19 / 29
![Page 20: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/20.jpg)
But we can't compute the risk!
Can't compute R(f ) = E`(f (x),y) because we don't know PX×Y.
One thing we can do in ML/statistics/data science is
assume we have sample data.
Let Dn = ((x1,y1), . . . ,(xn,yn)) be drawn i.i.d. from PX×Y.
Let's draw some inspiration from the Strong Law of Large Numbers:If z ,z1, . . . ,zn are i.i.d. with expected value Ez , then
limn→∞ 1
n
n∑i=1
zi = Ez ,
with probability 1.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 20 / 29
![Page 21: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/21.jpg)
The Empirical Risk
Let Dn = ((x1,y1), . . . ,(xn,yn)) be drawn i.i.d. from PX×Y.
De�nition
The empirical risk of f : X→A with respect to Dn is
R̂n(f ) =1
n
n∑i=1
`(f (xi ),yi ).
By the Strong Law of Large Numbers,
limn→∞ R̂n(f ) = R(f ),
almost surely.
But we want to �nd the f that minimizes R(f ) - will minimizing R̂n(f ) be good enough?
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 21 / 29
![Page 22: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/22.jpg)
Empirical Risk Minimization
We want risk minimizer, is empirical risk minimizer close enough?
De�nition
A function f̂ is an empirical risk minimizer if
f̂ ∈ argminf
R̂n(f ),
where the minimum is taken over all functions.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 22 / 29
![Page 23: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/23.jpg)
Empirical Risk Minimization
PX = Uniform[0,1], Y ≡ 1 (i.e. Y is always 1).
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00x
y
PX×Y.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 23 / 29
![Page 24: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/24.jpg)
Empirical Risk Minimization
PX = Uniform[0,1], Y ≡ 1 (i.e. Y is always 1).
● ● ●
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00x
y
A sample of size 3 from PX×Y.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 24 / 29
![Page 25: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/25.jpg)
Empirical Risk Minimization
PX = Uniform[0,1], Y ≡ 1 (i.e. Y is always 1).
● ● ●
● ● ●● ● ●0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00x
y
A proposed prediction function:
f̂ (x) = 1(x ∈ {0.25,0.5,0.75}) =
{1 if x ∈ {0.25, .5, .75}
0 otherwise
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 25 / 29
![Page 26: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/26.jpg)
Empirical Risk Minimization
PX = Uniform[0,1], Y ≡ 1 (i.e. Y is always 1).
● ● ●
● ● ●● ● ●0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.00x
y
Under square loss or 0/1 loss: f̂ has Empirical Risk = 0 and Risk = 1.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 26 / 29
![Page 27: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/27.jpg)
Empirical Risk Minimization
ERM led to a function f that just memorized the data.
How to spread information or �generalize� from training inputs to new inputs?
Need to smooth things out somehow...
A lot of modeling is about spreading and extrapolating information from one part of theinput space X into unobserved parts of the space.
One approach: �Constrained ERM�
Instead of minimizing empirical risk over all prediction functions,constrain to a particular subset, called a hypothesis space.
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 27 / 29
![Page 28: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/28.jpg)
Hypothesis Spaces
De�nition
A hypothesis space F is a set of functions mapping X→A.
It is the collection of prediction functions we are choosing from.
Want Hypothesis Space that...
Includes only those functions that have desired �regularity�
e.g. smoothness, simplicity
Easy to work with
Example hypothesis spaces?
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 28 / 29
![Page 29: Introduction to Statistical Learning Theory · Introduction to Statistical Learning Theory Julia Kempe & David S. Rosenberg NYU CDS January 29, 2019 Julia Kempe & David S. Rosenberg](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e14db33828aa47b9f53ea43/html5/thumbnails/29.jpg)
Constrained Empirical Risk Minimization
Hypothesis space F, a set of [prediction] functions mapping X→A
Empirical risk minimizer (ERM) in F is
f̂n ∈ argminf∈F
1
n
n∑i=1
`(f (xi ),yi ).
Risk minimizer in F is f ∗F ∈ F , where
f ∗F ∈ argminf∈F
E`(f (x),y).
Julia Kempe & David S. Rosenberg (NYU CDS) DS-GA 1003 / CSCI-GA 2567 January 29, 2019 29 / 29