M697: Analysis and Machine Learning Fall 2017 Nestor Guillen


Page 1

M697: Analysis and Machine Learning
Fall 2017

Nestor Guillen

Page 2

Warmup

What are we looking at? This is a plot of (xi, yi), where

yi = βxi + β0 + 10εi, i = 1, . . . , 50.

xi = 50 pts in [−100, 100], εi = 50 i.i.d. Normal(0, 1)
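For concreteness, here is a minimal sketch of how such data could be generated and plotted with numpy and matplotlib (the values β = 2 and β0 = 5 below are made up for illustration; the slide does not state which values were used):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    beta, beta0 = 2.0, 5.0                 # hypothetical "true" parameters

    x = rng.uniform(-100, 100, size=50)    # 50 points in [-100, 100]
    eps = rng.standard_normal(50)          # 50 i.i.d. Normal(0, 1)
    y = beta * x + beta0 + 10 * eps        # y_i = beta * x_i + beta0 + 10 * eps_i

    plt.scatter(x, y)
    plt.xlabel("x"); plt.ylabel("y")
    plt.show()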

Page 3

Warmup

Problem: You are given (xi, yi), i = 1, . . . , 50, and know it has the structure above, but you don’t know the values of β and β0. What is your best guess for β and β0?

Page 4

Warmup

First Approach: Least Squared Error

Among all affine functions, find f minimizing

J(f) = ∑_{i=1}^N |f(xi) − yi|²

Equivalently: find β and β0 which minimize

J(β, β0) = ∑_{i=1}^N |βxi + β0 − yi|²
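For the warmup data this minimizer can be computed in one line; a sketch using the arrays x and y from the snippet above (np.polyfit with degree 1 solves exactly this least-squares problem):

    import numpy as np

    # x, y as generated in the warmup snippet above
    beta_hat, beta0_hat = np.polyfit(x, y, deg=1)   # minimizes sum |beta*x_i + beta0 - y_i|^2
    print(beta_hat, beta0_hat)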

Page 5

Warmup

Second Approach: Roughness Penalization

Fix λ > 0. Among all functions, find f minimizing

J(f) = ∑_{i=1}^N |f(xi) − yi|² + λ ∫_{−10}^{10} |f′′(x)|² dx
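The minimization here is over all (sufficiently smooth) functions, but a finite-dimensional sketch conveys the idea: represent f by its values on a uniform grid, approximate f(xi) by the value at the nearest grid node, and replace ∫ |f′′|² dx by a sum of squared second differences. The discretization, the grid, and the way data points are matched to nodes are all choices made here for illustration, not anything prescribed by the slides:

    import numpy as np

    def penalized_fit(x, y, lam=1.0, m=201):
        """Discrete sketch of: min_f sum |f(x_i) - y_i|^2 + lam * int |f''|^2 dx."""
        grid = np.linspace(x.min(), x.max(), m)
        h = grid[1] - grid[0]

        # A maps grid values of f to its values at the data points (nearest node).
        idx = np.abs(grid[None, :] - x[:, None]).argmin(axis=1)
        A = np.zeros((len(x), m))
        A[np.arange(len(x)), idx] = 1.0

        # D f approximates f'' at interior nodes; int |f''|^2 dx ~ |D f|^2 / h^3.
        D = np.zeros((m - 2, m))
        for j in range(m - 2):
            D[j, j:j + 3] = [1.0, -2.0, 1.0]

        # Normal equations of the quadratic objective (assumes the data hits
        # at least two distinct grid nodes, so the system is nonsingular).
        M = A.T @ A + (lam / h**3) * D.T @ D
        return grid, np.linalg.solve(M, A.T @ y)

    # Example: grid, f_grid = penalized_fit(x, y, lam=10.0), with x, y from the warmup.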

Page 6

Warmup

Each approach involved searching within a class of functions F for an element f minimizing a functional such as

∑_{i=1}^N |f(xi) − yi|²

or

∑_{i=1}^N |f(xi) − yi|² + λ ∫_{−10}^{10} (f′′(x))² dx

Page 7

Warmup

These are variational problems, that is, we must determine

min_{f∈F} J(f)   and   argmin_{f∈F} J,

and study the properties of the minimizers f .

Page 8

Warmup

Why these functionals J ?

What if instead of the least sum of squares, we had minimized

J(f) = ∑_{i=1}^N |f(xi) − yi|^p,   p ≥ 1 ??

or

J(f) = ∑_{i=1}^N |f(xi) − yi|^p + ∫_{−10}^{10} F(f′′(x)) dx ??

Page 9

Warmup
One more thing!

The data yi are random variables.

Accordingly, J and thus its minimizer(s) f are also random.

In the first case, this means the fitted β̂ and β̂0 are random variables.

Which leads to the question: are β̂ and β̂0 good estimators for β and β0?

Page 10

Demystifying ML
A lie – a reductionistic and useful one

All of the methods in machine learning boil down to solving some variational problem.

(but that’s not what machine learning is about of course, that would be like saying classical physics is all about finding critical points of Lagrangians . . . but there is some value in being aware that classical physics shares this unifying framework)

That being said: determining which classes F and objectives J are most convenient for a given situation is a challenging task, and then there is the issue of which algorithms are most effective in calculating the minimizer(s) f, and then of assessing how good f is at making predictions. . .

Page 11

Setup
What this course aims to cover

1. The basics of statistical inference, the fundamental issues, and basic techniques (particularly linear methods).

2. Real analysis topics of relevance to statistical inference: e.g. the manifold ways of measuring the difference between two functions or measures (norms), the geometric and topological properties of these norms.

3. The calculus of variations, a branch of mathematics that deals with the properties of minimizers of functionals arising in physics, geometry, and optimization.

4. ML methods intrinsically related to nonlinear problems arising in the calculus of variations and vice-versa (manifold learning, spectral clustering, optimal transport).

Page 12

Setup
What this course is NOT

1. A good substitute for an introduction to machine learning or statistical inference.

2. A good substitute for a measure theory course, or a statistics course, or a probability course, or a PDE course.

3. A course where the instructor will display a great command of programming skills.

4. A course with very difficult programming assignments.

5. A course with difficult math problem sets. I mean, define difficult?

6. A course with a lot of mathematical theory that turns out to be of little or no relevance to the practice of ML, predictive modeling, and statistics (well, hopefully).

Page 13

Setup
An aspirational plan for the semester

Part I

1. Linear methods for regression.

2. Linear methods for classification.

3. Unsupervised learning (emphasis on clustering).

4. Selecting & judging learning methods (“Model assessment”).

Page 14

Setup
An aspirational plan for the semester

Part II

1. Some functional analysis + calculus of variations.

2. Basics of geometric measure theory and Γ-convergence.

3. Spectral theory + geometry (graphs & continuum).

4. Manifolds and manifold learning.

5. Optimal Transport and the Monge-Kantorovich distance.

Page 15

Setup
Prerequisites

Minimal requirements:
Undergraduate real analysis (basics of metric spaces, integration), basic probability (distributions, random variables, law of large numbers), and a strong background in calculus and linear algebra. Basic programming skills, ideally Python or C++, but other languages are ok (R or Matlab should be fine too). I will be using Python.

Ideal requirements:
Familiarity with one or more of the following topics: measure theory, differentiable manifolds, linear programming, spectral graph theory, calculus of variations, functional analysis, and partial differential equations.

Page 16

Setup
Some notation for this semester

Euclidean Space and linear algebra
As usual, R = the set of real numbers

Rp = {(x1, . . . , xp) | xi ∈ R}

Elements of Rp will be denoted simply by x, hopefully without causing too much confusion. Matrices, that is, linear operators Rp → Rq, will be denoted by a capital letter. We will also use the following notation

Mx, x · y, Mx · y, (Mx, y)

Page 17

Setup
Some notation for this semester

A Probability Space is three things (Ω,Σ, µ):

A set Ω, a σ-algebra Σ, and a measure µ with µ(Ω) = 1.

Page 18

Setup
Some notation for this semester

Conditional Probability:

P(A | B) := P(A ∩ B) / P(B)

We say that A is independent of B if

P(A | B) = P(A)

That is, if

P(A ∩B) = P(A)P(B).

Page 19

Setup
Some notation for this semester

Random Variables:
A random variable is a function

Ω → X

If X is the real numbers, we say we have a real random variable. Likewise, we can have a random vector or random matrix, and so on. Most of the time, the random variables one deals with are real-valued.

Page 20

Setup
Some notation for this semester

Random variables, independence of random variables.

What we mean by “sampling from a distribution”, theoretically and practically.

Page 21

Setup
Some notation for this semester

The Law of Large Numbers
Let X1, X2, . . . be a sequence of i.i.d. real random variables. Suppose also that the second moment of the Xi is finite.

Then, if µ denotes their common mean, we have

lim_{N→∞} (1/N) ∑_{i=1}^N Xi = µ   with probability 1
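A quick numerical illustration (the Exponential(1) distribution below is an arbitrary choice; its mean is µ = 1):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.exponential(scale=1.0, size=100_000)    # i.i.d. samples with mean mu = 1

    running_mean = np.cumsum(X) / np.arange(1, len(X) + 1)
    print(running_mean[[9, 99, 9_999, 99_999]])     # approaches 1 as N grows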

Page 22

Setup
Python

If you decide to use Python (which I would advise for this course), you should acquaint yourself with the following libraries:

• matplotlib

• scipy

• numpy

• scikit-learn

Page 23

Congratulations, you’ve found an easter egg. To prove you’ve scrolled this far down on the slides on the first two weeks of classes, send me an email with the subject line “Elementary, dear Data!”

Page 24

All right, now we can finally start!

Page 25

The basic problems

Problem 1: Supervised Learning
There is an unknown function

f : X → Y

between two sets X and Y, the values of which are given to us for some large but finite set {x1, x2, . . . , xN} ⊂ X.

Then, based on the data f(xi) = yi for i = 1, 2, . . . , N, and an input x ∈ X, one seeks a rule to estimate or guess the value f(x).

Page 26

The basic problems

Problem 2: Unsupervised learning
You are given a set of points {x1, . . . , xN} ⊂ Rp.

You are told these points were sampled from some distribution µ, and you must determine as much about µ as possible.

e.g. is µ a normal? is the support of the distribution connected? is the distribution supported along a curve?

Page 27

The basic problems

These two problems, as stated, are (technically) hopeless.

In practice, there is way more structure that often makes these problems well posed. A few examples:

• We know that f : X → Y lies in, or is well approximated by, a special class of functions (linear functions, quadratic functions, spectrally constrained functions, and so on).

• The observed pairs (xi, yi) may be random variables distributed via some probability distribution. In this case one may choose f(x) so as to minimize the expected value of |f(x) − y|² or |f(x) − y|.

• The nature of the problem is such that f is reasonably represented via a multi-layer neural net.

Page 28

The basic problems

Let us see, for instance, how searching within linear functions makes the first problem better posed.

Page 29

Least Squares

We are given training data

(xi, yi) i = 1, 2, . . . , N

where X = Rp and Y = R.

Attempt #1 (Linear Fit): find β ∈ Rp and β0 ∈ R such that

xi · β + β0 = yi, i = 1, . . . , N.

Then, given x ∈ Rp, we let

f(x) = x · β + β0

Page 30

Least Squares

We are given training data

(xi, yi) i = 1, 2, . . . , N

where X = Rp and Y = R.

Attempt #2 (Least Squares): find β ∈ Rp and β0 ∈ R such that

RSS(β, β0) = ∑_{i=1}^N |xi · β + β0 − yi|²

is minimized. Then, given x ∈ Rp, we let

f(x) = x · β + β0

Page 31

Least Squares

Let X be the linear transformation defined by

Xβ := (x1 · β, . . . , xN · β)

and let y denote the vector (y1, . . . , yN ).

Then, RSS(β) is simply the squared length of the vector

Xβ − y

In particular, RSS(β) is a convex function of β.

Page 32

Least Squares

Using the chain rule, it follows that

∇RSS(β) = 2Xt(Xβ − y)

Therefore, the following equation yields the minimizer(s)

XtXβ = Xty

If the matrix XtX happens to be invertible, we have

β = (XtX)−1Xty

In this way, the data (X, y) yields a best guess β (and thus an f).
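A sketch in numpy on synthetic data (the dimensions and the true β below are made up for illustration); in practice one would call np.linalg.lstsq rather than form the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(2)
    N, p = 200, 3
    X = rng.standard_normal((N, p))                   # rows are the x_i
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.standard_normal(N)

    # Solve the normal equations  X^t X beta = X^t y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, and numerically preferable:
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat, beta_lstsq)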

Page 33

Least Squares

Now, if we are given a new input x, our best guess for f(x) is

f(x) = x · β.

This is still just a guess, and there are a number of ways of estimating how reliable it is.

Page 34

Least Squares
Example: Classification

We are given two finite sets of points in the plane, Red and Blue, whose union is denoted by S = {x1, . . . , xN}.

Consider the function f : S → R given by

f(xi) = 1 if xi ∈ Red,   −1 if xi ∈ Blue

Let us do a linear regression on (xi, yi), i = 1, . . . , N, where yi = f(xi).

Page 35

Least Squares
Example: Classification

Then, given x ∈ R2, we proceed as follows

f(x) > 0: we classify x as Red
f(x) ≤ 0: we classify x as Blue

This is an example of linear classification.
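A small sketch of this classifier on synthetic data (the two Gaussian clouds are a made-up example): fit least squares on ±1 labels, then threshold at 0.

    import numpy as np

    rng = np.random.default_rng(3)

    red = rng.standard_normal((100, 2)) + np.array([2.0, 2.0])     # made-up Red cloud
    blue = rng.standard_normal((100, 2)) + np.array([-2.0, -2.0])  # made-up Blue cloud

    X = np.vstack([red, blue])
    X1 = np.column_stack([np.ones(len(X)), X])        # prepend 1 for the intercept
    y = np.concatenate([np.ones(100), -np.ones(100)])

    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # least squares on +/-1 labels

    def classify(x):
        """Red if the fitted affine function is positive at x, Blue otherwise."""
        return "Red" if np.dot(np.concatenate(([1.0], x)), beta) > 0 else "Blue"

    print(classify(np.array([1.5, 2.5])), classify(np.array([-3.0, -1.0])))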

Page 36

Least Squares
Example: Classification

I mean, what could go wrong?

Page 37

Least Squares
Example: Classification

I mean, what could go wrong?

Page 38

Least Squares
Example: Classification

I mean, what could go wrong?

Page 39

Least Squares
Example: Classification

I mean, what could go wrong?
Many modes/clusters may make it hard to have a single linear classifier.

Page 40

Last but not least, a preview
(of the second half of the semester)

First: Calculus of variations and elliptic equations

Consider the functional

J(u) := (1/2) ∫_D |∇u|² dx + ∫_D f(x) u(x) dx

Minimizers of this problem solve Poisson’s equation

∆u = f in D

Page 41

Last but not least, a preview
(of the second half of the semester)

First: Calculus of variations and elliptic equations

What if instead we now had

J(u) := λ ∫_D |∇u|² dx + ∫_D (f(x) − u(x))² dx

Minimizers of this problem instead solve

−λ∆u + u = f in D

Page 42

Last but not least, a preview
(of the second half of the semester)

Second: The Perimeter and Plateau’s problem

The perimeter of a set E ⊂ Rd is defined as

Per(E) = ∫ |∇χE| dx

where

∫_D |∇χE| dx = sup { ∫_E div(g(x)) dx  :  g : D → Rd, ‖g‖∞ ≤ 1 }

It was Ennio De Giorgi who first realized this was an accurate and useful way of expressing the surface area of a set.

Page 43

Last but not least, a preview
(of the second half of the semester)

Second: The Perimeter and Plateau’s problem

A useful functional when analyzing images is

λ Per(E) + ∫ |f(x) − χE(x)|² dx

Here f : [0, 1] × [0, 1] → R represents a grayscale image.

Problems involving the minimization of Per(E) lead to minimal surfaces, and their study has led to the development of the calculus of variations, differential geometry, and measure theory.

Page 44

All right, that’s all for the first class.

Don’t worry, a whole semester of Data awaits us!

Page 45

M 697
Today:

Least squares, k-NN, and some probability

Nestor Guillen

Page 46

Least Squares

We considered the situation where we were given data

x1, . . . , xN ⊂ Rp, y1, . . . , yN ⊂ R,

and wanted to minimize, over β ∈ Rp and β0 ∈ R,

RSS(β, β0) = ∑_{i=1}^N |xi · β + β0 − yi|².

Page 47

Least Squares

We may augment every xi ∈ Rp with a dummy 0-th coordinate,

x̃i = (1, xi1, xi2, . . . , xip)

so that xi · β + β0 = x̃i · β̃, where β̃ = (β0, β1, β2, . . . , βp).

Thus, relabeling the augmented x̃i and β̃ back to xi and β, it suffices to think of the problem of minimizing

RSS(β) = ∑_{i=1}^N |xi · β − yi|².
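In code, the augmentation is just a column of ones prepended to the data matrix; a short sketch (the data below are made up):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.standard_normal((50, 2))                      # any (N, p) input matrix
    y = X @ np.array([1.0, -1.0]) + 3.0                   # made-up data with intercept 3

    X_aug = np.column_stack([np.ones(len(X)), X])         # dummy 0-th coordinate equal to 1
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # (beta0, beta1, ..., betap)
    print(beta_aug)                                       # approximately [3, 1, -1]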

Page 48

Least Squares

From the data, one may define a linear map / N × p matrix

X : Rp → RN

given by Xβ := (x1 · β, . . . , xN · β), as well as a vector y ∈ RN

y = (y1, . . . , yN ).

minimize RSS(β) = ‖Xβ − y‖₂² over β ∈ Rp.

Page 49

Least Squares

∇RSS(β) = 2Xt (Xβ − y)

A vector β achieves the minimum if and only if it solves

Xt (Xβ − y) = 0

equivalently, if it solves

(XtX)β = Xty

Page 50

Least Squares
A quick linear algebra refresher

What does the transpose do? If M is an n×m matrix, then Mt is the m×n matrix with entries

(Mt)ij = Mji

the rows of one are the columns of the other.

An equivalent and very useful definition is given by the identity

(Mx) · y = x · (Mt y)   ∀ x ∈ Rm, y ∈ Rn.

Page 51

Least Squares
A quick linear algebra refresher

This is why, if we think of M as

M : Rm → Rn

then Mt corresponds to a map in the opposite direction

Mt : Rn → Rm

Page 52

Least Squares
XtX

Then, back to our equation

(XtX)β = Xty.

We see that XtX : Rp → Rp. Is this matrix invertible?

Observe that (XtX)β = 0 ⇒ (XtX)β · β = 0.

In this case, however,

0 = (XtX)β · β = Xt(Xβ) · β = Xβ · Xβ = |Xβ|²

Page 53

Least Squares
XtX

In other words (XtX)β = 0⇔ Xβ = 0.

Now, what does Xβ = 0 mean? Well, this is the same as

x1 · β = x2 · β = . . . = xN · β = 0

That is, β is orthogonal to all of the vectors x1, . . . , xN.

Conclusion:
XtX is invertible ⇔ x1, . . . , xN span Rp.

Page 54

Least Squares
X(XtX)−1Xt

Let us then consider only cases where x1, . . . , xN span Rp.

Then, the least squares solution β is unique and given by

β = (XtX)−1Xty

and the respective best linear fit for y is the vector

ŷ = Xβ = X(XtX)−1Xt y

Page 55

Least Squares
X(XtX)−1Xt

Note: The N ×N matrix,

P = X(XtX)−1Xt

is the orthogonal projection onto the image of the map X.

This means that

Pt = P, P2 = P, Im(P) = Im(X)
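These properties are easy to check numerically; a quick sketch with a random X whose columns are linearly independent (so that XtX is invertible):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal((30, 4))            # generic X, so X^t X is invertible

    P = X @ np.linalg.inv(X.T @ X) @ X.T        # orthogonal projection onto Im(X)

    print(np.allclose(P, P.T))                  # P^t = P
    print(np.allclose(P @ P, P))                # P^2 = P
    print(np.allclose(P @ X, X))                # P fixes Im(X)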

Page 56

Least Squares
A warning about (XtX)−1

An important issue in practice is when XtX is close to being singular, which may mean that

(XtX)−1

is too large, creating all kinds of computational issues.

Question: What needs to happen for XtX to be close to being non-invertible? How can those situations be avoided?
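One simple diagnostic is the condition number of XtX; a sketch (the nearly duplicated column below is a made-up example of what causes trouble):

    import numpy as np

    rng = np.random.default_rng(6)
    a = rng.standard_normal(100)
    X = np.column_stack([a, a + 1e-6 * rng.standard_normal(100)])  # nearly collinear columns

    print(np.linalg.cond(X.T @ X))   # huge condition number: (X^t X)^{-1} is badly behaved

    # Safer alternatives: np.linalg.lstsq / np.linalg.pinv, or regularization (ridge),
    # rather than forming (X^t X)^{-1} explicitly.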

Page 57

Nearest Neighbor

Let us move on to a different method: k-Nearest Neighbors

Page 58

k-Nearest Neighbor

As before, we start from a training data set

x1, . . . , xN ⊂ Rp, y1, . . . , yN ⊂ R

and assuming this corresponds to y = f(x), we estimate f .

Fix k ∈ N, then the k-Nearest Neighbor algorithm returns

f(x) = (1/k) ∑_{xi ∈ Nk(x)} yi

where

Nk(x) = {xi | xi is among the k data points closest to x}
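A minimal numpy sketch of this rule, with made-up training data and brute-force distance computation (which is exactly the cost issue flagged a few slides later):

    import numpy as np

    def knn_regress(x, X_train, y_train, k=5):
        """Average the y-values of the k training points closest to x."""
        d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
        nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
        return y_train[nearest].mean()

    rng = np.random.default_rng(7)
    X_train = rng.uniform(-1, 1, size=(200, 2))
    y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.standard_normal(200)
    print(knn_regress(np.array([0.2, -0.4]), X_train, y_train, k=5))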

Page 59

k-Nearest Neighbor
Voronoi Diagrams and 1-NN

The case k = 1 corresponds to simply taking

f(x) = f(xi), where xi is closest to x.

Note that in any case, f is a piecewise constant function.

The “level sets” of f are given by the Voronoi Diagram of the set {x1, . . . , xN}.
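scipy can draw such a diagram directly; a short sketch for a random planar point set:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial import Voronoi, voronoi_plot_2d

    rng = np.random.default_rng(8)
    pts = rng.uniform(0, 1, size=(30, 2))   # the training inputs x_1, ..., x_N

    vor = Voronoi(pts)                      # each cell = region where that x_i is nearest
    voronoi_plot_2d(vor)
    plt.show()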

Page 60

k-Nearest Neighbor
Voronoi Diagrams and 1-NN

Page 61

k-Nearest Neighbors

Nk(x) = {xi | xi is among the k data points closest to x}

Question: What if there are points xi and xj equidistant to x?

Page 62

k-Nearest Neighbors
Back to the classification example

Assume the data came from a classification problem with 2 classes, labeled yi = 1 (class 1) and yi = 0 (class 2), and set

f(x) := (1/k) ∑_{xi ∈ Nk(x)} yi

Since x is classified in class 1 or 2 according to whether f(x) > 0.5 or f(x) ≤ 0.5, the rule k-NN provides is the following:

Over half of Nk(x) lies in class 1 ⇒ x ∈ Class 1

At most half of Nk(x) lies in class 1 ⇒ x ∈ Class 2

Page 63

k-Nearest Neighbors

Important:
Especially in high dimensions, determining Nk(x) (even for k = 1) may entail a prohibitive computation time!

(we’ll discuss this further later today and in next class)

Page 64

Least Squares Versus k-Nearest Neighbor

What approach is best here?
Least Squares, or k-nearest neighbor?

The answer depends on how the data was generated.

For example:

If the data arose from two multivariate Gaussians: least squares.
If the data arose from many highly localized Gaussians: k-NN.

Page 65

Expected Prediction Error

Statistical Modeling can help us navigate and judge various methods to analyze data.

Page 66

Expected Prediction Error

Model
We have two random variables X and Y taking values in Rd and R, respectively.

We seek f : Rd → R which aims to capture the deterministic part of the relationship between X and Y.

Statistical Criterion:
Choose f so that it minimizes

E[|f(X) − Y|²]

This is known as the Expected Prediction Error (EPE).

Page 67

Expected Prediction Error

Recall the total expectation formula

E[|f(X)− Y |2] = E[E[|f(X)− Y |2 | X]]

In other words,

E[|f(X)− Y |2] = E[g(X)],

where

g(x) = E[|f(X)− Y |2 | X = x]

Page 68

Expected Squared Prediction Error

Minimizing g(x) pointwise in x, this leads to the best answer, which is given by

f(x) = E[Y | X = x]

This is often known as the statistical regression function.

Page 69

Expected Squared Prediction Error

What about a measure other than mean square error?

If one instead decides to minimize not the square but the absolute value,

E[|f(X) − Y|]

then, arguing as above, one is led to

f(x) = median[Y | X = x]

Page 70

Expected Squared Prediction Error

What about a measure other than mean square error?

Most generally, one could consider a loss function (or error function) L : R × R → R. Then, we want to minimize

E[L(Y, f(X))]

and the respective regression function would be defined as

f(x) := argmin_{g ∈ R} E[L(Y, g) | X = x]

Page 71

Expected Squared Prediction Error

What about a measure other than mean square error?

In the context of classification, a popular loss function is the zero-one loss.

The Bayes classifier is the associated regression function:

f(x) := argmax_g P[Y = g | X = x]

Page 72

Expected Squared Prediction Error
The return of Least Squares and k-NN

How may we estimate E[Y | X = x]?

As it turns out, both Least Squares and k-NN are effectively approximating this conditional expectation – under various regimes and assumptions, that is.

Page 73

Expected Squared Prediction Error

k-NN.

Fix k, and let

f(x) := (1/k) ∑_{xi ∈ Nk(x)} yi

Under rather mild assumptions*, one can show that as the sample size N → ∞ (with k → ∞ suitably slowly), we have

f(x)→ E[Y | X = x]

(*Recall the Law of Large Numbers!)

Page 74

Expected Squared Prediction Error

Least Squares.

Going back to

EPE(f) := E[|f(X)− Y |2]

Assume that the minimizer f is linear, or well approximated by a linear function. Then, it makes sense to find the minimizer of EPE(f) within the class of linear functions.

Page 75

Expected Squared Prediction Error

Least Squares.

In this case, let us write

fβ(x) = β · x

and seek the minimizer of

β ↦ EPE(fβ), β ∈ Rd.

Page 76

Expected Squared Prediction Error

Least Squares.

This is a convex functional in β, so the minimizer is given by

∇βEPE(fβ) = 0

That is

E[(X · β − Y )X] = 0

Which may be rewritten as

E[X ⊗ X] β = E[Y X]

Page 77

Expected Squared Prediction Error

Least Squares is back!

If we replace the expectations by averages over the training data, then the solution of

E[X ⊗ X] β = E[Y X]

corresponds to the least squares solution!

The moral of the story: both k-NN and Least Squares correspond to the minimization of the Expected Squared Prediction Error, the former when we expect the regression function to be well approximated locally by constants, and the latter when minimizing within affine functions only.

Page 78

The first week, in one slide

1. Learning problems: supervised and unsupervised.

2. Supervised learning: regression (quantitative) and classification (categorical).

3. ML problems lead naturally to variational problems.

4. Least Squares yields a regression method that is rigid.

5. k-Nearest Neighbors is a regression method which is flexible, but less stable than least squares.

6. Learning methods/models involve parameters that quantify their complexity. One must decide what degree of variance, bias, and stability/rigidity is best for a given problem.

7. Given a statistical model for our data, minimization of EPEleads (under various regimes) to Least Squares or k-NN.