M697: Analysis and Machine Learning Fall 2017 Nestor Guillen


Page 1

M697: Analysis and Machine Learning
Fall 2017

Nestor Guillen

Page 2

Warmup

What are we looking at? This is a plot of (xi, yi), where

yi = βxi + β0 + 10εi, i = 1, . . . , 50.

xi = 50 pts in [−100, 100], εi = 50 i.i.d. Normal(0, 1)
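For concreteness, here is a minimal sketch of how such data could be generated and plotted with numpy and matplotlib (the values β = 2 and β0 = 5 below are made up for illustration; the slide does not state which values were used):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    beta, beta0 = 2.0, 5.0                 # hypothetical "true" parameters

    x = rng.uniform(-100, 100, size=50)    # 50 points in [-100, 100]
    eps = rng.standard_normal(50)          # 50 i.i.d. Normal(0, 1)
    y = beta * x + beta0 + 10 * eps        # y_i = beta * x_i + beta0 + 10 * eps_i

    plt.scatter(x, y)
    plt.xlabel("x"); plt.ylabel("y")
    plt.show()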

Page 3

Warmup

Problem: You are given (xi, yi), i = 1, . . . , 50, and know it has the structure above, but you don’t know the values of β and β0. What is your best guess for β and β0?

Page 4

Warmup

First Approach: Least Squared Error

Among all affine functions, find f minimizing

J(f) = ∑_{i=1}^N |f(xi) − yi|²

Equivalently: find β and β0 which minimize

J(β, β0) = ∑_{i=1}^N |βxi + β0 − yi|²
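For the warmup data this minimizer can be computed in one line; a sketch using the arrays x and y from the snippet above (np.polyfit with degree 1 solves exactly this least-squares problem):

    import numpy as np

    # x, y as generated in the warmup snippet above
    beta_hat, beta0_hat = np.polyfit(x, y, deg=1)   # minimizes sum |beta*x_i + beta0 - y_i|^2
    print(beta_hat, beta0_hat)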

Page 5

Warmup

Second Approach: Roughness Penalization

Fix λ > 0. Among all functions, find f minimizing

J(f) = ∑_{i=1}^N |f(xi) − yi|² + λ ∫_{−10}^{10} |f′′(x)|² dx
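The minimization here is over all (sufficiently smooth) functions, but a finite-dimensional sketch conveys the idea: represent f by its values on a uniform grid, approximate f(xi) by the value at the nearest grid node, and replace ∫ |f′′|² dx by a sum of squared second differences. The discretization, the grid, and the way data points are matched to nodes are all choices made here for illustration, not anything prescribed by the slides:

    import numpy as np

    def penalized_fit(x, y, lam=1.0, m=201):
        """Discrete sketch of: min_f sum |f(x_i) - y_i|^2 + lam * int |f''|^2 dx."""
        grid = np.linspace(x.min(), x.max(), m)
        h = grid[1] - grid[0]

        # A maps grid values of f to its values at the data points (nearest node).
        idx = np.abs(grid[None, :] - x[:, None]).argmin(axis=1)
        A = np.zeros((len(x), m))
        A[np.arange(len(x)), idx] = 1.0

        # D f approximates f'' at interior nodes; int |f''|^2 dx ~ |D f|^2 / h^3.
        D = np.zeros((m - 2, m))
        for j in range(m - 2):
            D[j, j:j + 3] = [1.0, -2.0, 1.0]

        # Normal equations of the quadratic objective (assumes the data hits
        # at least two distinct grid nodes, so the system is nonsingular).
        M = A.T @ A + (lam / h**3) * D.T @ D
        return grid, np.linalg.solve(M, A.T @ y)

    # Example: grid, f_grid = penalized_fit(x, y, lam=10.0), with x, y from the warmup.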

Page 6

Warmup

Each approach involved searching within a class of functions F for an element f minimizing a functional such as

∑_{i=1}^N |f(xi) − yi|²

or

∑_{i=1}^N |f(xi) − yi|² + λ ∫_{−10}^{10} (f′′(x))² dx

Page 7

Warmup

These are variational problems, that is, we must determine

min_{f∈F} J(f)   and   argmin_{f∈F} J,

and study the properties of the minimizers f .

Page 8

Warmup

Why these functionals J ?

What if instead of the least sum of squares, we had minimized

J(f) = ∑_{i=1}^N |f(xi) − yi|^p,   p ≥ 1 ??

or

J(f) = ∑_{i=1}^N |f(xi) − yi|^p + ∫_{−10}^{10} F(f′′(x)) dx ??

Page 9

Warmup
One more thing!

The data yi are random variables.

Accordingly, J and thus its minimizer(s) f are also random.

In the first case, this means the fitted β̂ and β̂0 are random variables.

Which leads to the question: are β̂ and β̂0 good estimators for β and β0?

Page 10

Demystifying ML
A lie – a reductionistic and useful one

All of the methods in machine learning boil down to solving some variational problem.

(but that’s not what machine learning is about of course, that would be like saying classical physics is all about finding critical points of Lagrangians . . . but there is some value in being aware that classical physics shares this unifying framework)

That being said: determining which classes F and objectives J are most convenient for a given situation is a challenging task, and then there is the issue of which algorithms are most effective in calculating the minimizer(s) f, and then of assessing how good f is at making predictions. . .

Page 11

Setup
What this course aims to cover

1. The basics of statistical inference, the fundamental issues, and basic techniques (particularly linear methods).

2. Real analysis topics of relevance to statistical inference: e.g. the manifold ways of measuring the difference between two functions or measures (norms), the geometric and topological properties of these norms.

3. The calculus of variations, a branch of mathematics that deals with the properties of minimizers of functionals arising in physics, geometry, and optimization.

4. ML methods intrinsically related to nonlinear problems arising in the calculus of variations and vice-versa (manifold learning, spectral clustering, optimal transport).

Page 12

Setup
What this course is NOT

1. A good substitute for an introduction to machine learning or statistical inference.

2. A good substitute for a measure theory course, or a statistics course, or a probability course, or a PDE course.

3. A course where the instructor will display a great command of programming skills.

4. A course with very difficult programming assignments.

5. A course with difficult math problem sets. I mean, define difficult?

6. A course with a lot of mathematical theory that turns out to be of little or no relevance to the practice of ML, predictive modeling, and statistics (well, hopefully).

Page 13

Setup
An aspirational plan for the semester

Part I

1. Linear methods for regression.

2. Linear methods for classification.

3. Unsupervised learning (emphasis on clustering).

4. Selecting & judging learning methods (“Model assessment”).

Page 14

Setup
An aspirational plan for the semester

Part II

1. Some functional analysis + calculus of variations.

2. Basics of geometric measure theory and Γ-convergence.

3. Spectral theory + geometry (graphs & continuum).

4. Manifolds and manifold learning.

5. Optimal Transport and the Monge-Kantorovich distance.

Page 15

Setup
Prerequisites

Minimal requirements:
Undergraduate real analysis (basics of metric spaces, integration), basic probability (distributions, random variables, law of large numbers), and a strong background in calculus and linear algebra. Basic programming skills, ideally Python or C++, but other languages are ok (R or Matlab should be fine too). I will be using Python.

Ideal requirements:
Familiarity with one or more of the following topics: measure theory, differentiable manifolds, linear programming, spectral graph theory, calculus of variations, functional analysis, and partial differential equations.

Page 16

Setup
Some notation for this semester

Euclidean Space and linear algebra
As usual, R = the set of real numbers

Rp = {(x1, . . . , xp) | xi ∈ R}

Elements of Rp will be denoted simply by x, hopefully without causing too much confusion. Matrices, that is, linear operators Rp → Rq, will be denoted by a capital letter. We will also use the following notation

Mx, x · y, Mx · y, (Mx, y)

Page 17

Setup
Some notation for this semester

A Probability Space is three things (Ω,Σ, µ):

A set Ω, a σ-algebra Σ, and a measure µ with µ(Ω) = 1.

Page 18

Setup
Some notation for this semester

Conditional Probability:

P(A | B) := P(A ∩ B) / P(B)

We say that A is independent of B if

P(A | B) = P(A)

That is, if

P(A ∩B) = P(A)P(B).

Page 19

Setup
Some notation for this semester

Random Variables:
A random variable is a function

Ω → X

If X is the real numbers, we say we have a real random variable. Likewise, we can have a random vector or random matrix, and so on. Most of the time, the random variables one deals with are real-valued.

Page 20

Setup
Some notation for this semester

Random variables, independence of random variables.

What we mean by “sampling from a distribution”, theoretically and practically.

Page 21

Setup
Some notation for this semester

The Law of Large Numbers
Let X1, X2, . . . be a sequence of i.i.d. real random variables. Suppose also that the second moment of the Xi is finite.

Then, if µ denotes their common mean, we have

lim_{N→∞} (1/N) ∑_{i=1}^N Xi = µ   with probability 1
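A quick numerical illustration (the Exponential(1) distribution below is an arbitrary choice; its mean is µ = 1):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.exponential(scale=1.0, size=100_000)    # i.i.d. samples with mean mu = 1

    running_mean = np.cumsum(X) / np.arange(1, len(X) + 1)
    print(running_mean[[9, 99, 9_999, 99_999]])     # approaches 1 as N grows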

Page 22

Setup
Python

If you decide to use Python (which I would advise for this course), you should acquaint yourself with the following libraries:

• matplotlib

• scipy

• numpy

• scikit-learn

Page 23

Congratulations, you’ve found an easter egg. To prove you’ve scrolled this far down on the slides on the first two weeks of classes, send me an email with the subject line “Elementary, dear Data!”

Page 24

All right, now we can finally start!

Page 25

The basic problems

Problem 1: Supervised Learning
There is an unknown function

f : X → Y

between two sets X and Y, the values of which are given to us for some large but finite set {x1, x2, . . . , xN} ⊂ X.

Then, based on the data f(xi) = yi for i = 1, 2, . . . , N, and an input x ∈ X, one seeks a rule to estimate or guess the value f(x).

Page 26

The basic problems

Problem 2: Unsupervised learning
You are given a set of points {x1, . . . , xN} ⊂ Rp.

You are told these points were sampled from some distribution µ, and you must determine as much about µ as possible.

e.g. is µ a normal? is the support of the distribution connected? is the distribution supported along a curve?

Page 27

The basic problems

These two problems, as stated, are (technically) hopeless.

In practice, there is way more structure that often makes these problems well posed. A few examples:

• We know that f : X → Y lies in, or is well approximated by, a special class of functions (linear functions, quadratic functions, spectrally constrained functions, and so on).

• The observed pairs (xi, yi) may be random variables distributed via some probability distribution. In this case one may choose f(x) so as to minimize the expected value of |f(x) − y|² or |f(x) − y|.

• The nature of the problem is such that f is reasonably represented via a multi-layer neural net.

Page 28

The basic problems

Let us see, for instance, how searching within linear functions makes the first problem better posed.

Page 29

Least Squares

We are given training data

(xi, yi) i = 1, 2, . . . , N

where X = Rp and Y = R.

Attempt #1 (Linear Fit): find β ∈ Rp and β0 ∈ R such that

xi · β + β0 = yi, i = 1, . . . , N.

Then, given x ∈ Rp, we let

f(x) = x · β + β0

Page 30

Least Squares

We are given training data

(xi, yi) i = 1, 2, . . . , N

where X = Rp and Y = R.

Attempt #2 (Least Squares): find β ∈ Rp and β0 ∈ R such that

RSS(β, β0) = ∑_{i=1}^N |xi · β + β0 − yi|²

is minimized. Then, given x ∈ Rp, we let

f(x) = x · β + β0

Page 31

Least Squares

Let X be the linear transformation defined by

Xβ := (x1 · β, . . . , xN · β)

and let y denote the vector (y1, . . . , yN ).

Then, RSS(β) is simply the squared length of the vector

Xβ − y

In particular, RSS(β) is a convex function of β.

Page 32

Least Squares

Using the chain rule, it follows that

∇RSS(β) = 2Xt(Xβ − y)

Therefore, the following equation yields the minimizer(s)

XtXβ = Xty

If the matrix XtX happens to be invertible, we have

β = (XtX)−1Xty

In this way, the data (X, y) yields a best guess β (and thus an f).
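A sketch in numpy on synthetic data (the dimensions and the true β below are made up for illustration); in practice one would call np.linalg.lstsq rather than form the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(2)
    N, p = 200, 3
    X = rng.standard_normal((N, p))                   # rows are the x_i
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.standard_normal(N)

    # Solve the normal equations  X^t X beta = X^t y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, and numerically preferable:
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat, beta_lstsq)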

Page 33

Least Squares

Now, if we are given a new input x, our best guess for f(x) is

f(x) = x · β.

This is still just a guess, and there are a number of ways of estimating how reliable it is.

Page 34

Least Squares
Example: Classification

We are given two finite sets of points in the plane, Red and Blue, whose union is denoted by S = {x1, . . . , xN}.

Consider the function f : S → R given by

f(xi) = 1 if xi ∈ Red,   −1 if xi ∈ Blue

Let us do a linear regression on (xi, yi), i = 1, . . . , N, where yi = f(xi).

Page 35

Least Squares
Example: Classification

Then, given x ∈ R2, we proceed as follows

f(x) > 0: we classify x as Red
f(x) ≤ 0: we classify x as Blue

This is an example of linear classification.
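A small sketch of this classifier on synthetic data (the two Gaussian clouds are a made-up example): fit least squares on ±1 labels, then threshold at 0.

    import numpy as np

    rng = np.random.default_rng(3)

    red = rng.standard_normal((100, 2)) + np.array([2.0, 2.0])     # made-up Red cloud
    blue = rng.standard_normal((100, 2)) + np.array([-2.0, -2.0])  # made-up Blue cloud

    X = np.vstack([red, blue])
    X1 = np.column_stack([np.ones(len(X)), X])        # prepend 1 for the intercept
    y = np.concatenate([np.ones(100), -np.ones(100)])

    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # least squares on +/-1 labels

    def classify(x):
        """Red if the fitted affine function is positive at x, Blue otherwise."""
        return "Red" if np.dot(np.concatenate(([1.0], x)), beta) > 0 else "Blue"

    print(classify(np.array([1.5, 2.5])), classify(np.array([-3.0, -1.0])))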

Page 36

Least Squares
Example: Classification

I mean, what could go wrong?

Page 37

Least Squares
Example: Classification

I mean, what could go wrong?

Page 38

Least Squares
Example: Classification

I mean, what could go wrong?

Page 39

Least Squares
Example: Classification

I mean, what could go wrong?
Many modes/clusters may make it hard to have a single linear classifier.

Page 40

Last but not least, a preview
(of the second half of the semester)

First: Calculus of variations and elliptic equations

Consider the functional

J(u) := (1/2) ∫_D |∇u|² dx + ∫_D f(x) u(x) dx

Minimizers of this problem solve Poisson’s equation

∆u = f in D

Page 41

Last but not least, a preview
(of the second half of the semester)

First: Calculus of variations and elliptic equations

What if instead we now had

J(u) := λ ∫_D |∇u|² dx + ∫_D (f(x) − u(x))² dx

Minimizers of this problem instead solve

−λ∆u + u = f in D

Page 42

Last but not least, a preview
(of the second half of the semester)

Second: The Perimeter and Plateau’s problem

The perimeter of a set E ⊂ Rd is defined as

Per(E) = ∫ |∇χE| dx

where

∫_D |∇χE| dx = sup { ∫_E div(g(x)) dx  :  g : D → Rd, ‖g‖∞ ≤ 1 }

It was Ennio De Giorgi who first realized this was an accurate and useful way of expressing the surface area of a set.

Page 43

Last but not least, a preview
(of the second half of the semester)

Second: The Perimeter and Plateau’s problem

A useful functional when analyzing images is

λ Per(E) + ∫ |f(x) − χE(x)|² dx

Here f : [0, 1] × [0, 1] → R represents a grayscale image.

Problems involving the minimization of Per(E) lead to minimal surfaces, and their study has led to the development of the calculus of variations, differential geometry, and measure theory.

Page 44

All right, that’s all for the first class.

Don’t worry, a whole semester of Data awaits us!

Page 45

M 697
Today:

Least squares, k-NN, and some probability

Nestor Guillen

Page 46

Least Squares

We considered the situation where we were given data

x1, . . . , xN ⊂ Rp, y1, . . . , yN ⊂ R,

and wanted to minimize, over β ∈ Rp and β0 ∈ R,

RSS(β, β0) = ∑_{i=1}^N |xi · β + β0 − yi|².

Page 47

Least Squares

We may augment every xi ∈ Rp with a dummy 0-th coordinate,

x̃i = (1, xi1, xi2, . . . , xip)

so that xi · β + β0 = x̃i · β̃, where β̃ = (β0, β1, β2, . . . , βp).

Thus, relabeling the augmented x̃i and β̃ back to xi and β, it suffices to think of the problem of minimizing

RSS(β) = ∑_{i=1}^N |xi · β − yi|².
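In code, the augmentation is just a column of ones prepended to the data matrix; a short sketch (the data below are made up):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.standard_normal((50, 2))                      # any (N, p) input matrix
    y = X @ np.array([1.0, -1.0]) + 3.0                   # made-up data with intercept 3

    X_aug = np.column_stack([np.ones(len(X)), X])         # dummy 0-th coordinate equal to 1
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # (beta0, beta1, ..., betap)
    print(beta_aug)                                       # approximately [3, 1, -1]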

Page 48

Least Squares

From the data, one may define a linear map / N × p matrix

X : Rp → RN

given by Xβ := (x1 · β, . . . , xN · β), as well as a vector y ∈ RN

y = (y1, . . . , yN ).

minimize RSS(β) = ‖Xβ − y‖₂² over β ∈ Rp.

Page 49

Least Squares

∇RSS(β) = 2Xt (Xβ − y)

A vector β achieves the minimum if and only if it solves

Xt (Xβ − y) = 0

equivalently, if it solves

(XtX)β = Xty

Page 50

Least Squares
A quick linear algebra refresher

What does the transpose do? If M is an n×m matrix, then Mt is the m×n matrix with entries

(Mt)ij = Mji

the rows of one are the columns of the other.

An equivalent and very useful definition is given by the identity

(Mx) · y = x · (Mt y)   ∀ x ∈ Rm, y ∈ Rn.

Page 51

Least Squares
A quick linear algebra refresher

This is why, if we think of M as

M : Rm → Rn

then Mt corresponds to a map in the opposite direction

Mt : Rn → Rm

Page 52

Least Squares
XtX

Then, back to our equation

(XtX)β = Xty.

We see that XtX : Rp → Rp. Is this matrix invertible?

Observe that (XtX)β = 0 ⇒ (XtX)β · β = 0.

In this case, however,

0 = (XtX)β · β = Xt(Xβ) · β = Xβ · Xβ = |Xβ|²

Page 53

Least Squares
XtX

In other words (XtX)β = 0⇔ Xβ = 0.

Now, what does Xβ = 0 mean? Well, this is the same as

x1 · β = x2 · β = . . . = xN · β = 0

That is, β is orthogonal to all of the vectors x1, . . . , xN.

Conclusion:
XtX is invertible ⇔ x1, . . . , xN span Rp.

Page 54

Least Squares
X(XtX)−1Xt

Let us then consider only cases where x1, . . . , xN span Rp.

Then, the least squares solution β is unique and given by

β = (XtX)−1Xty

and the respective best linear fit for y is the vector

ŷ = Xβ = X(XtX)−1Xt y

Page 55

Least Squares
X(XtX)−1Xt

Note: The N ×N matrix,

P = X(XtX)−1Xt

is the orthogonal projection onto the image of the map X.

This means that

Pt = P, P2 = P, Im(P) = Im(X)
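These properties are easy to check numerically; a quick sketch with a random X whose columns are linearly independent (so that XtX is invertible):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal((30, 4))            # generic X, so X^t X is invertible

    P = X @ np.linalg.inv(X.T @ X) @ X.T        # orthogonal projection onto Im(X)

    print(np.allclose(P, P.T))                  # P^t = P
    print(np.allclose(P @ P, P))                # P^2 = P
    print(np.allclose(P @ X, X))                # P fixes Im(X)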

Page 56

Least Squares
A warning about (XtX)−1

An important issue in practice is when XtX is close to being singular, which may mean that

(XtX)−1

is too large, creating all kinds of computational issues.

Question: What needs to happen for XtX to be close to being non-invertible? How can those situations be avoided?
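One simple diagnostic is the condition number of XtX; a sketch (the nearly duplicated column below is a made-up example of what causes trouble):

    import numpy as np

    rng = np.random.default_rng(6)
    a = rng.standard_normal(100)
    X = np.column_stack([a, a + 1e-6 * rng.standard_normal(100)])  # nearly collinear columns

    print(np.linalg.cond(X.T @ X))   # huge condition number: (X^t X)^{-1} is badly behaved

    # Safer alternatives: np.linalg.lstsq / np.linalg.pinv, or regularization (ridge),
    # rather than forming (X^t X)^{-1} explicitly.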

Page 57

Nearest Neighbor

Let us move on to a different method: k-Nearest Neighbors

Page 58

k-Nearest Neighbor

As before, we start from a training data set

x1, . . . , xN ⊂ Rp, y1, . . . , yN ⊂ R

and assuming this corresponds to y = f(x), we estimate f .

Fix k ∈ N, then the k-Nearest Neighbor algorithm returns

f(x) = (1/k) ∑_{xi ∈ Nk(x)} yi

where

Nk(x) = {xi | xi is among the k data points closest to x}
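A minimal numpy sketch of this rule, with made-up training data and brute-force distance computation (which is exactly the cost issue flagged a few slides later):

    import numpy as np

    def knn_regress(x, X_train, y_train, k=5):
        """Average the y-values of the k training points closest to x."""
        d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
        nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
        return y_train[nearest].mean()

    rng = np.random.default_rng(7)
    X_train = rng.uniform(-1, 1, size=(200, 2))
    y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.standard_normal(200)
    print(knn_regress(np.array([0.2, -0.4]), X_train, y_train, k=5))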

Page 59

k-Nearest Neighbor
Voronoi Diagrams and 1-NN

The case k = 1 corresponds to simply taking

f(x) = f(xi), where xi is closest to x.

Note that in any case, f is a piecewise constant function.

The “level sets” of f are given by the Voronoi Diagram of the set {x1, . . . , xN}.
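scipy can draw such a diagram directly; a short sketch for a random planar point set:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial import Voronoi, voronoi_plot_2d

    rng = np.random.default_rng(8)
    pts = rng.uniform(0, 1, size=(30, 2))   # the training inputs x_1, ..., x_N

    vor = Voronoi(pts)                      # each cell = region where that x_i is nearest
    voronoi_plot_2d(vor)
    plt.show()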

Page 60

k-Nearest Neighbor
Voronoi Diagrams and 1-NN

Page 61

k-Nearest Neighbors

Nk(x) = {xi | xi is among the k data points closest to x}

Question: What if there are points xi and xj equidistant to x?

Page 62

k-Nearest Neighbors
Back to the classification example

Assume the data came from a classification problem with 2 classes, labeled yi = 1 (class 1) and yi = 0 (class 2), and set

f(x) := (1/k) ∑_{xi ∈ Nk(x)} yi

Since x is classified in class 1 or 2 according to whether f(x) > 0.5 or f(x) ≤ 0.5, the rule k-NN provides is the following:

Over half of Nk(x) lies in class 1 ⇒ x ∈ Class 1

At most half of Nk(x) lies in class 1 ⇒ x ∈ Class 2

Page 63

k-Nearest Neighbors

Important:
Especially in high dimensions, determining Nk(x) (even for k = 1) may entail a prohibitive computation time!

(we’ll discuss this further later today and in next class)

Page 64

Least Squares Versus k-Nearest Neighbor

What approach is best here?
Least Squares, or k-nearest neighbor?

The answer depends on how the data was generated.

For example:

If the data arose from two multivariate Gaussians: least squares.
If the data arose from many highly localized Gaussians: k-NN.

Page 65

Expected Prediction Error

Statistical Modeling can help us navigate and judge various methods to analyze data.

Page 66

Expected Prediction Error

Model
We have two random variables X and Y taking values in Rd and R, respectively.

We seek f : Rd → R which aims to capture the deterministic part of the relationship between X and Y.

Statistical Criterion:
Choose f so that it minimizes

E[|f(X) − Y|²]

This is known as the Expected Prediction Error (EPE).

Page 67

Expected Prediction Error

Recall the total expectation formula

E[|f(X)− Y |2] = E[E[|f(X)− Y |2 | X]]

In other words,

E[|f(X)− Y |2] = E[g(X)],

where

g(x) = E[|f(X)− Y |2 | X = x]

Page 68

Expected Squared Prediction Error

Minimizing g(x) pointwise in x, this leads to the best answer, which is given by

f(x) = E[Y | X = x]

This is often known as the statistical regression function.

Page 69

Expected Squared Prediction Error

What about a measure other than mean square error?

If one instead decides to minimize not the square but the absolute value,

E[|f(X) − Y|]

then, arguing as above, one is led to

f(x) = median[Y | X = x]

Page 70

Expected Squared Prediction Error

What about a measure other than mean square error?

Most generally, one could consider a loss function (or error function) L : R × R → R. Then, we want to minimize

E[L(Y, f(X))]

and the respective regression function would be defined as

f(x) := argmin_{g ∈ R} E[L(Y, g) | X = x]

Page 71

Expected Squared Prediction Error

What about a measure other than mean square error?

In the context of classification, a popular loss function is the zero-one loss.

The Bayes classifier is the associated regression function:

f(x) := argmax_g P[Y = g | X = x]

Page 72

Expected Squared Prediction Error
The return of Least Squares and k-NN

How may we estimate E[Y | X = x]?

As it turns out, both Least Squares and k-NN are effectively approximating this conditional expectation – under various regimes and assumptions, that is.

Page 73

Expected Squared Prediction Error

k-NN.

Fix k, and let

f(x) := (1/k) ∑_{xi ∈ Nk(x)} yi

Under rather mild assumptions*, one can show that as the sample size N → ∞ (with k → ∞ suitably slowly), we have

f(x)→ E[Y | X = x]

(*Recall the Law of Large Numbers!)

Page 74

Expected Squared Prediction Error

Least Squares.

Going back to

EPE(f) := E[|f(X)− Y |2]

Assume that the minimizer f is linear, or well approximated by a linear function. Then, it makes sense to find the minimizer of EPE(f) within the class of linear functions.

Page 75

Expected Squared Prediction Error

Least Squares.

In this case, let us write

fβ(x) = β · x

and seek the minimizer of

β ↦ EPE(fβ), β ∈ Rd.

Page 76

Expected Squared Prediction Error

Least Squares.

This is a convex functional in β, so the minimizer is given by

∇βEPE(fβ) = 0

That is

E[(X · β − Y )X] = 0

Which may be rewritten as

E[X ⊗ X] β = E[Y X]

Page 77

Expected Squared Prediction Error

Least Squares is back!

If we replace the expectations by averages over the training data, then the solution of

E[X ⊗ X] β = E[Y X]

corresponds to the least squares solution!

The moral of the story: both k-NN and Least Squares correspond to the minimization of the Expected Squared Prediction Error, the former when we expect the regression function to be well approximated locally by constants, and the latter when minimizing within affine functions only.

Page 78

The first week, in one slide

1. Learning problems: supervised and unsupervised.

2. Supervised learning: regression (quantitative) and classification (categorical).

3. ML problems lead naturally to variational problems.

4. Least Squares yields a regression method that is rigid.

5. k-Nearest Neighbors is a regression method which is flexible, but less stable than least squares.

6. Learning methods/models involve parameters that quantify their complexity. One must decide what degree of variance, bias, and stability/rigidity is best for a given problem.

7. Given a statistical model for our data, minimization of EPEleads (under various regimes) to Least Squares or k-NN.