
IEOR 265 – Lecture Notes: Convex Geometry

1 Classifying Regression Methods

Suppose we have a system in which an input (also known as predictors) X ∈ R^p gives an output (also known as a response) Y ∈ R, and suppose there is a static relationship between X and Y that is given by Y = f(X) + ε, where ε is zero-mean noise with finite variance (i.e., E(ε) = 0 and var(ε) = σ^2 < ∞). We will also assume that ε is independent of X.

The process of statistical modeling involves using measured data to identify the relationship between X and Y, meaning identify E[Y|X] = f(X). This is a huge topic of inquiry, but in this course we will categorize this regression problem into three classes:

1. Parametric Regression – The unknown function f(X) is characterized by a finite number of parameters. It is common to think of f(X; β), where β ∈ R^p is a vector of unknown parameters. The simplest example is a linear model, in which we have

f(X; β) = ∑_{j=1}^p β_j X_j.

This approach is used when there is strong a priori knowledge about the structure of the system (e.g., physics, biology, etc.).

2. Nonparametric Regression – The unknown function f(X) is characterized by an infinite number of parameters. For instance, we might want to represent f(X) as an infinite polynomial expansion

f(X) = β_0 + β_1 X + β_2 X^2 + ⋯.

This approach is used when there is little a priori knowledge about the structure of the system. Though it might seem that this approach is superior because it is more flexible than parametric regression, it turns out that one must pay a statistical penalty because of the need to estimate a greater number of parameters.

3. Semiparametric Regression – The unknown function f(X) is characterized by a component with a finite number of parameters and another component with an infinite number of parameters. In some cases, the infinite number of parameters are known as nuisance parameters; however, in other cases this infinite component might have useful information in and of itself. A classic example is a partially linear model:

f(X) = ∑_{j=1}^m β_j X_j + g(X_{m+1}, …, X_k).


Here, the g(X_{m+1}, …, X_k) term is represented nonparametrically, and the ∑_{j=1}^m β_j X_j term is the parametric component.

This categorization is quite crude because in some problems the classes can blend into each other. For instance, high-dimensional parametric regression can be thought of as nonparametric regression. A key tool in regression is regularization: the idea of regularization is to improve the statistical properties of estimates by imposing additional structure on the model.

2 Ordinary Least Squares

Suppose that we have pairs of independent and identically distributed (i.i.d.) measurements (X_i, Y_i) for i = 1, …, n, where X_i ∈ R^p and Y_i ∈ R, and that the system is described by a linear model

Y_i = ∑_{j=1}^p β_j X_{ij} + ε_i = X_i^T β + ε_i.

Ordinary least squares (OLS) is a method to estimate the unknown parameters β ∈ R^p given our n measurements. Because the Y_i are noisy measurements (whereas the X_i are not noisy in this model), the intuitive idea is to choose an estimate β̂ ∈ R^p which minimizes the difference between the measured Y_i and the fitted values Ŷ_i = X_i^T β̂.

There are a number of ways that we could characterize this difference. For mathematical and computational reasons, a popular choice is the squared loss: the difference is quantified as ∑_i (Y_i − Ŷ_i)^2, and the resulting problem of choosing β̂ to minimize this difference can be cast as the following (unconstrained) optimization problem:

β̂ = arg min_β ∑_{i=1}^n (Y_i − X_i^T β)^2.

For notational convenience, we will define a matrix X ∈ R^{n×p} and a vector Y ∈ R^n such that the i-th row of X is X_i^T and the i-th entry of Y is Y_i. With this notation, the OLS problem can be written as

β̂ = arg min_β ‖Y − Xβ‖_2^2,

where ‖·‖_2 is the usual L2-norm. (Recall that for a vector v ∈ R^k the L2-norm is ‖v‖_2 = √((v_1)^2 + … + (v_k)^2).)

Now given this notation, we can solve the above optimization problem. Because the problem is unconstrained, setting the gradient of the objective to zero and solving the resulting algebraic equation will give the solution. For notational convenience, we will use the function J(X, Y; β) to refer to the objective of the above optimization problem. Computing its gradient gives

∇_β J = −2X^T(Y − Xβ) = 0 ⟹ X^T X β = X^T Y ⟹ β̂ = (X^T X)^{-1} X^T Y.

This is the OLS estimate of β for the linear model.
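As a concrete illustration, here is a short Python sketch (the language, the simulated data, and the dimensions are choices made for this note, not part of the original derivation) that forms the OLS estimate both from the normal equations derived above and from numpy's least-squares routine; the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the linear model Y = X beta + eps with i.i.d. Gaussian noise.
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# OLS via the normal equations X^T X beta = X^T Y derived above.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferred route: least squares on (X, Y) directly.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_hat_lstsq))  # True
```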

2.1 Geometric Interpretation of OLS

Recall the optimization formulation of OLS,

β̂ = arg min_β ‖Y − Xβ‖_2^2,

where the variables are as defined before. The basic tension in the problem above is that in general no exact solution exists to the linear equation

Y = Xβ;

otherwise we could use linear algebra to compute β, and this value would be a minimizer of the optimization problem written above.

Though no exact solution exists to Y = Xβ, an interesting question to ask is whether there is some related linear equation for which an exact solution exists. Because the noise is in Y and not X, we can imagine that we would like to pick some Ỹ such that Ỹ = Xβ has an exact solution. Recall from linear algebra that this is equivalent to asking that Ỹ ∈ R(X) (i.e., Ỹ is in the range space of X). Now if we think of Ỹ as the true signal, then we can decompose Y as

Y = Ỹ + ∆Y,

where ∆Y represents orthogonal noise. Because – from Fredholm’s theorem in linear algebra – we know that the range space of X is orthogonal to the null space of X^T (i.e., R(X) ⊥ N(X^T)), it must be the case that ∆Y ∈ N(X^T), since we defined Ỹ such that Ỹ ∈ R(X). As a result, premultiplying Y = Ỹ + ∆Y by X^T gives

X^T Y = X^T Ỹ + X^T ∆Y = X^T Ỹ.

The intuition is that premultiplying by X^T removes the noise component. And because Ỹ ∈ R(X) and Ỹ = Xβ, we must have that

X^T Y = X^T Ỹ = X^T X β.

Solving this gives β̂ = (X^T X)^{-1} X^T Y, which is our usual equation for the OLS estimate.
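A quick numerical check of this picture, again on simulated data (an illustration added to these notes, not part of the original argument): the residual Y − Xβ̂ plays the role of ∆Y and is, up to floating-point error, annihilated by X^T.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 4
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
residual = Y - X @ beta_hat            # plays the role of Delta_Y
print(np.abs(X.T @ residual).max())    # ~1e-13: the residual lies in N(X^T)
```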


2.2 Challenges with High Dimensionality

In modern data sets, it is common for p (the number of predictors/parameters) to be of the same order of magnitude as n (the number of measurements), or even larger. This is highly problematic because the statistical error of OLS increases linearly in p and is inversely proportional to n. (More formally, the mean squared error of OLS scales as O_p(p/n).) Consequently, OLS is not statistically well-behaved in this modern high-dimensional setting. To be able to develop estimators that work in this setting, we are required to first better understand the geometry of high-dimensional convex bodies.
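The p/n scaling can be seen in simulation. The sketch below is a Monte Carlo illustration with a Gaussian design (the comparison value σ^2 p/(n − p − 1), which is the exact mean squared error for this particular design, is an added fact for reference and not from the notes).

```python
import numpy as np

rng = np.random.default_rng(2)

def ols_mse(n, p, sigma=1.0, trials=200):
    """Monte Carlo estimate of E||beta_hat - beta||_2^2 for OLS with a Gaussian design."""
    total = 0.0
    beta = np.zeros(p)                       # true parameter (its value does not matter)
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        Y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
        total += np.sum((beta_hat - beta) ** 2)
    return total / trials

n = 500
for p in (5, 50, 250):
    # For this design the exact value is sigma^2 * p / (n - p - 1), roughly p/n.
    print(p, round(ols_mse(n, p), 3), round(p / (n - p - 1), 3))
```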

3 High-Dimensional Convex Bodies

Recall that K ⊆ R^p is a convex set if for all x, y ∈ K and λ ∈ [0, 1] it holds that

λ · x + (1 − λ) · y ∈ K.

In low dimensions, the volume of convex bodies is distributed fairly evenly over the body; a representative example is a generic two-dimensional polytope.

The situation is markedly different in high dimensions. Consider a convex body K, which is a convex set in R^p that is (i) closed, (ii) bounded, and (iii) has a nonempty interior. Furthermore, suppose K is isotropic, meaning that if a random variable X is uniformly distributed on K then it has the properties that

E(X) = 0,
E(XX^T) = I,

where I is the p×p identity matrix. It turns out that the majority of the volume of K is concentrated about the ball with radius √p. More formally, for every t ≥ 1 we have

P(‖X‖_2 > t√p) ≤ exp(−ct√p).


Moreover, the majority of the volume of K is found in a thin shell around the ball of radius √p: for every ε ∈ (0, 1), we have

P(|‖X‖_2 − √p| > ε√p) ≤ C exp(−cε^3 p^{1/2}).

Note that in the two results above, C, c are positive absolute constants. As a result, a more intuitive picture of a convex body in high dimensions is a “hyperbolic” one. This “hyperbolic” picture is more intuitive because it shows that the convex body can be characterized as consisting of a bulk and outliers, where (i) the bulk of the volume is concentrated in a ball of small radius, (ii) the outliers have large radius, and (iii) the volume of the outliers is exponentially decreasing away from the bulk.
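A hedged numerical illustration (not from the notes): the uniform distribution on the cube [−√3, √3]^p is isotropic, so by the results above its Euclidean norm should concentrate tightly around √p as p grows.

```python
import numpy as np

rng = np.random.default_rng(3)

# Uniform on [-sqrt(3), sqrt(3)]^p has zero mean and unit variance per coordinate,
# hence is isotropic; its norm should concentrate around sqrt(p).
for p in (10, 100, 1000):
    X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(5000, p))
    norms = np.linalg.norm(X, axis=1)
    print(p, round(norms.mean() / np.sqrt(p), 4), round(norms.std() / np.sqrt(p), 4))
```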

4 Metric Entropy

The majority of volume in a high-dimensional convex body is concentrated in a small radius. However, this does not completely characterize the important properties of convex bodies. For instance, consider the unit ℓ1- and ℓ∞-balls, that is,

B1 = {x ∈ R^p : ‖x‖_1 ≤ 1},
B∞ = {x ∈ R^p : ‖x‖_∞ ≤ 1},

where ‖x‖_1 = ∑_{j=1}^p |x_j| is the ℓ1-norm and ‖x‖_∞ = max_j |x_j| is the ℓ∞-norm. Though these balls have the same unit radius, the ℓ1-ball B1 has 2p vertices whereas the ℓ∞-ball B∞ has 2^p vertices. In other words, the polytope B∞ is significantly more complex than B1.

Given this large discrepancy in the complexity of these two balls, it is natural to ask whether it is possible to define some measure of complexity for general convex bodies. In fact, there is a broader question of whether it is possible to define some measure of complexity for arbitrary subsets of R^p. We cannot simply count the number of vertices of the subsets, because in general these sets will not be polytopes. It turns out that the answer to the above questions is an emphatic “yes”. The situation is in fact richer, because it turns out that there are several interesting notions of complexity for subsets of R^p.


4.1 Definition of Metric Entropy

One simple notion of complexity is known as the “covering number” or “metric entropy” (they are related by a logarithm). Given sets T ⊂ R^p and L ⊂ R^p, the covering number N(T, L) is the minimum number of translates of L needed to cover T. The metric entropy is then defined as log N(T, L). An important special case is when L = εB2, where B2 = {x ∈ R^p : ‖x‖_2 ≤ 1} is the unit ℓ2-ball. (Note that εB2 = {x ∈ R^p : ‖x‖_2 ≤ ε}.) Unfortunately, it is difficult to get accurate bounds on the covering number or metric entropy for an arbitrary set T.
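For intuition, covering numbers can be upper bounded computationally on a finite sample of a set by a greedy construction. The sketch below is an illustration with made-up parameters; it only covers the sampled points, so it gives a rough proxy rather than the true covering number of the underlying set.

```python
import numpy as np

rng = np.random.default_rng(4)

def greedy_cover(points, eps):
    """Greedily build an eps-net of a finite point cloud; len(centers) upper
    bounds the covering number of the cloud by eps-balls."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = points[np.argmax(uncovered)]             # first still-uncovered point
        centers.append(c)
        dist = np.linalg.norm(points - c, axis=1)
        uncovered &= dist > eps                      # everything within eps is now covered
    return centers

# Sample points uniformly from the unit l2-ball in R^3.
p, n = 3, 2000
g = rng.standard_normal((n, p))
points = g / np.linalg.norm(g, axis=1, keepdims=True) * rng.random(n) ** (1.0 / p)
print(len(greedy_cover(points, eps=0.5)))            # rough proxy for N(B2, 0.5*B2)
```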

4.2 Covering Number Estimate using Volume

One approach to bounding the covering number is to compare the volumes of the sets T and L. In particular, we have that

vol(T) / vol(L) ≤ N(T, L) ≤ vol(T ⊕ (1/2)L) / vol((1/2)L),

where U ⊕ V = {u + v : u ∈ U, v ∈ V} is the Minkowski sum. A useful corollary of this result is that if T is a symmetric convex set, then for every ε > 0 we have

(1/ε)^p ≤ N(T, εT) ≤ (2 + 1/ε)^p.
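To get a feel for the sizes involved, here is a small numeric evaluation of the two sides of the corollary (the dimension and the values of ε are arbitrary choices made for this illustration).

```python
p = 50
for eps in (1.0, 0.5, 0.25):
    lower = (1.0 / eps) ** p            # vol(T) / vol(eps*T)
    upper = (2.0 + 1.0 / eps) ** p      # the corollary's upper bound
    print(f"eps={eps}: {lower:.3e} <= N(T, eps*T) <= {upper:.3e}")
```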

4.3 Example: Covering Number of `2-Ball

Consider the unit ℓ2-ball B2 = {x : ‖x‖_2 ≤ 1}. Then the covering number of B2 by εB2 is bounded by

(1/ε)^p ≤ N(B2, εB2) ≤ (2 + 1/ε)^p.

For the ℓ2-ball with radius λ, which is λB2, we thus have that the covering number by εB2 is bounded by

(λ/ε)^p ≤ N(λB2, εB2) = N(B2, (ε/λ)B2) ≤ (2 + λ/ε)^p.

This is exponential in p.

5 Gaussian Average

Another interesting notion of complexity is known as a “Gaussian average” or “Gaussian width”. Given a set T ⊂ R^p, we define the Gaussian average as

w(T) = E(sup_{t∈T} g^T t),

where g ∈ R^p is a Gaussian random vector with zero mean and identity covariance matrix (or equivalently, a random vector where each entry is an i.i.d. Gaussian random variable with zero mean and unit variance). Unfortunately, computing the Gaussian average for a given set T can be difficult unless T has some simple structure.
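When T is a finite set of points, the supremum is a maximum and the Gaussian average can be estimated directly by Monte Carlo. The sketch below is only an illustration; the set T here is an arbitrary random point cloud rather than anything from the notes.

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_average(T, trials=20000):
    """Monte Carlo estimate of w(T) = E sup_{t in T} <g, t> for a finite set T (rows)."""
    g = rng.standard_normal((trials, T.shape[1]))
    return (g @ T.T).max(axis=1).mean()

T = rng.standard_normal((30, 10))        # 30 arbitrary points in R^10
print(gaussian_average(T))
```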

5.1 Invariance Under Convexification

One of the most important properties (from the standpoint of high-dimensional statistics) of the Gaussian average is that

w(conv(T)) = w(T),

where conv(T) denotes the convex hull of T. The proof of this is simple and instructive. First, note that w(conv(T)) ≥ w(T), since T ⊆ conv(T). Second, note that if t ∈ conv(T), then by Caratheodory’s theorem it can be represented as t = ∑_{j=1}^{p+1} μ_j t_j, where μ_j ∈ [0, 1], ∑_j μ_j = 1, and t_j ∈ T. As a result, we have

w(conv(T)) = E(sup_{μ_j∈[0,1], ∑_j μ_j=1, t_j∈T} g^T(∑_{j=1}^{p+1} μ_j t_j))
           ≤ E(sup_{μ_j∈[0,1], ∑_j μ_j=1, t_j∈T} max_j g^T t_j)
           = E(sup_{t_j∈T} max_j g^T t_j)
           = E(sup_{t∈T} g^T t)
           = w(T).

The first inequality follows because g^T(∑_{j=1}^{p+1} μ_j t_j) is linear in the μ_j, and because we have that μ_j ∈ [0, 1] and ∑_j μ_j = 1. Since we have shown that w(conv(T)) ≥ w(T) and w(conv(T)) ≤ w(T), it must be the case that w(conv(T)) = w(T). Note that this equivalence also follows by noting that the maximum of a convex function over a closed convex set is attained at an extreme point of the convex set.
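The key inequality in the proof (a linear function evaluated at a convex combination never exceeds its maximum over the combined points) is easy to check numerically. The sketch below samples random points of the simplex via a Dirichlet distribution; the set, dimension, and sampling scheme are implementation choices made here, not something from the notes.

```python
import numpy as np

rng = np.random.default_rng(6)

p, m = 20, 15
T = rng.standard_normal((m, p))                    # a finite set of m points in R^p
g = rng.standard_normal(p)                         # one fixed Gaussian direction

weights = rng.dirichlet(np.ones(m), size=10000)    # random convex weights mu
combos = weights @ T                               # points of conv(T)
print((g @ combos.T).max(), (g @ T.T).max())       # first value never exceeds the second
```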

5.2 Sudakov’s Minoration

It turns out there is a relationship between the Gaussian average and the metric entropy of a set. One such relationship is known as Sudakov’s minoration. In particular, if T is a symmetric set (i.e., if t ∈ T, then −t ∈ T), then we have

√(log N(T, εB2)) ≤ C · w(T) / ε,

where C is an absolute constant. This is a useful relationship because it allows us to upper bound the metric entropy of a set if we can compute its Gaussian average.

5.3 Example: Gaussian Average of `2-Balls

Consider the set λB2 = {x ∈ R^p : ‖x‖_2 ≤ λ}. Its Gaussian average is defined as

w(λB2) = E(sup_{t∈λB2} g^T t).


The quantity g^T t is maximized whenever t points in the direction of g (i.e., t ∝ g/‖g‖_2) and has length ‖t‖_2 = λ. Thus, the quantity is maximized for t = λ · g/‖g‖_2. As a result, the Gaussian average is

w(λB2) = λ · E(‖g‖_2).

Since ‖g‖_2 has a chi distribution (i.e., it is the square root of a chi-squared random variable), standard results about this distribution give that

cλ√p ≤ w(λB2) ≤ λ√p,

where c is an absolute positive constant. Finally, we can use Sudakov’s minoration to upper bound the covering number:

N(λB2, εB2) ≤ exp(C^2 λ^2 p / ε^2).

This is exponential in p, just like the previous bound.
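A quick check of the λ√p scaling (illustration only; the closed form for the chi mean, √2 Γ((p+1)/2)/Γ(p/2), is a standard fact added here for comparison and not stated in the notes):

```python
import math
import numpy as np

rng = np.random.default_rng(7)

p = 100
g = rng.standard_normal((20000, p))
estimate = np.linalg.norm(g, axis=1).mean()        # Monte Carlo E||g||_2 = w(B2)
exact = math.sqrt(2) * math.exp(math.lgamma((p + 1) / 2) - math.lgamma(p / 2))
print(round(estimate, 3), round(exact, 3), math.sqrt(p))   # all close to sqrt(p)
```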

5.4 Example: Gaussian Average of `1-Balls

Consider the set λB1 = {x ∈ R^p : ‖x‖_1 ≤ λ}. Its Gaussian average is defined as

w(λB1) = E(sup_{t∈λB1} g^T t).

By Hölder’s inequality, we know that g^T t ≤ |g^T t| ≤ ‖g‖_∞ · ‖t‖_1. Thus, we can bound the Gaussian average by

w(λB1) = E(sup_{t∈λB1} g^T t) ≤ E(sup_{t∈λB1} ‖g‖_∞ · ‖t‖_1) = λ · E(max_j |g_j|).

Hence, the question is how to upper bound E(max_j |g_j|).

1. The first observation is that

max_j |g_j| = max(max_j g_j, max_j (−g_j)).

2. The second observation is that if V_i for i = 1, …, n are (not necessarily independent) Gaussian random variables with zero mean and variance σ^2, then we have

exp(u E(max_i V_i)) ≤ E(exp(u · max_i V_i)) = E(max_i exp(u V_i)) ≤ ∑_{i=1}^n E(exp(u V_i)) ≤ n exp(u^2 σ^2 / 2),


where the first inequality follows from Jensen’s inequality, the equality holds because exp is increasing (for u > 0), and the last step uses the moment generating function of a Gaussian, E(exp(uV_i)) = exp(u^2 σ^2/2). Taking the logarithm of both sides of the above inequality, and then rearranging terms, gives

E(max_i V_i) ≤ log n / u + u σ^2 / 2.

We can tighten this bound by choosing u to minimize the right-hand side of the above equation. The minimizing value makes the derivative equal to zero, meaning

−log n / u^2 + σ^2 / 2 = 0 ⟹ u = √(2 log n) / σ.

Substituting this value of u into the upper bound yields

E(max_i V_i) ≤ σ√(2 log n).

Since σ^2 = 1 for the g_j and −g_j, combining these two observations (a maximum over n = 2p Gaussians) gives that

E(max_j |g_j|) ≤ √(2 log 2p) = √(2 log 2 + 2 log p) ≤ √(4 log p),

whenever p ≥ 2.

Consequently, we have that the Gaussian average of the ℓ1-ball is

w(λB1) ≤ λ · √(4 log p)

whenever p ≥ 2. We can use Sudakov’s minoration to upper bound the covering number:

N(λB1, εB2) ≤ exp(C^2 λ^2 · 4 log p / ε^2) = (exp(log p))^{4C^2 λ^2/ε^2} = p^{cλ^2/ε^2},

where c is an absolute constant. Interestingly, this is polynomial in p, which is in contrast to the covering number of λB2.
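The √(2 log 2p) bound on E(max_j |g_j|) is easy to probe by simulation (illustration only; the specific p and sample sizes are arbitrary choices made here):

```python
import numpy as np

rng = np.random.default_rng(8)

p = 1000
g = rng.standard_normal((5000, p))
estimate = np.abs(g).max(axis=1).mean()                 # Monte Carlo E max_j |g_j|
print(round(estimate, 3),
      round(np.sqrt(2 * np.log(2 * p)), 3),             # the sqrt(2 log 2p) bound
      round(np.sqrt(4 * np.log(p)), 3))                 # the looser sqrt(4 log p) bound

lam = 3.0                                               # w(lam*B1) <= lam * E max_j |g_j|
print(round(lam * estimate, 3), round(lam * np.sqrt(4 * np.log(p)), 3))
```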

5.5 Example: Gaussian Average of `∞-Balls

Consider the set λB∞ = {x ∈ R^p : ‖x‖_∞ ≤ λ}. Its Gaussian average is defined as

w(λB∞) = E(sup_{t∈λB∞} g^T t).

By Hölder’s inequality, we know that g^T t ≤ |g^T t| ≤ ‖g‖_1 · ‖t‖_∞. Thus, we can bound the Gaussian average by

w(λB∞) = E(sup_{t∈λB∞} g^T t) ≤ E(sup_{t∈λB∞} ‖g‖_1 · ‖t‖_∞) = λ · E(‖g‖_1).


But note that ‖g‖_1 = max_{s_i∈{−1,1}} ∑_{i=1}^p s_i g_i. Since the g_i are i.i.d. Gaussians with zero mean and unit variance, for fixed s_i the quantity ∑_{i=1}^p s_i g_i is a Gaussian with zero mean and variance p. Thus, using our earlier bound gives

E(‖g‖_1) = E(max_{s_i∈{−1,1}} ∑_{i=1}^p s_i g_i) ≤ √(4p log 2^p) = √(4 log 2 · p^2).

Thus, the Gaussian average is

w(λB∞) ≤ √(4 log 2) · λ · p.

We can use Sudakov’s minoration to upper bound the covering number:

N(λB∞, εB2) ≤ exp(C^2 λ^2 · 4 log 2 · p^2 / ε^2),

where C is an absolute constant. This is exponential in p (indeed, the exponent grows like p^2).
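Again a quick simulation check (illustration only; the exact value p√(2/π), which follows from the standard fact E|g_1| = √(2/π), is added here for comparison and is not stated in the notes):

```python
import numpy as np

rng = np.random.default_rng(9)

p = 200
g = rng.standard_normal((20000, p))
estimate = np.abs(g).sum(axis=1).mean()                 # Monte Carlo E||g||_1 = w(B_inf)
print(round(estimate, 2),
      round(p * np.sqrt(2 / np.pi), 2),                 # exact value, about 0.80 * p
      round(np.sqrt(4 * np.log(2)) * p, 2))             # the notes' bound, about 1.67 * p
```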

6 Rademacher Average

Another notion of complexity is known as a “Rademacher average” or “Rademacher width”. Given a set T ⊂ R^p, we define the Rademacher average as

r(T) = E(sup_{t∈T} ε^T t),

where ε ∈ R^p is a Rademacher random vector, meaning each component ε_i is independent and P(ε_i = ±1) = 1/2.

Similar to the Gaussian average, computing the Rademacher average for a given set T can be difficult unless T has some simple structure. Though many results for Gaussian averages have analogs for Rademacher averages, Rademacher averages are more difficult to work with because many comparison results for Gaussian averages do not have analogs for Rademacher averages. However, it is the case that Rademacher and Gaussian averages are equivalent in the sense that

c · r(T) ≤ w(T) ≤ C · log p · r(T),

where c, C are absolute constants.
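The ℓ1-ball is a convenient example for seeing that the two averages can genuinely differ: over the vertices {±e_j} of B1 the Rademacher average is exactly 1, while the Gaussian average grows like √(2 log 2p), so the ratio w/r grows with p (consistent with the logarithmic factor in the equivalence above). The Monte Carlo sketch below is only an illustration; the set, dimension, and sample sizes are choices made here.

```python
import numpy as np

rng = np.random.default_rng(10)

def width(T, draw, trials=5000):
    """Monte Carlo E sup_{t in T} <z, t> over the rows of T, with z drawn by `draw`."""
    z = draw((trials, T.shape[1]))
    return (z @ T.T).max(axis=1).mean()

p = 500
T = np.vstack([np.eye(p), -np.eye(p)])                  # vertices of the unit l1-ball
w = width(T, lambda shape: rng.standard_normal(shape))
r = width(T, lambda shape: rng.integers(0, 2, size=shape) * 2.0 - 1.0)
print(round(r, 3), round(w, 3), round(np.sqrt(2 * np.log(2 * p)), 3))
```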

7 More Reading

The material in the last sections follows that of


R. Vershynin (2015) Estimation in high dimensions: a geometric perspective, in Sampling Theory, a Renaissance. Springer.

R. Vershynin (2018) High-Dimensional Probability. Cambridge University Press.

M. Ledoux and M. Talagrand (1991) Probability in Banach Spaces. Springer.
