
Stochastic Optimization of Multiple Objectives and Supervised Machine Learning

Luis Nunes Vicente

ISE, Lehigh University

Dept. of Industrial Engineering, Univ. of Pittsburgh

November 14, 2019

Presentation outline

1 Introduction to optimization models in Data Science and Learning

2 Stochastic gradient descent for Stochastic Optimization

3 Multi-Objective Optimization

4 Stochastic Multi-Objective Optimization
   The stochastic multi-gradient algorithm and assumptions
   Convergence rates in the strongly convex and convex cases

5 Implementation and numerical results

6 Conclusions and future directions


Optimization models in Data Science and Learning

A data set for analysis involving optimization is typically of the form

D = {(a_j, y_j), j = 1, ..., N}

where the a_j vectors are features or attributes, and the y_j vectors are labels, observations, or responses.

The analysis consists of finding a prediction function φ(a_j) such that

φ(a_j) ≃ y_j, j = 1, ..., N

in some optimal sense.

Often, φ is parameterized, φ(x) = φ(a; x). The parameters are x.


Notes:

1 The process of finding φ is called learning or training.

2 When the y_j's are reals, one has a regression problem.

3 When the y_j's lie in a finite set {1, ..., M}, one has a classification problem. M = 2 leads to binary classification.

4 The labels may be null. In that case, one may want:
   to group the a_j's in clusters (clustering), or
   to identify a low-dimensional subspace (or a collection of subspaces) where the a_j's lie (subspace identification).
   The labels may have to be learned while learning φ.


5 Data is assumed to be clean for optimization, but still:
   i) the (a_j, y_j)'s could be noisy or corrupted;
   ii) some a_j's or y_j's could be missing;
   iii) data could arrive in streaming fashion (φ must be learned online).
   Thus, φ has to be robust to changes in the data set.

Such data analysis is often referred to as machine learning or data mining:

unsupervised learning (when labels are null): extract interesting information from the data;

predictive or supervised learning (when labels exist).


The correct/accurate match of the data is typically quantified by a loss function ℓ(a, y; φ(x)), and thus learning can be formulated as

min_x  ∑_{j=1}^N ℓ(a_j, y_j; φ(x))

To avoid overfitting such a model to a sample D, one often adds a regularizer g(x),

min_x  ∑_{j=1}^N ℓ(a_j, y_j; φ(x)) + g(x),    e.g., g(x) = (λ/2)‖x‖₂² or g(x) = λ‖x‖₁,

so that φ is not so sensitive to changes in D.


The general form of optimization models in DS is

min_{x ∈ R^n}  f(x) + g(x)

where f: R^n → R is smooth, with ∇f at least Lipschitz continuous, and g: R^n → R ∪ {+∞} is convex, proper, and closed (g is typically non-smooth).


Classical example: logistic regression for binary classification

Consider a data set with y ∈ {−1, 1}.

[Figure: 2D feature space (a_1, a_2), with the data points of classes y = 1 and y = −1 separated by the line φ(x) = r⊤a − b.]

φ(x) = φ(a; r, b) = r⊤a − b is a linear classifier. One aims to find an optimal parameter x = (r, b) such that

r⊤a_j − b ≥ 0 when y_j = 1,
r⊤a_j − b < 0 when y_j = −1,
∀ j = 1, ..., N.    (∗)


Logistic regression can be motivated by two arguments.

1 Smoothing/convexifying the true loss.

What we want to do is

min 1(sign(z) ≠ y), where z = φ(x).

[Figure: the 0-1 loss 1(sign(z) ≠ +1) as a function of z, for y = +1.]

Smoothing/convexifying to the Hinge loss (SVM): max{0, −(+1)z}.

Smoothing to the Logistic loss: log(1 + e^{−(+1)z}).

[Figure: the Hinge loss and the Logistic loss as functions of z, for y = +1.]
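As a quick illustration (a minimal NumPy sketch written for this transcript, not taken from the talk; the loss expressions follow the slide above), the three losses can be evaluated side by side:

```python
import numpy as np

# Compare the true 0-1 loss with its hinge and logistic surrogates,
# as functions of z = phi(x), for a fixed label y (here y = +1).

def zero_one_loss(z, y):
    return (np.sign(z) != y).astype(float)      # 1(sign(z) != y)

def hinge_loss(z, y):
    return np.maximum(0.0, -y * z)              # max{0, -y z}, as on the slide

def logistic_loss(z, y):
    return np.log1p(np.exp(-y * z))             # log(1 + e^{-y z})

z = np.linspace(-3, 3, 7)
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(f"{name:8s}", np.round(loss(z, +1), 3))
```

Both surrogates are convex in z, and the logistic one is also smooth, which is what makes the optimization tractable.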

2 Probabilistic argument: (∗) is equivalent to

σ(φ(x)) = 1 / (1 + e^{−φ(x)})  { ≥ 0.5 when y = 1,  < 0.5 when y = −1 }

[Figure: the sigmoid 1/(1 + e^{−z}), increasing from 0 to 1 and crossing 0.5 at z = 0.]

Think of it as the probability

P(y = 1 | a) = σ(φ(x)) = 1 / (1 + e^{−(+1)φ(x)})
P(y = −1 | a) = 1 − σ(φ(x)) = 1 / (1 + e^{−(−1)φ(x)})

Assuming data points are i.i.d., the maximum likelihood problem is

max ∏_{j=1}^N P(y_j | a_j) = ∏_{j=1}^N 1 / (1 + e^{−y_j φ(x)})

where each factor P(y_j | a_j) is the probability of a correct prediction.


Then, the negative log likelihood leads to the loss minimization problem

min_{r,b}  (1/N) ∑_{j=1}^N ℓ_L(a_j, y_j; φ(x))

with the smooth convex logistic loss function

ℓ_L(a_j, y_j; φ(x)) = log(1 + e^{−y_j(r⊤a_j − b)})

Binary classification is formulated as the optimization problem

min_{r,b}  (1/N) ∑_{j=1}^N ℓ_L(a_j, y_j; φ(x)) + (λ/2)‖r‖₂²

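A minimal NumPy sketch of this objective and its gradient (an illustration written for this transcript, not code from the talk), with x = (r, b) and φ(a; r, b) = r⊤a − b:

```python
import numpy as np

def logistic_objective(r, b, A, y, lam):
    """(1/N) sum_j log(1 + exp(-y_j (r^T a_j - b))) + (lam/2) ||r||_2^2."""
    z = A @ r - b                        # A is N x n, one feature vector a_j per row
    return np.log1p(np.exp(-y * z)).mean() + 0.5 * lam * (r @ r)

def logistic_gradient(r, b, A, y, lam):
    """Gradient of the objective above with respect to (r, b)."""
    z = A @ r - b
    s = 1.0 / (1.0 + np.exp(y * z))      # sigma(-y z), the per-sample weight
    grad_r = -(A.T @ (y * s)) / len(y) + lam * r
    grad_b = (y * s).mean()              # derivative of -y_j (r^T a_j - b) w.r.t. b is +y_j
    return grad_r, grad_b

# Sanity check: at the zero classifier the average loss is log(2).
rng = np.random.default_rng(0)
A, y = rng.standard_normal((100, 5)), rng.choice([-1, 1], size=100)
print(logistic_objective(np.zeros(5), 0.0, A, y, lam=0.1))   # ~0.693
```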


Stochastic gradient descent for Stochastic Optimization

Assume a point (a, y) ∈ D arises with a certain (joint) probability P_Ω.

The general goal now is to minimize the expected risk of misclassification

R(x) = ∫ ℓ(a, y; φ(x)) dP_Ω(a, y)

When P_Ω is unknown, only having a sample D leads to minimizing the empirical risk of misclassification

R_N(x) = (1/N) ∑_{j=1}^N ℓ(a_j, y_j; φ(x))

as an estimate of R(x).


For simplicity, let f(x) = ℓ(a, y; φ(x)).

w is the random seed variable representing a sample. (A set of realizations {w_[j]}_{j=1}^N corresponds to a sample set {(a_j, y_j), j = 1, ..., N}.)

The general objective could be written as

R(x) = E[f(x; w)]                    Online setting (expected risk)
R_N(x) = (1/N) ∑_{j=1}^N f_j(x)      Finite-sum setting (empirical risk)

where f_j(x) = f(x; w_[j]) is the loss associated with the j-th sample.

Both can be minimized by the stochastic gradient method.


The stochastic gradient method

Algorithm 1 Stochastic Gradient (SG) method (Robbins and Monro 1951)
1: Choose an initial point x_0 ∈ R^n.
2: for k = 0, 1, ... do
3:   Compute a stochastic gradient g(x_k; w_k).
4:   Choose a step size α_k > 0.
5:   Set the new iterate x_{k+1} = x_k − α_k g(x_k; w_k).
6: end for

[Photo: Sutton Monro, Lehigh ISE]

The stochastic gradient g(x_k; w_k) could be

∇f(x_k; w_k) = ∇f_{j_k}(x_k)                     simple or basic SG
(1/|B_k|) ∑_{j ∈ B_k} ∇f(x_k; w_{k,j})           mini-batch SG (|B_k| is the mini-batch size)

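The following self-contained sketch (illustrative settings only, not from the talk) applies Algorithm 1 with mini-batch stochastic gradients to the regularized logistic loss of the previous section:

```python
import numpy as np

def sg_method(A, y, lam=0.1, batch_size=8, iters=2000, seed=0):
    """Mini-batch SG on the regularized logistic loss; x = (r, b)."""
    rng = np.random.default_rng(seed)
    N, n = A.shape
    x = np.zeros(n + 1)
    for k in range(iters):
        B = rng.integers(0, N, size=batch_size)    # sample the mini-batch B_k
        r, b = x[:n], x[n]
        z = A[B] @ r - b
        s = 1.0 / (1.0 + np.exp(y[B] * z))         # sigma(-y z)
        g_r = -(A[B].T @ (y[B] * s)) / batch_size + lam * r
        g_b = (y[B] * s).mean()
        g = np.concatenate([g_r, [g_b]])           # stochastic gradient g(x_k; w_k)
        alpha = 1.0 / (lam * (k + 1))              # diminishing step size, O(1/k)
        x = x - alpha * g                          # x_{k+1} = x_k - alpha_k g(x_k; w_k)
    return x

# Synthetic example (the data-generating model below is an arbitrary choice):
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5))
y = np.sign(A @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) + 0.3 * rng.standard_normal(500))
print("trained (r, b):", np.round(sg_method(A, y), 2))
```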

Classical assumptions for the SG method

1 The sequence of realizations {w_k} is an i.i.d. sample of w.

2 The function R has Lipschitz continuous gradient.

3 The stochastic gradient is an unbiased estimate:

E_{w_k}[g(x_k; w_k)] = ∇R(x_k)

4 The stochastic gradient has bounded variance:

V_{w_k}[g(x_k; w_k)] ≤ M + O(‖∇R(x_k)‖²)

5 The function R is bounded below: R(x) ≥ R_∗, ∀x, where R_∗ = R(x_∗) and x_∗ is a minimizer of R.

E[·] denotes the expected value taken w.r.t. the joint distribution of the w_k.

Convergence rates for the SG method

The strongly convex case (Sacks 1958)
With a diminishing step-size sequence (e.g., α_k ≃ O(1/k)), one has
E[R(x_k)] − R_∗ ≤ O(1/k)

The convex case (Nemirovski and Yudin 1978)
With a diminishing step-size sequence (e.g., α_k ≃ O(1/√k)), one has
E[R_best^k] − R_∗ ≤ O(1/√k)
where R_best^k = min_{1≤i≤k} R(x_i).



Multi-Objective Optimization

Multi-Objective Optimization (MOO) deals with multiple, potentially conflicting objectives simultaneously.

We consider smooth MOO problems of the general form

min  H(x) = (h_1(x), ..., h_m(x))
s.t. x ∈ X

where the h_i: R^n → R are smooth functions and X is a feasible region.

Definition of optimality for MOO

x dominates y if H(x) < H(y) componentwise, for x, y ∈ X.

x is a Pareto minimizer if it is not dominated by any other point in X. It is also called a non-dominated or efficient point.

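A small sketch (illustration only, not from the talk) of the dominance relation and of extracting the non-dominated points from a finite set of objective vectors:

```python
import numpy as np

def dominates(Hx, Hy):
    """True if Hx < Hy componentwise (the strict dominance defined above)."""
    return bool(np.all(Hx < Hy))

def nondominated(values):
    """Keep the objective vectors not dominated by any other one in the list."""
    return [v for i, v in enumerate(values)
            if not any(dominates(u, v) for j, u in enumerate(values) if j != i)]

vals = [np.array(v) for v in [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]]
print(nondominated(vals))    # (3, 3) is dominated by (2, 2); the other three remain
```

The same filter reappears later in the Pareto-front algorithm, where dominated points are removed from the working list at every iteration.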

Pareto front

[Figure: Pareto front of problem SP1 in the (f_1, f_2) plane.]

Denote by P the set of Pareto minimizers.

The Pareto front is the image of P in the objective space R^m, i.e.,

H(P) = {H(x) : x ∈ P}

The goal of MOO: find the set of Pareto minimizers, and hence the Pareto front, to define the best trade-off among several competing criteria.


Necessary condition for Pareto minimizers

Pareto first-order stationarity condition: x is a Pareto stationary point if

∄ d ∈ R^n such that  [∇h_1(x)⊤; ...; ∇h_m(x)⊤] d < 0

or, equivalently, if the convex hull of the ∇h_i(x)'s contains the origin:

∃ λ ∈ Δ^m such that  ∑_{i=1}^m λ_i ∇h_i(x) = 0

where Δ^m = {λ : ∑_{i=1}^m λ_i = 1, λ_i ≥ 0, ∀i = 1, ..., m} is the simplex.

Note: when all the functions are convex, x ∈ P iff x is Pareto first-order stationary.


Overview of methods for MOO

Methods with a priori preferences (Ehrgott 2005; Miettinen 2012)

Weighted-sum method

min S(x; a) = ∑_{i=1}^m a_i f_i(x), where a ∈ Δ^m

Limitations: it is hard to preselect weights for objectives of different magnitudes, and it cannot find Pareto minimizers in non-convex regions of the front.

ε-constraint method

min f_i(x)  s.t.  f_j(x) ≤ ε_j, ∀ j ≠ i

where the ε_j ≥ min_{x∈X} f_j(x) are upper bounds.

Limitations: inappropriate upper bounds lead to infeasibility.

Common issues: a single run outputs one nondominated point, and multiple runs produce a poorly distributed Pareto front.

Overview of methods for MOO

Methods with a posteriori preferences

Population-based heuristic methods (no convergence proofs), e.g.,

NSGA-II (genetic algorithms) (Deb et al. 2002)

AMOSA (simulated annealing) (Bandyopadhyay et al. 2008)

Convergent methods, with proved convergence to Pareto stationary points:

the multi-gradient method (Fliege and Svaiter 2000)

Newton's method for MOO (Fliege et al. 2009)

the direct multi-search algorithm (Custodio et al. 2011), etc.

Advantage: able to construct the whole, well-spread Pareto front.

“Steepest” common descent direction

The multi-gradient algorithm iterates x_{k+1} = x_k + α_k d_k.

Subproblem 1 (Fliege and Svaiter 2000):

d_k ∈ argmin_{d ∈ R^n} max_{1≤i≤m} −∇h_i(x_k)⊤ d + (1/2)‖d‖²

Note: d_k = 0 ∈ R^n iff x_k is Pareto stationary.

Subproblem 2: the dual problem of Subproblem 1

λ_k ∈ argmin_{λ ∈ R^m} ‖ ∑_{i=1}^m λ_i ∇h_i(x_k) ‖²  s.t. λ ∈ Δ^m

Then d_k = −∑_{i=1}^m (λ_k)_i ∇h_i(x_k) is a common descent direction.

Note: when m = 1, one recovers d_k = −∇h_1(x_k).

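Subproblem 2 is a least-squares problem over the simplex. A minimal numerical sketch (assuming NumPy and SciPy; an illustration, not the authors' implementation) using a general-purpose solver:

```python
import numpy as np
from scipy.optimize import minimize

def common_descent_direction(grads):
    """Solve Subproblem 2 for the rows of grads (the gradients of h_1, ..., h_m at x)."""
    m = grads.shape[0]
    G = grads @ grads.T                          # Gram matrix: objective is lambda^T G lambda
    res = minimize(lambda lam: lam @ G @ lam, np.full(m, 1.0 / m),
                   jac=lambda lam: 2.0 * G @ lam,
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    lam = res.x
    d = -grads.T @ lam                           # common descent direction; d = 0 certifies stationarity
    return lam, d

# Two conflicting gradients in R^2:
lam, d = common_descent_direction(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(np.round(lam, 3), np.round(d, 3))          # lam ~ [0.5, 0.5], d ~ [-0.5, -0.5]
```

For m = 2 the solution is also available in closed form, which is used in the later sketches.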


Stochastic Multi-Objective Optimization

Consider a stochastic multi-objective problem

min  F(x) = (f_1(x), ..., f_m(x)) = (E[f_1(x; w)], ..., E[f_m(x; w)])
s.t. x ∈ X

where w ∈ R^{m×p} are random parameters obeying a certain distribution.

Subproblem 3: replace ∇f_i(x_k) by an estimate g_i(x_k; w_k) in Subproblem 2,

λ_g(x_k; w_k) ∈ argmin_{λ ∈ R^m} ‖ ∑_{i=1}^m λ_i g_i(x_k; w_k) ‖²  s.t. λ ∈ Δ^m

Then g(x_k; w_k) = ∑_{i=1}^m (λ_k^g)_i g_i(x_k; w_k) denotes the stochastic multi-gradient.

The stochastic multi-gradient (SMG) algorithm: x_{k+1} = x_k − α_k g(x_k; w_k)


The stochastic multi-gradient method

Consider using an orthogonal projection P_X when minimizing over a closed and convex set X ⊆ R^n.

Algorithm 2 Stochastic Multi-Gradient (SMG) Algorithm
1: Choose an initial point x_0 ∈ R^n and a step-size sequence {α_k}_{k∈N} > 0.
2: for k = 0, 1, ... do
3:   Compute the stochastic gradients g_i(x_k; w_k) for i = 1, ..., m.
4:   Solve Subproblem 3 to obtain the stochastic multi-gradient
       g(x_k; w_k) = ∑_{i=1}^m (λ_k^g)_i g_i(x_k; w_k), with λ_k^g ∈ Δ^m.
5:   Update the next iterate x_{k+1} = P_X(x_k − α_k g(x_k; w_k)).
6: end for

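To make the scheme concrete, here is a self-contained toy sketch (an illustration under simplifying assumptions, not the authors' implementation): two quadratic objectives with unbiased noisy gradient estimates, the closed-form solution of Subproblem 3 for m = 2, and projection onto a box X:

```python
import numpy as np

def lambda_for_two(g1, g2):
    """Closed-form Subproblem 3 for m = 2: minimize ||t g1 + (1-t) g2||^2 over t in [0, 1]."""
    diff = g1 - g2
    den = diff @ diff
    t = 0.5 if den == 0 else float(np.clip(-(diff @ g2) / den, 0.0, 1.0))
    return np.array([t, 1.0 - t])

def smg(x0, centers, box, iters=500, seed=0):
    """SMG on f_i(x) = ||x - centers[i]||^2 / 2 with noisy gradients, projected onto a box X."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    lo, hi = box
    for k in range(iters):
        grads = np.array([x - c + 0.1 * rng.standard_normal(x.size) for c in centers])
        lam = lambda_for_two(grads[0], grads[1])         # Subproblem 3
        g = grads.T @ lam                                # stochastic multi-gradient
        x = np.clip(x - (1.0 / (k + 1)) * g, lo, hi)     # projected step with alpha_k = 1/(k+1)
    return x

centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
print(np.round(smg([5.0, -5.0], centers, box=(-10.0, 10.0)), 2))
# ends up near the Pareto set (the segment between the two centers)
```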

Subproblem 3 illustration

Consider the case m = 2, n = 2.

[Figure: the true gradients ∇f_1(x) and ∇f_2(x), two realizations of the stochastic gradient pairs, and the resulting stochastic multi-gradients g^1 and g^2 around the true multi-gradient g.]

g^1 and g^2 are two stochastic multi-gradients obtained by solving Subproblem 3. They are estimates of the true multi-gradient g (from Subproblem 1 or 2).

Biasedness of the stochastic multi-gradient

Denote

S(x; λ) = ∑_{i=1}^m λ_i f_i(x)
∇_x S(x; λ) = ∑_{i=1}^m λ_i ∇f_i(x)

Even under the classical assumption in SG,

E_w[g_i(x; w)] = ∇f_i(x), ∀ i = 1, ..., m,

it turns out that g(x; w) is a biased estimate, i.e.,

E_w[g(x; w)] ≠ ∇_x S(x; λ)

where λ are the true coefficients from Subproblem 2. We also have

E_w[g(x; w)] ≠ E_w[∇_x S(x; λ_g)]


Biasedness illustration

The biasedness, using either the true coefficients λ or λ_g, decreases as the batch size increases and eventually vanishes in the full-batch setting.

[Figure: ℓ₂ norm of the expected error, computed with λ and with λ_g, as a function of the batch size; both curves decay towards zero and are essentially zero for batch sizes in the range [2000, 3000].]
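The effect can be reproduced with a toy Monte Carlo sketch (illustrative only, not the experiment behind the figure above): estimate E_w[g(x; w)] by averaging stochastic multi-gradients built from noisy gradients, and compare it with ∇_x S(x; λ) computed from the true gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]   # f_i(x) = ||x - c_i||^2 / 2

def lam_two(g1, g2):
    # closed-form Subproblem 2/3 for m = 2
    diff = g1 - g2
    den = diff @ diff
    t = 0.5 if den == 0 else float(np.clip(-(diff @ g2) / den, 0.0, 1.0))
    return np.array([t, 1.0 - t])

true_grads = np.array([x - c for c in centers])
grad_S = true_grads.T @ lam_two(*true_grads)             # grad_x S(x; lambda), true lambda

for batch in (1, 10, 100, 1000):
    estimates = []
    for _ in range(5000):                                # Monte Carlo estimate of E_w[g(x; w)]
        noisy = [g + rng.standard_normal(2) / np.sqrt(batch) for g in true_grads]
        estimates.append(np.array(noisy).T @ lam_two(*noisy))
    bias = np.linalg.norm(np.mean(estimates, axis=0) - grad_S)
    print(f"batch size {batch:5d}:  bias {bias:.4f}")
```

Even though each noisy gradient is unbiased, the weights λ_g depend on the same noisy gradients, so their combination is biased; the printed bias shrinks as the (simulated) batch size grows, mirroring the figure.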

Assumptions for convergence

1 All objective functions f_i: R^n → R are continuously differentiable with Lipschitz continuous gradients ∇f_i.

2 The feasible region X ⊆ R^n is bounded.

3 (Unbiasedness) E_w[g_i(x; w)] = ∇f_i(x).

4 (Bound on the biasedness) There exist M_1, M_F > 0 such that

‖E_w[g(x; w) − ∇_x S(x; λ_g)]‖ ≤ α (M_1 + M_F ‖E_w[∇_x S(x; λ_g)]‖)

(this can be guaranteed by dynamic sampling).

5 (Bound on the second moment) There exist G, G_V > 0 such that

E_w[‖g(x; w)‖²] ≤ G² + G_V² ‖E_w[∇_x S(x; λ_g)]‖²

Sublinear rate in the strongly convex case

Theorem (individual f_i strongly convex; c is the maximum of the constants)

Assume λ_t → λ_∗. Let x_∗ ∈ P be associated with λ_∗.

Considering a diminishing step-size sequence α_t = 2/(c(t+1)), we have

min_{t=1,...,k} E[S(x_t; λ_t)] − E[S(x_∗; λ̄_k)] ≤ O(1/k)

where λ̄_k = ∑_{t=1}^k ( t / ∑_{s=1}^k s ) λ_t ∈ Δ^m.

Then, we have min_{1≤t≤k} E[S(x_t; λ_t)] → E[S(x_∗; λ_∗)].

Rate in the strongly convex case: stronger assumption

Assume λ_k is a better approximation to λ_∗.

We know that ∇_x S(x_∗; λ_∗)⊤(x_k − x_∗) ≥ 0.

[Figure: the Pareto front in the (f_1, f_2) plane, with the points corresponding to λ_∗ and λ_k.]

Assumption: for any x_k ∈ X, one has

∇_x S(x_∗; λ_k)⊤(x_k − x_∗) ≥ 0

where λ_k are the true coefficients obtained by solving Subproblem 2.


Rate in the strongly convex case: stronger assumption

Theorem (SL & LNV 2019)

Let x_∗ be the Pareto minimizer associated with λ_∗.

Considering a step-size sequence α_k = γ/k with γ > 1/(2c), we have

E[‖x_k − x_∗‖²] ≤ O(1/k)

and

E[S(x_k; λ_∗)] − E[S(x_∗; λ_∗)] ≤ O(1/k)


Pareto-Front stochastic multi-gradient method

The Pareto-Front version of SMG is designed to obtain a complete Pareto front in a single run.

Algorithm 3 Pareto-Front Stochastic Multi-Gradient (PF-SMG) Algorithm
1: Generate a list of starting points L_0. Select r, p, q ∈ N.
2: for k = 0, 1, ... do
3:   Set L_{k+1} = L_k.
4:   for each point x in the list L_{k+1} do
5:     for t = 1, ..., r do
6:       Add x + w_t to the list L_{k+1}, where w_t is a realization of w_k.
7:   for each point x in the list L_{k+1} do
8:     for t = 1, ..., p do
9:       Apply q iterations of the SMG algorithm starting from x.
10:      Add the final output point x_q to the list L_{k+1}.
11:  Remove all the dominated points from L_{k+1}.
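A compact, self-contained sketch of this outer loop on the toy two-quadratic example used earlier (all settings below, including r, p, q, are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
F = lambda x: np.array([0.5 * np.sum((x - c) ** 2) for c in centers])   # (f_1, f_2)

def lam_two(g1, g2):
    diff = g1 - g2
    den = diff @ diff
    t = 0.5 if den == 0 else float(np.clip(-(diff @ g2) / den, 0.0, 1.0))
    return np.array([t, 1.0 - t])

def smg_run(x, q):
    """q iterations of SMG with noisy gradients of the two quadratics."""
    for k in range(q):
        grads = np.array([x - c + 0.1 * rng.standard_normal(2) for c in centers])
        x = x - (0.5 / (k + 1)) * (grads.T @ lam_two(*grads))
    return x

def pareto_filter(points):
    """Remove the points whose objective vector is strictly dominated."""
    V = np.array([F(x) for x in points])
    dominated = np.any(np.all(V[None, :, :] < V[:, None, :], axis=2), axis=1)
    return [x for x, dom in zip(points, dominated) if not dom]

L = [rng.uniform(-3.0, 3.0, size=2) for _ in range(3)]      # starting list L_0
r, p, q = 3, 2, 15
for k in range(3):
    L = L + [x + 0.3 * rng.standard_normal(2) for x in L for _ in range(r)]   # perturb
    L = L + [smg_run(x, q) for x in L for _ in range(p)]                      # SMG restarts
    L = pareto_filter(L)                                                      # drop dominated points

print(len(L), "non-dominated points; a few objective values:")
print(np.round([F(x) for x in L[:5]], 2))
```

The surviving points approximate the Pareto front of the two quadratics (the segment between the two centers in decision space).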

PF-SMG illustration (r = 3, p = 2)

[Figure sequence in the (f_1, f_2) plane: generate starting points; add perturbed points; apply SMG multiple times; remove dominated points; after a certain number of iterations, the surviving points spread along the Pareto front.]

Numerical results: logistic regression problems

Motivation: consider a set of data points in 2D.

[Figure: 2D data points with labels y = 1 and y = −1, and two linear classifiers: one fit for the entire group and one for the minority group.]

Data points marked with different symbols may be collected from different sources/groups.

Key idea: design a MOO problem to identify the existence of bias in the data and to define the best trade-off if data bias exists.


Recall the logistic (prediction) loss function

f(r, b) = (1/N) ∑_{j=1}^N log(1 + e^{−y_j(r⊤a_j − b)}) + (λ/2)‖r‖²

where {(a_j, y_j)}_{j=1}^N are i.i.d. feature/label pairs sampled from a certain joint probability distribution of (A, Y).

Testing data sets are selected from LIBSVM (Chang and Lin 2011).

Split a data set into two groups according to a binary feature, and let J_1 and J_2 be the two index sets. A two-objective problem is constructed as

min_{r,b}  (f_1(r, b), f_2(r, b))

where

f_i(r, b) = (1/|J_i|) ∑_{j ∈ J_i} log(1 + e^{−y_j(r⊤a_j − b)}) + (λ_i/2)‖r‖²
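A minimal sketch of how the two group-wise objectives can be formed (the synthetic data and the name binary_feature_index below are illustrative assumptions, not the LIBSVM setup of the experiments):

```python
import numpy as np

def group_logistic_losses(A, y, r, b, binary_feature_index, lams=(0.01, 0.01)):
    """Return (f_1(r, b), f_2(r, b)) for the two groups defined by a binary feature."""
    groups = (A[:, binary_feature_index] <= 0, A[:, binary_feature_index] > 0)   # index sets J_1, J_2
    vals = []
    for J, lam in zip(groups, lams):
        z = A[J] @ r - b
        vals.append(np.log1p(np.exp(-y[J] * z)).mean() + 0.5 * lam * (r @ r))
    return tuple(vals)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 4))
A[:, 0] = np.sign(A[:, 0])                      # make feature 0 binary (-1 / +1)
y = np.sign(A @ np.array([0.5, 1.0, -1.0, 0.0]) + 0.2 * rng.standard_normal(200))
f1, f2 = group_logistic_losses(A, y, r=np.zeros(4), b=0.0, binary_feature_index=0)
print(round(f1, 3), round(f2, 3))               # both equal log(2) at the zero classifier
```

Feeding the stochastic gradients of f_1 and f_2 to PF-SMG then traces out the trade-off curve between the two groups, which is what the Pareto fronts below display.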

Approximated Pareto fronts for the multi-objective logistic regression problems:

[Figure: approximated Pareto fronts in the (f_1(x), f_2(x)) plane for four LIBSVM data sets: (a) heart, (b) australian, (c) svmguide3, (d) german.numer.]

Consistently, the wider Pareto fronts of heart and svmguide3 indicate a higher distinction between the two groups.

Two implications:

Given groups of data instances for the same problem, one can evaluate the bias by observing the range of the Pareto fronts.

A new data instance (of unknown group) can be classified more accurately by selecting a set of nondominated points.



Conclusions

Established sublinear convergence rates, O(1/k) for the strongly convex case and O(1/√k) for the convex case, in terms of a weighted-sum function.

Designed the PF-SMG algorithm, which is robust and efficient at generating well-spread and sufficiently accurate Pareto fronts.

In logistic binary classification, developed a novel tool for identifying bias among potentially different sources of data.

S. Liu and L. N. Vicente, The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning, ISE Technical Report 19T-011, Lehigh University.

Future directions

Have a (probabilistic?) result for determining the whole Pareto front.

Investigate variance reduction techniques.

Deal with nonconvexity, nonsmoothness, general constraints, ...

Expand the use of MOO in machine learning:

Handling discrimination and unfairness (Calders et al. 2009; Hardt et al. 2016).

Conflicting robotic learning.