
Optimization Models for Machine Learning: A Survey

Claudio Gambella 1   Bissan Ghaddar 2   Joe Naoum-Sawaya 2

1 IBM Research Ireland, Mulhuddart, Dublin 15, Ireland
2 Ivey Business School, University of Western Ontario, London, Ontario N6G 0N1, Canada

Abstract

This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks, as well as new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.

Keywords: Machine Learning, Mathematical Optimization, Regression, Classification, Clustering, Deep Learning

Contents

1 Introduction
2 Regression Models
  2.1 Shrinkage methods
  2.2 Dimension Reduction
    2.2.1 Principal Components
    2.2.2 Partial Least Squares
  2.3 Non-Linear Models for Regression


3 Classification
  3.1 K-Nearest Neighbors
  3.2 Logistic Regression
  3.3 Linear Discriminant Analysis
  3.4 Decision Trees
  3.5 Support Vector Machines
    3.5.1 Hard Margin SVM
    3.5.2 Soft-Margin SVM
    3.5.3 Sparse SVM
    3.5.4 The Dual Problem and Kernel Tricks
4 Clustering
  4.1 Minimum Sum-Of-Squares Clustering (a.k.a. k-Means Clustering)
  4.2 Capacitated Clustering
  4.3 k-Hyperplane Clustering
5 Deep Learning
  5.1 DNN Training with Mixed Integer Programming
  5.2 Adversarial Learning
    5.2.1 Targeted attacks
    5.2.2 Untargeted attacks
    5.2.3 Models resistant to adversarial attacks
  5.3 Data poisoning
  5.4 Activation ensembles
6 Emerging paradigms
  6.1 Machine Teaching
  6.2 Empirical Model Learning
7 Useful resources
8 Conclusion

1 Introduction

The pursuit to create intelligent machines that can match and potentially rival humans in reasoning and making intelligent decisions goes back to at least the early days of the development of digital computing in the late 1950s [175]. The goal is to enable machines to perform cognitive functions by learning from past experiences and then solving complex problems under conditions that vary from past observations. Fueled by the exponential growth in computing power and data collection, coupled with widespread practical applications, machine learning is nowadays a field of strategic importance.

Broadly speaking, machine learning relies on learning a model that returns the correct output given a certain input. The inputs, i.e. predictor measurements, are typically numerical values that represent the parameters that define a problem while the output, i.e. response, is a numerical value that represents the solution. Machine learning models fall into two categories: supervised and unsupervised learning [90, 116]. In supervised learning, a response measurement is available for each observation of predictor measurements and the aim is to fit a model that accurately predicts the response of future observations. In unsupervised learning on the other hand, response variables are not available and the goal of learning is to understand the underlying characteristics of the observations. The fundamental theory of these learning models and consequently their success can be largely attributed to research at the interface of computer science, statistics, and operations research. The relation between machine learning and operations research can be viewed along three dimensions: (a) machine learning applied to management science problems, (b) machine learning to solve optimization problems, (c) machine learning problems formulated as optimization problems.

Leveraging data in business decision making is nowadays mainstream as any business in today's economy is instrumented for data collection and analysis. While the aim of machine learning is to generate reliable predictions, management science problems deal with optimal decision making. Thus methodological developments that can leverage data predictions for optimal decision making are an area of research that is critical for future business value [30, 128]. Another area of research at the interface of machine learning and operations research is using machine learning to solve hard optimization problems and particularly NP-hard integer constrained optimization [25, 41, 126, 124, 140, 185]. In that domain, machine learning models are introduced to complement existing approaches that exploit combinatorial optimization through structure detection, branching, and heuristics. Lastly, the training of machine learning models can be naturally posed as an optimization problem with typical objectives that include optimizing training error, measure of fit, and cross-entropy [42, 43, 71, 190]. In fact, the widespread adoption of machine learning is in part attributed to the development of efficient solution approaches for these optimization problems which enabled the training of machine learning models. As we review in this paper, the development of these optimization models has largely been concentrated in areas of computer science, statistics, and operations research; however, diverging publication outlets, standards, and terminology persist.

The aim of this paper is thus to present a unifying framework for machine learning as optimization problems. For that, in addition to publications in classical operations research journals, this paper surveys artificial intelligence conferences and journals, such as the conference on Neural Information Processing Systems and the International Conference on Machine Learning. Furthermore, since machine learning research has rapidly accelerated with many important papers still in the review process, this paper also surveys a considerable number of relevant papers that are available on the arXiv repository. This paper also complements the recent surveys of [43, 71, 190] which have largely focused on the methodological developments for solving several classes of machine learning optimization problems. Particularly, this paper presents optimization models for regression, classification, clustering, and deep learning (including adversarial attacks), as well as new emerging paradigms such as machine teaching and empirical model learning. Additionally, this paper highlights the strengths and the shortcomings of the models from a mathematical optimization perspective and discusses potential novel research directions.

Following this introductory section, regression models are discussed in Section 2 while classification and clustering models are presented in Sections 3 and 4, respectively. Deep learning models are presented in Section 5 while Section 6 discusses new emerging paradigms that include machine teaching and empirical model learning. Section 7 summarizes the datasets and software systems that are commonly used in the literature. Finally, Section 8 concludes.

2 Regression Models

Linear regression models are widely known approaches in supervised learning for predicting a quantitative response. The central assumption is that the dependence of the response (real-valued output) on the independent variables (feature measurements, or predictors) is representable by a linear function (the regression function) with reasonable accuracy. Linear regression models have been largely adopted since the early era of statistics, and they remain of considerable interest, given their simplicity, their extensive range of applications, and their ease of interpretability. Linear regression aims to find a regression function f that expresses the linear relation between the input vector X^T = (X_1, . . . , X_p) and the real-valued output Y via the regression coefficients β_j such that


\[ Y = f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j. \qquad (1) \]

In order to estimate the parameters β_j, one needs training examples x_1, . . . , x_n with labels y_1, . . . , y_n, and the objective is to minimize a measure of loss due to the prediction on the training dataset. The most popular estimate is the least squares estimate, which minimizes the residual sum of squares (RSS) between the labels and the predicted outputs:

\[ \mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2. \qquad (2) \]

The least squares estimate is known to have the smallest variance among all linear unbiased estimates, and it has a closed form solution. However, this choice is not always ideal, since it can yield a model with low prediction accuracy, due to a large variance, and often leads to a large number of non-zero regression coefficients (i.e., low interpretability). Sections 2.1, 2.2 and 2.3 present some of the most important alternatives to the least squares estimate.

The process of gathering input data is often affected by noise, which can impact the accuracy of statistical learning methods. A model that takes into account the adversarial noise applied to feature measurements in linear regression problems is presented in [28], which also investigates the relationship between regularization and robustness to noise. The noise is assumed to vary in an uncertainty set U ⊆ R^{n×p}, and the defender adopts the robust perspective:

\[ \min_{\beta \in \mathbb{R}^p} \ \max_{\Delta \in U} \ g\big(y - (X + \Delta)\beta\big) \qquad (3) \]

where g is a convex function that measures the residuals (e.g., a norm function). The characterization of the uncertainty set U directly influences the complexity of Problem (3). Adversarial learning is extensively discussed in the context of neural networks in Section 5.2. The design of high-quality linear regression models requires several desirable properties, which are often conflicting and not simultaneously implementable. A fitting procedure based on Mixed Integer Quadratic Programming (MIQP) is presented in [31] and takes into account sparsity, joint inclusion of subsets of features (called selective sparsity), robustness to noisy data, stability against outliers, modeler expertise, statistical significance, and low global multicollinearity. Mixed integer programming (MIP) models for regression and classification tasks are also investigated in [34]. The regression problem is modeled as an assignment of data points to groups with the same regression coefficients.

2.1 Shrinkage methods

Shrinkage methods seek to diminish the value of the regression coefficients. The aim is to obtain a more interpretable model (with fewer relevant features), at the price of introducing some bias in the model determination. A well-known shrinkage method is Ridge regression, where a norm-2 penalization on the regression coefficients is added to the loss function such that

\[ \mathrm{RSS}_{\mathrm{ridge}}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (4) \]

where λ controls the magnitude of shrinkage. Ridge regression can be equivalently expressed as

\[ \min_{\beta} \ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \qquad (5) \]
\[ \text{s.t.} \ \sum_{j=1}^{p} \beta_j^2 \le t. \qquad (6) \]

The solution of (5)–(6) can be expressed in closed form, in terms of the singular value decomposition of matrix X. Instead of the conic constraint (6), lasso regression penalizes the norm-1 of the regression coefficients, and the following quadratic programming problem is obtained

\[ \min_{\beta} \ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \qquad (7) \]
\[ \text{s.t.} \ \sum_{j=1}^{p} |\beta_j| \le t. \qquad (8) \]
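To make the shrinkage estimators concrete, the following is a minimal sketch using scikit-learn's Ridge and Lasso estimators (assumed to be available); the penalty weight alpha plays the role of λ in (4), and the constrained forms (5)–(6) and (7)–(8) correspond to these penalized forms for a suitable budget t.

```python
# Minimal sketch: ridge and lasso fits on synthetic data using scikit-learn.
# alpha plays the role of the shrinkage weight lambda in (4); the constrained
# problems (5)-(6) and (7)-(8) correspond to these penalized forms for some t.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]            # only 3 relevant predictors
y = X @ beta_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)          # norm-2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)          # norm-1 penalty: drives some coefficients to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("lasso non-zeros   :", np.count_nonzero(lasso.coef_))
```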

Ridge and lasso regression belong to a class of techniques to achieve sparse regression. As discussed in [33], the sparse regression problem can be formulated as the best subset selection problem

\[ \min_{\beta \in \mathbb{R}^p} \ \frac{1}{2\gamma}\|\beta\|_2^2 + \frac{1}{2}\|Y - X\beta\|_2^2 \qquad (9) \]
\[ \text{s.t.} \ \|\beta\|_0 \le k \qquad (10) \]


where γ > 0 weights the Tikhonov regularization term and k is an upper bound on the number of predictors with a non-zero regression coefficient, i.e., the predictors to select. Problem (9)–(10) is NP-hard due to the cardinality constraint (10). Introducing the binary variables s ∈ S^p_k = {s ∈ {0, 1}^p : 1^T s ≤ k}, the sparse regression problem can be transformed into the following MIP

\[ \min_{\beta \in \mathbb{R}^p, \ s \in S^p_k} \ \frac{1}{2\gamma}\|\beta\|_2^2 + \frac{1}{2}\|Y - X\beta\|_2^2 \qquad (11) \]
\[ \text{s.t.} \ -M s_j \le \beta_j \le M s_j \quad \forall j = 1, \dots, p. \qquad (12) \]

As shown in [33], problem (11)–(12) can be solved using a cutting plane approach. The proposed cutting plane approach is an alternative to the discrete first order algorithms of [32] which solve the MIQP (11)–(12) without regularization terms.
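The following is a small illustration of the best subset selection problem (9)–(10) by plain enumeration of supports, which is only feasible for small p; it is not the cutting plane approach of [33] or the discrete first order algorithms of [32].

```python
# Sketch of best subset selection (9)-(10) by brute-force enumeration of
# supports of size k; only practical for small p.
import numpy as np
from itertools import combinations

def best_subset(X, y, k, gamma=10.0):
    n, p = X.shape
    best_obj, best_beta = np.inf, None
    for subset in combinations(range(p), k):
        idx = list(subset)
        Xs = X[:, idx]
        # ridge-regularized least squares restricted to the chosen support,
        # matching the Tikhonov term (1/(2*gamma))*||beta||^2 in (9)
        A = Xs.T @ Xs + (1.0 / gamma) * np.eye(k)
        beta_s = np.linalg.solve(A, Xs.T @ y)
        resid = y - Xs @ beta_s
        obj = 0.5 / gamma * beta_s @ beta_s + 0.5 * resid @ resid
        if obj < best_obj:
            best_obj = obj
            best_beta = np.zeros(p)
            best_beta[idx] = beta_s
    return best_beta, best_obj

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.05 * rng.normal(size=50)
beta_hat, obj = best_subset(X, y, k=2)
print(np.round(beta_hat, 3), round(obj, 4))
```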

2.2 Dimension Reduction

While shrinkage methods improve model interpretability and retain the original set of p predictors, dimension reduction methods search for M < p linear combinations of the predictors Z_m = Σ_{j=1}^{p} φ_{jm} X_j (also called projections). The M transformed predictors are then fitted in a linear regression model.

2.2.1 Principal Components

Principal Components Analysis (PCA) [118] constructs features with large variance based on the original set of features. In particular, assuming the regressors are standardized to a mean of 0 and a variance of 1, the direction of the first principal component is a unit vector φ_1 ∈ R^p that is the solution of the optimization problem

\[ \max_{\phi_1 \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \qquad (13) \]
\[ \text{s.t.} \ \sum_{j=1}^{p} \phi_{j1}^2 = 1. \qquad (14) \]

Problem (13)–(14) is the traditional formulation of PCA and can be solved via Lagrange multiplier methods. Since the formulation is sensitive to the presence of outliers, several approaches have been proposed to improve robustness [164]. One approach is to replace the L2 norm in (13) with the L1 norm. The first principal component Z_1 = Σ_{j=1}^{p} φ_{j1} X_j is the projection of the original features with the largest variability. The subsequent principal components are obtained iteratively. Each principal component Z_m, m = 2, . . . , p, is the linear combination of the features X_1, . . . , X_p, uncorrelated with Z_1, . . . , Z_{m−1}, which has the largest variance. Introducing the sample covariance matrix S of the regressors X_j, the direction φ_m of the m-th principal component Z_m is the solution of

\[ \max_{\phi_m \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \Big)^2 \qquad (15) \]
\[ \text{s.t.} \ \sum_{j=1}^{p} \phi_{jm}^2 = 1 \qquad (16) \]
\[ \phi_m^T S \phi_l = 0 \quad l = 1, \dots, m-1. \qquad (17) \]

PCA can be used for several data analysis problems which benefit from reducing the problem dimension. Principal Components Regression (PCR) is a two-stage procedure that uses the first principal components as predictors for a linear regression model. PCR has the advantage of including fewer predictors than the original set and of retaining the variability of the dataset in the derived features. In [122], the regression loss function and the PCA objective function are combined in a one-step procedure for PCR. The one-step framework is extended in [123] to other regression models, such as logistic regression, Poisson regression and multiclass logistic regression. For classification models with a large number of features, PCA can be exploited to improve the interpretability of the features by performing classification on the first principal components. Since the identification of the principal components does not require any knowledge of the response y, PCA can also be adopted in unsupervised learning such as in the k-means clustering method (see Section 4.1, [79]).
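A minimal sketch of the two-stage PCR procedure described above, assuming scikit-learn is available: PCA is applied to standardized predictors and a linear regression is then fitted on the first M components.

```python
# Sketch of Principal Components Regression: PCA on standardized predictors
# followed by least squares on the first M derived features Z_1..Z_M.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p, M = 200, 15, 3
X = rng.normal(size=(n, p))
y = X[:, :2].sum(axis=1) + 0.1 * rng.normal(size=n)

Z = PCA(n_components=M).fit_transform(StandardScaler().fit_transform(X))  # stage 1: derived features
pcr = LinearRegression().fit(Z, y)                                        # stage 2: regression on Z
print("training R^2:", round(pcr.score(Z, y), 3))
```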

2.2.2 Partial Least Squares

Partial Least Squares (PLS) identifies transformed features Z_1, . . . , Z_M by taking both the predictors X_1, . . . , X_p and the response Y into account, and is an approach that is specific to regression problems. The components α_{j1} of the first PLS direction are found by fitting a simple regression with response Y and predictor X_j. The approach is viable even for problems with a large number of features, because only one regressor has to be fitted in a simple regression model with one predictor. The first PLS direction points towards the variables that are more strongly related to the response. For computing the second PLS direction, the features X_1, . . . , X_p are first orthogonalized with respect to Z_1 (as per the Gram-Schmidt approach), and then individually fitted in simple regression models with response Y. The process can be iterated for M < p PLS directions. The coefficient of the simple regression of Y onto each original feature X_j can also be computed as the inner product 〈Y, X_j〉. Analogously to PCR, PLS regression fits a linear regression model with regressors Z_1, . . . , Z_M and response Y. While the PCA directions maximize variance, PLS searches for directions V_m = Σ_{j=1}^{p} α_{jm} X_j with both high variance and high correlation with the response. The m-th direction α_m can be found by solving the optimization problem

\[ \max_{\alpha_m} \ \mathrm{Corr}^2(y, X\alpha_m)\,\mathrm{Var}(X\alpha_m) \qquad (18) \]
\[ \text{s.t.} \ \sum_{j=1}^{p} \alpha_{jm}^2 = 1 \qquad (19) \]
\[ \alpha_m^T S \alpha_l = 0 \quad l = 1, \dots, m-1 \qquad (20) \]

where Corr denotes the sample correlation, Var the sample variance, and S the sample covariance matrix of the X_j, and (20) ensures that V_m = Xα_m is uncorrelated with the previous directions V_l = Σ_{j=1}^{p} α_{jl} X_j.
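A short PLS sketch, assuming scikit-learn is available; PLSRegression builds the directions iteratively rather than by solving (18)–(20) directly, but yields comparable regressors Z_1, . . . , Z_M.

```python
# Sketch of PLS regression with M components using scikit-learn, which builds
# directions with high covariance between the projected predictors and the response.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, 0] - X[:, 5] + 0.1 * rng.normal(size=200)

pls = PLSRegression(n_components=3).fit(X, y)
print("training R^2:", round(pls.score(X, y), 3))
print("first direction (leading weights):", np.round(pls.x_weights_[:6, 0], 3))
```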

2.3 Non-Linear Models for Regression

A natural extension of linear regression models is to consider non-linear relationships between the predictors and the response. Among the several non-linear regression models are polynomial regression, step functions, regression splines, smoothing splines and local regression. Alternatively, the Generalized Additive Models (GAMs) [107] maintain the additivity of the original predictors X_1, . . . , X_p, and the relationship between each feature and the response y is expressed by non-linear functions f_j(X_j) such that

\[ y = \beta_0 + \sum_{j=1}^{p} f_j(X_j). \qquad (21) \]

GAMs may increase the flexibility and accuracy of prediction with non-linear components, while maintaining a level of interpretability of the predictors. However, one limitation is given by the additivity of the features in the response. To further increase the model flexibility, one could include predictors of the form X_i × X_j, or consider non-parametric models, such as random forests and boosting. It has been empirically observed that GAMs do not represent well problems where the number of observations is much larger than the number of predictors. In [178] the Generalized Additive Model Selection is introduced to fit sparse GAMs in high dimension with a penalized likelihood approach. The penalty term is derived from the fitting criterion for smoothing splines. Alternatively, [62] proposes to fit a constrained version of GAMs by solving a conic programming problem.

3 Classification

The task of classifying data is to decide the class membership y of an unknown data item x based on an input data set D = {(x_1, y_1), · · · , (x_n, y_n)} of data items x_i with known class memberships y_i. The x_i are usually m-dimensional vectors. This section reviews the common classification approaches that include k-nearest neighbors, logistic regression, linear discriminant analysis, decision trees, and support vector machines.

3.1 K-Nearest Neighbors

Classification based on the k-nearest neighbor (k-NN) algorithm differs from other methods, as this approach uses the data directly, without building a model first. Unlike other supervised learning algorithms, k-NN does not learn an explicit mapping from the training data; it simply uses the training data to make predictions, and thus it is called a non-parametric method. A successful application of k-NN requires a careful choice of the number of nearest neighbors k and of the distance metric. While k-NN is very easy to implement, the results are highly dependent on the choice of k in some cases. By varying k, the model can be made more or less flexible by choosing small or large values of k, respectively. Given a set of training data D = {(x_1, y_1), · · · , (x_n, y_n)} with x_i having m features/dimensions of the same scale and the corresponding output y_i of discrete or continuous values, k-NN computes a test point's distance from each of the training data points. The computed distances are sorted and the k nearest neighbors are selected. The majority rule is then used for classification (output y is discrete) or averaging is used for regression (output y is continuous). Several distance functions can be used. For real-valued features, the Euclidean norm

\[ d(x_i, t_j) = \|x_i - t_j\|_2 = \sqrt{\sum_{l=1}^{m} (x_{il} - t_{jl})^2} \]


is commonly used. The Minkowski distance

\[ d(x_i, t_j) = \|x_i - t_j\|_p = \Big( \sum_{l=1}^{m} (x_{il} - t_{jl})^p \Big)^{1/p} \]

is a more general class of distance functions. For binary-valued features, the Hamming distance is considered, which counts the number of features where the two data points disagree. In binary classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. Weights for each of the features can also be used when computing the distances [19].
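A minimal NumPy sketch of the k-NN rule described above: Euclidean distances to all training points, selection of the k nearest neighbors, and a majority vote.

```python
# Minimal k-NN classifier sketch: Euclidean distances, k nearest neighbors,
# and majority vote for a discrete output y.
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority rule

rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))   # expected: 1
```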

3.2 Logistic Regression

In most problem domains, there is no functional relationship y = f(x) between y and x. In this case, the relationship between x and y has to be described more generally by a probability distribution P(x, y) while assuming that the data set D contains independent samples from P. The optimal class membership decision is to choose the class label y that maximizes the posterior distribution P(y|x). Logistic Regression provides a functional form f and a parameter vector β to express P(y|x) as f(x, β). The parameters β are determined based on the data set D usually by maximum-likelihood estimation [81]. Generally, a logistic regression model calculates the class membership probability for one of the two categories in the data set as

\[ P(1 \mid x, \beta) = \frac{1}{1 + e^{\beta_0 + \beta x}}. \qquad (22) \]

The decision boundary between the two binary classes is formed by a hyperplane whose equation is β_0 + βx = 0. Points at this decision boundary have P(1|x, β) = P(0|x, β) = 0.5. The optimal parameter values β are obtained by maximizing the likelihood Π_{i=1}^{n} P(y_i|x_i, β), which is equivalent to

\[ \min_{\beta} \ -\sum_{i=1}^{n} \log P(y_i \mid x_i, \beta). \qquad (23) \]

First order methods such as gradient descent as well as second order methods such as Newton's method can be applied to optimally solve problem (23).
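A gradient descent sketch for the maximum-likelihood problem (23); note that it uses the common convention P(1|x, β) = 1/(1 + e^{−(β_0 + βx)}), which corresponds to (22) with the sign of the parameters flipped.

```python
# Gradient descent for the negative log-likelihood (23), using the convention
# P(1|x, beta) = sigmoid(beta0 + beta^T x); (22) corresponds to flipping signs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])        # prepend a column of ones for beta0
    beta = np.zeros(p + 1)
    for _ in range(iters):
        p1 = sigmoid(Xb @ beta)                 # P(y=1 | x, beta)
        grad = Xb.T @ (p1 - y) / n              # gradient of the average negative log-likelihood
        beta -= lr * grad
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
beta = fit_logistic(X, y)
print("beta0, beta:", np.round(beta, 2))
```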

To tune the logistic regression model (22), variable selection can be performed where only the most relevant subset of the x variables is kept in the model. A forward selection approach or a backward elimination approach can be applied to add or remove variables, respectively, based on the statistical significance of each of the computed coefficients. Interaction terms can also be added to (22) to further complicate the model at the risk of overfitting the training data. Variable selection can then also be applied to eliminate the non statistically significant interaction terms [90].

3.3 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is an approach for classification and dimensionality reduction. It is often applied to data that contains a large number of features (such as image data) where reducing the number of features is necessary to obtain robust classification. While LDA and PCA share the commonality of dimensionality reduction, LDA tends to be more robust than PCA since it takes into account the data labels in computing the optimal projection matrix [23].

Given a data set with n samples D = {(x_1, y_1), · · · , (x_n, y_n)} and K classes, where x_i ∈ R^p and y_i ∈ {0, 1}^K such that y_i(k) is 1 if x_i belongs to the k-th class and 0 otherwise, the input data is partitioned into K groups {π_k}_{k=1}^{K}, where π_k denotes the sample set of the k-th class, which contains n_k data points. LDA maps the feature space x_i ∈ R^p to a lower dimensional space q_i ∈ R^r (r < p) through a linear transformation q_i = G^T x_i [187]. The class mean of the k-th class is given by μ_k = (1/n_k) Σ_{x_i ∈ π_k} x_i while the global mean is given by μ = (1/n) Σ_{i=1}^{n} x_i. In the projected space, the class mean is given by μ̃_k = (1/n_k) Σ_{q_i ∈ π_k} q_i while the global mean is given by μ̃ = (1/n) Σ_{i=1}^{n} q_i.

The within-class scatter and the between-class scatter evaluate the class separability and are defined as S_w and S_b, respectively, such that

\[ S_w = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} (x_i - \mu_k)(x_i - \mu_k)^T \qquad (24) \]
\[ S_b = \sum_{k=1}^{K} n_k (\mu_k - \mu)(\mu_k - \mu)^T. \qquad (25) \]

The within-class scatter evaluates the spread of the data around the class mean while the between-class scatter evaluates the spread of the class means around the global mean. For the projected data, the within-class and the between-class scatters are given by

\[ \tilde{S}_w = \sum_{k=1}^{K} \sum_{q_i \in \pi_k} (q_i - \tilde{\mu}_k)(q_i - \tilde{\mu}_k)^T = G^T S_w G \qquad (26) \]
\[ \tilde{S}_b = \sum_{k=1}^{K} n_k (\tilde{\mu}_k - \tilde{\mu})(\tilde{\mu}_k - \tilde{\mu})^T = G^T S_b G. \qquad (27) \]

The LDA optimization problem is bi-objective: the within-class scatter should be minimized while the between-class scatter should be maximized. Thus the optimal transformation G can be obtained by maximizing the Fisher criterion (the ratio of the between-class to the within-class scatter)
\[ \max_G \ \frac{|G^T S_b G|}{|G^T S_w G|}. \qquad (28) \]

Note that since the between-class and the within-class scatters are not scalar, the determinant is used to obtain a scalar objective function. As discussed in [92], assuming that S_w is non-singular, the Fisher criterion is optimized by selecting the r largest eigenvalues of S_w^{-1} S_b, and the corresponding eigenvectors G*_1, G*_2, . . . , G*_{(K−1)} form the optimal transformation matrix G* = [G*_1 | G*_2 | . . . | G*_{(K−1)}].

An alternative formulation of the LDA optimization problem is provided in [58] by maximizing the minimum distance between each class center and the total class center. The proposed approach, known as large margin linear discriminant analysis, requires the solution of non-convex optimization problems. A solution approach is also proposed based on solving a series of convex quadratic optimization problems.

LDA can also be applied to data with multiple labels. In the multi-label case, each data point can belong to multiple classes, which is often the case in image and video data (for example an image can be labeled "ear", "dog", "animal"). In [187] the equations of the within-class and the between-class scatters are extended to incorporate the correlation between the labels. Apart from this change, the LDA optimization problem (28) remains the same.
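A NumPy sketch of the Fisher criterion (28): the scatter matrices (24)–(25) are assembled and the projection G is formed from the leading eigenvectors of S_w^{-1} S_b, assuming S_w is non-singular.

```python
# Sketch of LDA via the Fisher criterion: build S_w and S_b as in (24)-(25)
# and take the leading eigenvectors of S_w^{-1} S_b as the projection G.
import numpy as np

def lda_projection(X, y, r):
    classes = np.unique(y)
    mu = X.mean(axis=0)
    p = X.shape[1]
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)                 # within-class scatter (24)
        diff = (mu_k - mu).reshape(-1, 1)
        Sb += Xk.shape[0] * diff @ diff.T                 # between-class scatter (25)
    # eigenvectors of S_w^{-1} S_b (S_w assumed non-singular), sorted by eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:r]].real                     # columns G*_1, ..., G*_r

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
G = lda_projection(X, y, r=1)
print("projection direction:", np.round(G.ravel(), 3))
```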

3.4 Decision Trees

Decision trees are classical models for making a decision or classification using splitting rules organized into a tree data structure. Tree-based methods are non-parametric models that partition the predictor space into sub-regions and then yield a prediction based on statistical indicators (e.g., median and mode) of the segmented training data. Decision trees can be used for both regression and classification problems. For regression trees, the splitting of the training dataset into distinct and non-overlapping regions can be done using a top-down recursive binary splitting procedure. Starting from a single-region tree, one iteratively searches for a (typically univariate) cutpoint s for predictor X_j such that the tree with the two split regions {X | X_j < s} and {X | X_j ≥ s} has the greatest possible reduction in the residual sum of squares
\[ \sum_{i: x_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2, \]
where \bar{y}_R denotes the mean response for the training observations in region R. A multivariate split is of the form {X | a^T X < s}, where a is a vector. Searching for a solution that minimizes the RSS often leads to overfitting. Another optimization criterion is a measure of purity [46], such as Gini's index in classification problems. To limit overfitting, it is possible to prune a decision tree so as to obtain sub-trees minimizing, for example, cost complexity. For classification problems, [46] highlights that, given their greedy nature, the classical methods based on recursive splitting do not lead to the global optimality of the decision tree, which limits the accuracy of decision trees. Since building optimal binary decision trees is known to be NP-hard [112], heuristic approaches based on mathematical programming paradigms, such as linear optimization [26], continuous optimization [27], and dynamic programming [14, 16, 70, 160], have been proposed.
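The following NumPy sketch illustrates a single step of the greedy recursive binary splitting just described (the classical heuristic, not the optimal-tree formulation discussed next): it scans predictors and cutpoints for the split with the smallest resulting RSS.

```python
# One step of top-down recursive binary splitting for a regression tree:
# search over predictors j and cutpoints s minimizing the RSS of the two regions.
import numpy as np

def best_split(X, y):
    n, p = X.shape
    best = (np.inf, None, None)                      # (RSS, predictor j, cutpoint s)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, s)
    return best

rng = np.random.default_rng(7)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 1] < 0.4, 1.0, 5.0) + 0.1 * rng.normal(size=100)
rss, j, s = best_split(X, y)
print(f"split on X_{j} at s={s:.3f} with RSS={rss:.3f}")   # expected: j=1, s near 0.4
```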

To find provably optimal decision trees, [29] propose a mixed-integer programming formulation that has an exponential complexity in the depth of the tree. Given a fixed depth D, the maximum number of nodes is T = 2^{D+1} − 1 and they are indexed by t = 1, . . . , T. Following the notation of [29], the set of nodes is split into two sets, branch nodes and leaf nodes. The branch nodes T_B = {1, . . . , ⌊T/2⌋} apply a linear split a^T x < b where the left branch includes the data that satisfy this split while the right branch includes the remaining data. At the leaf nodes T_L = {⌊T/2⌋ + 1, . . . , T}, a class prediction is made for the data points that are included at that node. In [29], the splits that are applied at the branch nodes are restricted to a single variable with the option of not splitting a node. These conditions are enforced through the following constraints

\[ \sum_{j=1}^{p} a_{jt} = d_t, \quad \forall t \in T_B \]
\[ 0 \le b_t \le d_t, \quad \forall t \in T_B \]
\[ a_{jt} \in \{0,1\}, \quad \forall j = 1, \dots, p, \ \forall t \in T_B \]
\[ d_t \in \{0,1\}, \quad \forall t \in T_B \]


where d_t is a binary variable that indicates whether a split is performed at node t, and a_{jt} and b_t are the feature coefficients and the right-hand side, respectively, of the split at node t.

Since all the descendants of a node that does not apply a split should also not apply a split, the following constraint is applied at all the nodes apart from the root node

\[ d_t \le d_{p(t)}, \quad \forall t \in T_B \setminus \{1\} \]

where p(t) denotes the parent node of node t.

To keep track of the data points, the binary variable z_{it} indicates if data point i is assigned to leaf node t and the binary variable l_t indicates if node t is used or not. The following constraints

\[ z_{it} \le l_t, \quad t \in T_L \]
\[ \sum_{i=1}^{n} z_{it} \ge N_{\min}\, l_t, \quad t \in T_L \]

indicate that data points can be assigned to a node only if that node is used and that if a node is used then at least N_{\min} data points should be assigned to it. Each data point should also be assigned to exactly one leaf, which is enforced by
\[ \sum_{t \in T_L} z_{it} = 1, \quad i = 1, \dots, n. \]

To enforce the splitting of the data points at each of the branch nodes, the following constraints are introduced
\[ a_m^T x_i + \epsilon \le b_m + M_1(1 - z_{it}), \quad i = 1, \dots, n, \ \forall t \in T_L, \ \forall m \in A_L(t), \]
\[ a_m^T x_i \ge b_m - M_2(1 - z_{it}), \quad i = 1, \dots, n, \ \forall t \in T_L, \ \forall m \in A_R(t), \]

where A_L(t) is the set of ancestors of t whose left branch has been followed on the path from the root node to node t, and A_R(t) is the set of ancestors of t whose right branch has been followed on the path from the root node to node t. M_1 and M_2 are large numbers while ε is a small number to enforce the strict split a^T x < b at the left branch (see [29] for finding good values for M_1, M_2, and ε).

Each leaf node that is used should be assigned a label k = 1, . . . , K, and thus the following constraint is enforced
\[ \sum_{k=1}^{K} c_{kt} = l_t, \quad \forall t \in T_L \]


where c_{kt} is a binary variable that indicates if label k is assigned to leaf node t.

The misclassification loss L_t at leaf node t is given by
\[ L_t = N_t - N_{kt} \qquad (29) \]

if node t is assigned label k, where N_t is the total number of data points at leaf node t and N_{kt} is the total number of data points at node t whose true label is k. This is enforced by the constraints
\[ L_t \ge N_t - N_{kt} - n(1 - c_{kt}), \quad k = 1, \dots, K, \ \forall t \in T_L \]
\[ L_t \le N_t - N_{kt} + n c_{kt}, \quad k = 1, \dots, K, \ \forall t \in T_L \]
\[ L_t \ge 0, \quad \forall t \in T_L. \]

The counting of N_t and N_{kt} is enforced by
\[ N_t = \sum_{i=1}^{n} z_{it}, \quad \forall t \in T_L \]
\[ N_{kt} = \frac{1}{2} \sum_{i=1}^{n} (1 + Y_{ik}) z_{it}, \quad k = 1, \dots, K, \ \forall t \in T_L \]

where Y_{ik} is a parameter that takes a value of 1 if the true label of data point i is k and −1 otherwise.

The objectives are to minimize the decision tree complexity, given by Σ_{t∈T_B} d_t, and the normalized total misclassification loss (1/L̂) Σ_{t∈T_L} L_t, where L̂ is the baseline loss obtained by predicting the most popular class for the entire dataset. Both are included in a single objective
\[ \min \ \frac{1}{\hat{L}} \sum_{t \in T_L} L_t + \alpha \sum_{t \in T_B} d_t \]

where α is a tuning parameter. The other hyper-parameters that need to be tuned are the minimum number of data points at each leaf node N_{\min} and the maximum tree depth D. Leveraging hyper-parameter tuning and warm-start techniques, the optimal decision tree problem is shown to be competitive on datasets with thousands of observations and achieves a good accuracy for shallow trees compared to classical greedy approaches [29]. The proposed formulation can also be extended to multivariate splits, without increasing the computational complexity [29].

An alternative formulation of the optimal decision tree problem is provided in [103]. The main difference between the formulations of [103] and [29] is that the approach of [103] is specialized to the case where the features take categorical values. By exploiting the combinatorial structure that is present in the case of categorical variables, [103] provides a strong formulation of the optimal decision tree problem, thus improving the computational performance. Furthermore, the formulation of [103] is restricted to binary classification and the tree topology is fixed, which allows the optimization problems to be solved to optimality in reasonable time.

While single decision tree models are often preferred by data analysts for their high interpretability, the model accuracy can be largely improved by taking multiple decision trees into account with approaches such as bagging, random forests, and boosting. Furthermore, decision trees can also be used in a more general range of applications as algorithms for problem solving, data mining, and knowledge representation. In [15], several greedy and dynamic programming approaches are compared for building decision trees on datasets with inconsistent labels (i.e., the many-valued decision approach). Many-valued decisions can be evaluated in terms of multiple cost functions in a multi-stage optimization [17]. Recently, [60] investigated conflicting objectives in the construction of decision trees by means of bi-criteria optimization. Since optimizing the single objectives, such as minimizing average depth or the number of terminal nodes, is known to be NP-hard, the authors propose a bi-criteria optimization approach by means of dynamic programming.

3.5 Support Vector Machines

Support vector machines (SVMs) are another class of supervised machine learning algorithms that are based on statistical learning and have received significant attention in the optimization literature [55, 182, 183]. Given a training data set of size n where each observation x has m features and a corresponding binary label y ∈ {−1, 1}, the objective of the support vector machine problem is to identify a hyperplane that separates the two classes of data points with a maximal separation margin, measured as the width of the band that separates the two classes. The underlying optimization problem is a convex quadratic optimization problem.

3.5.1 Hard Margin SVM

The most basic version of SVMs is the hard margin SVM, which assumes that there exists a hyperplane w^T x + β = 0 that geometrically separates the data points into the two classes such that no data point is misclassified [67]. The training of the SVM model involves finding the hyperplane that separates the data and whose distance to the closest data point in either of the classes, i.e. the margin, is maximized.

The distance of a particular data point x_i to the separating hyperplane is
\[ \frac{y_i (w^T x_i + \beta)}{\|w\|} \]
where ‖w‖ denotes the l2-norm. The distance to the closest data point is normalized to 1/‖w‖. Thus the data points with labels y = −1 are on one side of the hyperplane such that w^T x + β ≤ −1 while the data points with labels y = 1 are on the other side where w^T x + β ≥ 1. The optimization problem for finding the separating hyperplane is then
\[ \max \ \frac{1}{\|w\|} \quad \text{s.t.} \ \ y_i(w^T x_i + \beta) \ge 1 \quad \forall i \in \{1, \dots, n\} \]

which is equivalent to

\[ \min \ \|w\|^2 \qquad (30) \]
\[ \text{s.t.} \ y_i(w^T x_i + \beta) \ge 1 \quad \forall i \in \{1, \dots, n\} \qquad (31) \]

which is a convex quadratic problem.

Forcing the data to be separable by a linear hyperplane is a strong condition that often does not hold in practice, which motivates the soft-margin SVM that relaxes the condition of perfect separability.

3.5.2 Soft-Margin SVM

When the data is not linearly separable, problem (30)–(31) is infeasible. Thus the soft margin SVM introduces a slack into constraints (31), which allows the data points to be on the wrong side of the hyperplane [67]. This slack is minimized as a proxy to minimizing the number of data points that are on the wrong side. The soft-margin SVM optimization problem is

\[ \min \ \|w\|^2 + C \sum_{i=1}^{n} \epsilon_i \qquad (32) \]
\[ \text{s.t.} \ y_i(w^T x_i + \beta) \ge 1 - \epsilon_i \quad \forall i \in \{1, \dots, n\} \qquad (33) \]
\[ \epsilon_i \ge 0 \quad \forall i \in \{1, \dots, n\}. \qquad (34) \]


Another common alternative is to include the error term ε_i in the objective function by using the squared hinge loss Σ_{i=1}^{n} ε_i^2 instead of the hinge loss Σ_{i=1}^{n} ε_i. The hyperparameter C is then tuned to obtain the best classifier.

Besides the direct solution of problem (32)–(34) as a convex quadratic problem, replacing the l2-norm by the l1-norm leads to a linear optimization problem, generally at the expense of a higher misclassification rate [45].
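A sketch of the soft-margin SVM (32)–(34) solved directly as a convex quadratic program with CVXPY (assumed installed); labels are taken in {−1, +1} and C is the slack penalty.

```python
# Soft-margin SVM (32)-(34) as a convex QP in CVXPY; labels must be in {-1, +1}.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
y = np.array([-1.0] * 25 + [1.0] * 25)
n, m = X.shape
C = 1.0

w = cp.Variable(m)
beta = cp.Variable()
eps = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(eps))       # objective (32)
constraints = [cp.multiply(y, X @ w + beta) >= 1 - eps,            # margin constraints (33)
               eps >= 0]                                            # non-negative slack (34)
cp.Problem(objective, constraints).solve()
print("w =", np.round(w.value, 3), "beta =", round(float(beta.value), 3))
```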

3.5.3 Sparse SVM

Using the l1-norm is also an approach to sparsify w, i.e. to reduce the number of features that are involved in the classification model [45, 194]. An approach known as the elastic net includes both the l1-norm and the l2-norm in the objective function and tunes the bias towards one of the norms through a hyperparameter [188, 199]. The number of features can be explicitly modeled in (32)–(33) by using binary variables [56]. A constraint limiting the number of features to a maximum desired number can be enforced, resulting in a mixed integer quadratic problem [93].

3.5.4 The Dual Problem and Kernel Tricks

The data points can be mapped to a higher dimensional space through a mapping function φ(x) and then a soft margin SVM is applied such that

\[ \min \ \|w\|^2 + C \sum_{i=1}^{n} \epsilon_i \qquad (35) \]
\[ \text{s.t.} \ y_i(w^T \phi(x_i) + \beta) \ge 1 - \epsilon_i \quad \forall i \in \{1, \dots, n\} \qquad (36) \]
\[ \epsilon_i \ge 0 \quad \forall i \in \{1, \dots, n\}. \qquad (37) \]

Through this mapping, the data has a linear classifier in the higher dimensional space; however, a non-linear separation function is obtained in the original space.

To solve problem (35)–(37), the following dual problem is first obtained
\[ \max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \]
\[ \text{s.t.} \ \sum_{i=1}^{n} \alpha_i y_i = 0, \]
\[ 0 \le \alpha_i \le C, \quad \forall i = 1, \dots, n \]


where α_i are the dual variables of constraints (36). Given a kernel function K : R^m × R^m → R where K(x_i, x_j) = φ(x_i)^T φ(x_j), the dual problem is

\[ \max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \]
\[ \text{s.t.} \ \sum_{i=1}^{n} \alpha_i y_i = 0, \]
\[ 0 \le \alpha_i \le C, \quad \forall i = 1, \dots, n \]

which is a convex quadratic optimization problem. Thus only the kernel function K(x_i, x_j) is required while the explicit mapping φ(·) is not needed. Common kernel functions include the polynomial kernel K(x_i, x_j) = (x_i^T x_j + 1)^d where d is the degree of the polynomial, the radial basis function K(x_i, x_j) = e^{−‖x_i − x_j‖^2 / γ}, and the sigmoidal kernel K(x_i, x_j) = tanh(β x_i^T x_j + c) [55, 109].
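A short example of kernelized SVMs using scikit-learn's SVC, which solves the dual problem above for a chosen kernel; note that scikit-learn's gamma parameter corresponds to 1/γ in the radial basis function as written here.

```python
# Kernelized soft-margin SVMs via scikit-learn's SVC, which solves the dual.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)     # circular boundary: not linearly separable

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, degree=3, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))
```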

4 Clustering

Data clustering is a class of unsupervised learning approaches that has been widely used in practice, particularly in applications of data mining, pattern recognition, and information retrieval. Given N unlabeled observations O = {O_1, . . . , O_N}, Cluster Analysis aims at finding M subsets {C_1, . . . , C_M}, called clusters, which are homogeneous and well separated. Homogeneity indicates the similarity of the observations within the same cluster, while the separability accounts for the differences between entities of different clusters. The two concepts can be measured via several criteria and lead to different types of clustering algorithms. The number of clusters is typically a tuning parameter to be fixed before determining the clusters. An extensive survey on data clustering analysis is provided in [115].

In case the entities are points in a Euclidean space, the clustering problem is often modeled as a network problem and shares many similarities with classical problems in operations research, such as the p-median problem. Following the notation in [105], the following is defined:

X: The data matrix X contains the p characteristics of the entities of O, generally modeled as an N × p matrix.

D: The dissimilarities matrix D contains the pair-wise dissimilarities d_{kl} between each pair of data points l and k; it is modeled as an N × N matrix and typically assumes that d_{kl} ≥ 0, d_{kl} = d_{lk} and d_{kk} = 0.


The dissimilarity d_{kl} is commonly a distance metric. The separation of clusters can be measured by a split
\[ s(C_j) = \min_{k: O_k \in C_j, \ l: O_l \notin C_j} d_{kl} \]
or a cut
\[ c(C_j) = \sum_{k: O_k \in C_j} \sum_{l: O_l \notin C_j} d_{kl}, \]

where C_j contains the subset of the data in O that forms cluster j.

The following measures can also be used to evaluate the homogeneity of each cluster C_j:
\[ \text{Diameter:} \quad \delta(C_j) = \max_{k,l: O_k, O_l \in C_j} d_{kl} \]
\[ \text{Radius:} \quad r(C_j) = \min_{k: O_k \in C_j} \ \max_{l: O_l \in C_j} d_{kl} \]
\[ \text{Star:} \quad st(C_j) = \min_{k: O_k \in C_j} \ \sum_{l: O_l \in C_j} d_{kl} \]
\[ \text{Clique:} \quad cl(C_j) = \sum_{k,l: O_k, O_l \in C_j} d_{kl}. \]

If the observations are points of a p-dimensional Euclidean space, the homogeneity of C_j can be expressed in terms of the sum-of-squares of the Euclidean distances between the elements of C_j and its centroid x̄ such that
\[ ss(C_j) = \sum_{k: O_k \in C_j} \|x_k - \bar{x}\|_2^2. \]

The most commonly adopted types of clustering algorithms are:

• Hierarchical Clustering. The clusters C_1, . . . , C_M are such that a hierarchy H = {P_1, P_2, . . . , P_q} of q ≤ N partitions of O satisfies
\[ C_i \in P_k, \ C_j \in P_l, \ k > l \ \Rightarrow \ C_i \subset C_j \ \text{or} \ C_i \cap C_j = \emptyset, \quad \forall i \ne j, \ k, l = 1, 2, \dots, q. \]

• Partitioning Clustering. The clusters C_1, . . . , C_M are such that:

  – C_j ≠ ∅, ∀j = 1, 2, . . . , M
  – C_i ∩ C_j = ∅, ∀i, j = 1, 2, . . . , M, i ≠ j
  – ⋃_{j=1}^{M} C_j = O


Hierarchical clustering methods can be divided into agglomerative, divisive, or global criterion algorithms. In the agglomerative approach, one starts with N singleton clusters which are then iteratively merged until a local criterion is satisfied. One merging algorithm is single-linkage, where the two clusters with the smallest inter-cluster dissimilarity are merged. Alternatively, the complete-linkage algorithm merges the two clusters for which the resulting merged cluster has the smallest diameter.

Divisive hierarchical clustering is done by bi-partitioning an initial cluster containing all the entities. The choice of the local criterion for dividing the clusters affects the complexity of the algorithm, and can lead to NP-hard formulations [162]. Methods searching for a hierarchy that is optimal with respect to a global criterion are still not explored.

Partitioning clustering in the one-dimensional Euclidean space can be solved via Dynamic Programming. Another possibility to find partitioning solutions is to apply branch-and-bound algorithms, for criteria such as sum-of-squares, sum-of-cliques, and sum-of-stars. A direct branch-and-bound approach is introduced in [80] for the problem of seeking a consensus partition, where one wants to minimize the total distance to a given set of partitions. The problem is formulated as

\[ \min \ \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} d_{kl} y_{kl} \qquad (38) \]
\[ \text{s.t.} \ y_{kl} + y_{lq} - y_{kq} \le 1 \quad k = 1, 2, \dots, N-2 \qquad (39) \]
\[ -y_{kl} + y_{lq} + y_{kq} \le 1 \quad l = k+1, k+2, \dots, N-1 \qquad (40) \]
\[ y_{kl} - y_{lq} + y_{kq} \le 1 \quad q = l+1, l+2, \dots, N \qquad (41) \]
\[ y_{kl} \in \{0,1\} \quad k = 1, 2, \dots, N-1, \ l = k+1, k+2, \dots, N, \qquad (42) \]

where y_{kl} = 1 if O_k and O_l belong to the same cluster and y_{kl} = 0 otherwise.

The generic partitioning problem may be expressed as a standard set partitioning problem with a number of columns representing all the possible subsets of the set of observations O such that

\[ \min \ \sum_{t=1}^{2^N - 1} f(C_t) y_t \qquad (43) \]
\[ \text{s.t.} \ \sum_{t=1}^{2^N - 1} a_{jt} y_t = 1 \quad j = 1, 2, \dots, N \qquad (44) \]
\[ \sum_{t=1}^{2^N - 1} y_t = M \qquad (45) \]
\[ y_t \in \{0,1\} \quad t = 1, \dots, 2^N - 1 \qquad (46) \]

where f is a cost function for the clusters, a_{jt} = 1 indicates that O_j belongs to cluster C_t, and y_t is a decision variable associated with each cluster C_t. Mathematical programming approaches can also include the decision on the cluster size, such as maximizing the modularity of the associated graph [51, 52]. In the following, we discuss the commonly used mathematical programming formulations that include Minimum Sum-Of-Squares Clustering, Capacitated Clustering, and k-Hyperplane Clustering.

4.1 Minimum Sum-Of-Squares Clustering (a.k.a. k-Means Clustering)

Minimum Sum-Of-Squares Clustering is one of the most commonly used clustering approaches. It requires finding a number of disjoint clusters for data points A = {a_1, . . . , a_N} in R^n such that the distance to the cluster centroids is minimized. Given that, typically, the number k of clusters is fixed, the problem is also referred to as k-Means Clustering. Defining the binary variables
\[ x_{ij} = \begin{cases} 1 & \text{if data point } i \text{ belongs to cluster } j \\ 0 & \text{otherwise,} \end{cases} \]

the problem is formulated as the following mixed integer nonlinear program [11]
\[ \min \ \sum_{i \le N, \ j \le k} x_{ij} \|a_i - \mu_j\|_2^2 \]
\[ \text{s.t.} \ \sum_{j \le k} x_{ij} = 1 \quad \forall i \le N \]
\[ \mu_j \in \mathbb{R}^n \quad \forall j \le k \]
\[ x_{ij} \in \{0,1\} \quad \forall i \le N, \ \forall j \le k. \]


By adding big-M constraints, the following linearized formulation is obtained

\[ \min \ \sum_{i \le N, \ j \le k} d_{ij} \]
\[ \text{s.t.} \ \sum_{j \le k} x_{ij} = 1 \quad \forall i \le N \]
\[ d_{ij} \ge \|a_i - \mu_j\| - M(1 - x_{ij}) \quad \forall i \le N, \ \forall j \le k \]
\[ \mu_j \in \mathbb{R}^n \quad \forall j \le k \]
\[ x_{ij} \in \{0,1\}, \ d_{ij} \ge 0 \quad \forall i \le N, \ \forall j \le k. \]

Solution approaches have been proposed in [18, 119]. The case where the space is not Euclidean is considered in [54]. Alternatively, [170] presents the Heterogeneous Clustering Problem where the observations to cluster are associated with multiple dissimilarity matrices. The problem is formulated as a mixed-integer quadratically constrained quadratic program.
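A NumPy sketch of Lloyd's algorithm, the classical alternating heuristic for the minimum sum-of-squares clustering problem above; it is not an exact method for the MINLP formulation and may converge to a local optimum.

```python
# Lloyd's algorithm: alternate between assigning points to the nearest centroid
# and recomputing centroids as cluster means.
import numpy as np

def kmeans(A, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point joins the cluster of its closest centroid
        d = np.linalg.norm(A[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([A[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(10)
A = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
labels, centroids = kmeans(A, k=2)
print("centroids:\n", np.round(centroids, 2))
```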

Another variant is presented in [168] where the homogeneity is expressed by the minimization of the maximum diameter D_{max} of the k clusters. The following nonconvex bilinear mixed-integer program is proposed
\[ \min \ D_{\max} \qquad (47) \]
\[ \text{s.t.} \ D_l \ge d_{ij} x_{il} x_{jl} \quad \forall i, j, l, \ i = 1, \dots, N, \ j = 1, \dots, N, \ l = 1, \dots, k \qquad (48) \]
\[ \sum_{l=1}^{k} x_{il} = 1 \quad \forall i = 1, \dots, N \qquad (49) \]
\[ D_{\max} \ge D_l \quad \forall l = 1, \dots, k \qquad (50) \]
\[ x_{il} \in \{0,1\} \quad \forall i = 1, \dots, N, \ l = 1, \dots, k \qquad (51) \]
\[ D_l \ge 0 \quad \forall l = 1, \dots, k. \qquad (52) \]

The decision variable D_l expresses the diameter of cluster l and x_{il} is activated if observation i belongs to cluster l.

4.2 Capacitated Clustering

The Capacitated Centered Clustering Problem (CCCP) deals with finding a set of clusters with a capacity limitation and homogeneity expressed by the similarity to the cluster centre. A mathematical formulation is given in [152]


as
\[ \min \ \sum_{i=1}^{N} \sum_{j=1}^{M} d_{ij} x_{ij} \qquad (53) \]
\[ \text{s.t.} \ \sum_{j=1}^{M} x_{ij} = 1 \quad i = 1, \dots, N \qquad (54) \]
\[ \sum_{j=1}^{M} y_j \le M \qquad (55) \]
\[ x_{ij} \le y_j \quad i = 1, \dots, N, \ j = 1, \dots, M \qquad (56) \]
\[ \sum_{i=1}^{N} q_i x_{ij} \le Q_j \quad j = 1, \dots, M \qquad (57) \]
\[ x_{ij}, y_j \in \{0,1\} \quad i = 1, \dots, N, \ j = 1, \dots, M, \]

where M is an upper bound on the number of clusters, d_{ij} is the dissimilarity measure between entities i and j, q_i is the demand of entity i, Q_j is the capacity of cluster j, variable x_{ij} denotes the assignment of observation i to cluster j, and variable y_j is equal to 1 if cluster j is used. If the metric d_{ij} is a distance and the clusters are homogeneous (Q_j = Q ∀j), the formulation also models the well-known facility location problem. Furthermore, when the total number of clusters is fixed to p, the resulting p-CCCP can be solved using a clustering search algorithm that is discussed in [57]. Alternative solution heuristics have also been proposed in [147] and [171], as well as a quadratic programming formulation that is presented in [136].
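A sketch of formulation (53)–(57) using PuLP and its bundled CBC solver (both assumed available). As a simplifying assumption, candidate cluster centres are taken to be the entities themselves, in the spirit of the facility location interpretation noted above.

```python
# Capacitated clustering (53)-(57) as a MIP in PuLP; candidate centres are the
# entities themselves and at most M of them may be opened.
import numpy as np
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

rng = np.random.default_rng(11)
N, M = 12, 3                                    # entities and maximum number of clusters
pts = rng.uniform(0, 10, size=(N, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)   # dissimilarities d_ij
q = np.ones(N)                                  # unit demand per entity
Q = np.full(N, 5.0)                             # capacity of a cluster centred at j

prob = LpProblem("cccp", LpMinimize)
x = LpVariable.dicts("x", (range(N), range(N)), cat=LpBinary)
y = LpVariable.dicts("y", range(N), cat=LpBinary)

prob += lpSum(d[i][j] * x[i][j] for i in range(N) for j in range(N))        # objective (53)
for i in range(N):
    prob += lpSum(x[i][j] for j in range(N)) == 1                           # assignment (54)
prob += lpSum(y[j] for j in range(N)) <= M                                  # cluster budget (55)
for i in range(N):
    for j in range(N):
        prob += x[i][j] <= y[j]                                             # linking (56)
for j in range(N):
    prob += lpSum(q[i] * x[i][j] for i in range(N)) <= Q[j]                 # capacity (57)

prob.solve(PULP_CBC_CMD(msg=False))
centres = [j for j in range(N) if y[j].value() > 0.5]
print("open cluster centres:", centres)
```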

4.3 k-Hyperplane Clustering

In the k-Hyperplane Clustering (k-HC) problem, a hyperplane, instead of a center, is associated with each cluster. This is motivated by applications where collinearity and coplanarity relations among the observations are the main interest of the unsupervised learning task, rather than the similarity. Given observations A = {a_1, . . . , a_N} in R^n, the k-HC problem requires finding k clusters, and a hyperplane H_j = {x ∈ R^n : w_j^T x = γ_j}, with w_j ∈ R^n and γ_j ∈ R, for each cluster j, in order to minimize the sum of the squared 2-norm Euclidean orthogonal distances between each observation and the corresponding cluster.

Given that the orthogonal distance of a_i to hyperplane H_j is given by |w_j^T a_i − γ_j| / ‖w_j‖_2, k-HC is formulated in [12] as the following mixed integer quadratically constrained quadratic problem:

\[ \min \ \sum_{i=1}^{N} \delta_i^2 \qquad (58) \]
\[ \text{s.t.} \ \sum_{j=1}^{k} x_{ij} = 1 \quad i = 1, \dots, N \qquad (59) \]
\[ \delta_i \ge (w_j^T a_i - \gamma_j) - M(1 - x_{ij}) \quad i = 1, \dots, N, \ j = 1, \dots, k \qquad (60) \]
\[ \delta_i \ge (-w_j^T a_i + \gamma_j) - M(1 - x_{ij}) \quad i = 1, \dots, N, \ j = 1, \dots, k \qquad (61) \]
\[ \|w_j\|_2 \ge 1 \quad j = 1, \dots, k \qquad (62) \]
\[ \delta_i \ge 0 \quad i = 1, \dots, N \qquad (63) \]
\[ w_j \in \mathbb{R}^n, \ \gamma_j \in \mathbb{R} \quad j = 1, \dots, k \qquad (64) \]
\[ x_{ij} \in \{0,1\} \quad i = 1, \dots, N, \ j = 1, \dots, k \qquad (65) \]

Constraints (60)–(61) model the point-to-hyperplane distance via linear constraints. The non-convexity is due to Constraints (62).

5 Deep Learning

Deep Learning (DL) received a first momentum up to the 1980s due to the universal approximation results [72, 110], whereby neural networks with a single layer with a finite number of units can represent any multivariate continuous function on a compact subset of R^n with arbitrary precision. However, the computational complexity required for training Deep Neural Networks (DNNs) hindered their diffusion by the late 1990s. Starting in 2010, the empirical success of DNNs has been widely recognized, for several reasons including the development of advanced processing units, namely GPUs, the advances in the efficiency of training algorithms such as backpropagation, the establishment of proper initialization parameters, and the massive collection of data (see Section 7) enabled by new technologies in a variety of domains (e.g., healthcare, supply chain management [179], marketing, logistics [186], Internet of Things). The aim of this section is to describe the decision optimization problems associated with DNN architectures. To facilitate the presentation, the notation for the common parameters is provided in Table 1.

The output vector x^K of a DNN is computed by propagating the information from the input layer to each following layer such that
\[ x^k = \sigma(W^{k-1} x^{k-1} + b^{k-1}) \quad k = 1, \dots, K. \qquad (66) \]


0, . . . , K                  layer indices
n_k                          number of units, or neurons, in layer k
σ                            element-wise activation function
U(j, k)                      j-th unit of layer k
W^k ∈ R^{n_k × n_{k+1}}      weight matrix for layer k
b^k ∈ R^{n_k}                bias vector for layer k
x^0_j                        j-th input feature
x^k                          output vector of layer k, k > 0 (derived feature)
x                            training dataset
y                            testing dataset, with observations y_i, i = 1, . . . , N

Table 1: Notation for DNN architectures.

The output of the DNN is finally evaluated for regression or classification tasks. In the context of regression, the components of $x^K$ can directly represent the learned response values. For a classification problem, the vector $x^K$ corresponds to the logits of the classifier. In order to interpret $x^K$ as a vector of class probabilities, functions $F$ such as the logistic sigmoid or the softmax can be applied [96]. The classifier $C$ modeled by the DNN then classifies an input $x$ with the label corresponding to the maximum activation:

$C(x) = \arg\max_{i=1,\dots,n_K} F(x_i^K)$.
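A minimal sketch of the forward propagation (66) and of the classification rule $C(x)$, assuming a ReLU activation and the softmax as $F$; the weight and bias lists (and their shape convention) are placeholders for a trained network.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x0, weights, biases, sigma=relu):
    """Propagate x^k = sigma(W^{k-1} x^{k-1} + b^{k-1}) for k = 1,...,K.

    Here weights[k] is assumed to map layer k to layer k+1 (an n_{k+1} x n_k array)."""
    x = x0
    for W, b in zip(weights, biases):
        x = sigma(W @ x + b)
    return x  # x^K, the logits for a classification task

def classify(x0, weights, biases):
    """C(x) = argmax_i F(x^K_i), with F taken as the softmax."""
    logits = forward(x0, weights, biases)
    return int(np.argmax(softmax(logits)))
```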

The aim of this section is to present the optimization models that are used in DNNs. First, mixed integer programming models for DNN training are introduced in Section 5.1. Adversarial learning and data poisoning are then discussed in Sections 5.2 and 5.3, respectively. Finally, ensemble approaches that train DNNs with multiple activation functions are discussed in Section 5.4.

5.1 DNN Training with Mixed Integer Programming

For a data scientist, a considerable effort may be spent in testing configurations of parameters, such as the number of layers and their size, and the learning rates and epochs for the training algorithm. All the values that need to be determined before the training process takes place are called hyperparameters. Hyperparameter Optimization (HPO) is typically driven by the data scientist's experience, the characteristics of the dataset, or by following heuristic rules. Alternatively, HPO can be modeled in a mathematical programming framework such as the approach presented in [78], where a box-constrained model is presented and solved by a derivative-free approach.


Given $A$ the set of valid assignments of values for the hyperparameters, and $f$ a performance metric (e.g., prediction accuracy on a test set), the HPO requires the solution of

$\arg\max_{a \in A} f(a, x)$.  (67)

By bounding the number of hidden layers and their size, and relying on a surrogate model for the stochastically computable objective $f$, formulation (67) becomes a box-constrained problem.
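A simple baseline for problem (67) is random search over the box of hyperparameter assignments. The sketch below assumes a user-supplied evaluation function (the hypothetical validation_accuracy) playing the role of the metric $f$; it is not the surrogate-based derivative-free method of [78].

```python
import random

def random_search_hpo(evaluate, space, budget=50, seed=0):
    """Randomly sample assignments a in the box A and keep the one maximizing
    the performance metric f(a, x) = evaluate(a)."""
    rng = random.Random(seed)
    best_a, best_f = None, float("-inf")
    for _ in range(budget):
        a = {name: (rng.choice(values) if isinstance(values, list)
                    else rng.uniform(*values))
             for name, values in space.items()}
        f = evaluate(a)
        if f > best_f:
            best_a, best_f = a, f
    return best_a, best_f

# Example box: number of layers, units per layer, learning rate.
space = {"layers": [1, 2, 3], "units": [16, 32, 64], "lr": (1e-4, 1e-1)}
# best, score = random_search_hpo(lambda a: validation_accuracy(a), space)
# validation_accuracy is a hypothetical function that trains and scores a model.
```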

In terms of activation functions, the rectified linear unit

$\mathrm{ReLU} : \mathbb{R}^n \to \mathbb{R}^n$, $\mathrm{ReLU}(z) = (\max(0, z_1), \dots, \max(0, z_n))$

is typically the preferred option. If the unit is active, the gradient is strictly positive, and this favors the adoption of gradient-based optimization algorithms for the training of DNNs. In order to learn from non-active units, some generalizations have been proposed: rectification given by the absolute value function [117], the leaky ReLU [145] $\max(0, y) + \alpha \min(0, y)$ with a small $\alpha$, and the parametric ReLU with a learnable $\alpha$ for each component [108]. Other activations can also be obtained via the element-wise function $\mathrm{sign} : \mathbb{R} \to \mathbb{R}$, with $\mathrm{sign}(z) = 1$ if $z \geq 0$ and $\mathrm{sign}(z) = 0$ if $z < 0$.
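The activations mentioned above can be written compactly as element-wise NumPy operations; the sketch follows the definitions given in the text (in particular, the step-like sign activation returning 1 for non-negative inputs and 0 otherwise).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(0, y) + alpha * min(0, y), with a small fixed alpha
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def parametric_relu(z, alpha):
    # same form as the leaky ReLU, but alpha is learned per component
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def step_sign(z):
    # element-wise step activation: 1 if z >= 0, 0 otherwise
    return (z >= 0).astype(float)
```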

The task of training a DNN consists of determining the weights $W^k$ and the biases $b^k$ that make the model best fit the training data, according to a certain measure of training loss. For a regression with $Q$ quantitative responses, a common measure of training loss $L$ is the sum-of-squared errors on the testing dataset

$\sum_{q=1}^{Q} \sum_{i=1}^{N} (y_{iq} - x_q^K)^2$.  (68)

For classification with $Q$ classes, the cross-entropy

$-\sum_{q=1}^{Q} \sum_{i=1}^{N} y_{iq} \log x_q^K$  (69)

is preferred. An effective approach to minimize $L$ is gradient descent, called backpropagation in this setting. Typically, one is not interested in a proven local minimum of $L$, as this is likely to overfit the training dataset and yield a learning model with a high variance. Similar to Ridge regression (see Section 2), the loss function can include a weight decay term such as

$\lambda \left( \sum_{k=0}^{K-1} \sum_{i=1}^{n_k} (b_i^k)^2 + \sum_{k=0}^{K-1} \sum_{i=1}^{n_k} \sum_{j=1}^{n_{k+1}} (W_{ij}^k)^2 \right)$  (70)

or alternatively a weight elimination penalty

$\lambda \left( \sum_{k=0}^{K-1} \sum_{i=1}^{n_k} \frac{(b_i^k)^2}{1 + (b_i^k)^2} + \sum_{k=0}^{K-1} \sum_{i=1}^{n_k} \sum_{j=1}^{n_{k+1}} \frac{(W_{ij}^k)^2}{1 + (W_{ij}^k)^2} \right)$.  (71)

Function (71) tends to eliminate smaller weights more than (70).
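The two penalties can be compared directly; the following sketch evaluates (70) and (71) for given lists of weight matrices and bias vectors (the interface is ours, not taken from the surveyed papers).

```python
import numpy as np

def weight_decay(weights, biases, lam):
    """Penalty (70): lambda times the sum of squared biases and weights."""
    return lam * (sum(np.sum(b**2) for b in biases) +
                  sum(np.sum(W**2) for W in weights))

def weight_elimination(weights, biases, lam):
    """Penalty (71): each squared term is damped as w^2 / (1 + w^2), which
    penalizes small weights relatively more and pushes them towards zero."""
    return lam * (sum(np.sum(b**2 / (1.0 + b**2)) for b in biases) +
                  sum(np.sum(W**2 / (1.0 + W**2)) for W in weights))
```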

Motivated by the considerable improvements of Mixed-Integer Programming solvers, a natural question is how to model a DNN as a MIP. In [87], DNNs with ReLU activation

$x^k = \mathrm{ReLU}(W^{k-1} x^{k-1} + b^{k-1})$,  $\forall k = 1, \dots, K$  (72)

are modeled, where variables $x^k$ express the output vector of layer $k$, $k > 0$, and $x^0$ is the input vector. The following mixed integer linear problem is proposed

$\min \sum_{k=0}^{K} \sum_{j=1}^{n_k} c_j^k x_j^k + \sum_{k=1}^{K} \sum_{j=1}^{n_k} \gamma_j^k z_j^k$  (73)
s.t. $\sum_{i=1}^{n_{k-1}} w_{ij}^{k-1} x_i^{k-1} + b_j^{k-1} = x_j^k - s_j^k$,  $\forall k = 1, \dots, K$, $j = 1, \dots, n_k$  (74)
$z_j^k = 1 \rightarrow x_j^k \leq 0$,  $\forall k = 1, \dots, K$, $j = 1, \dots, n_k$  (75)
$z_j^k = 0 \rightarrow s_j^k \leq 0$,  $\forall k = 1, \dots, K$, $j = 1, \dots, n_k$  (76)
$lb_j^0 \leq x_j^0 \leq ub_j^0$,  $\forall j = 1, \dots, n_0$  (77)
$lb_j^k \leq x_j^k \leq ub_j^k$,  $\forall k = 1, \dots, K$, $j = 1, \dots, n_k$  (78)
$\overline{lb}_j^k \leq s_j^k \leq \overline{ub}_j^k$,  $\forall k = 1, \dots, K$, $j = 1, \dots, n_k$  (79)

where $z_j^k$ are binary activation variables and $s_j^k$ are slack variables that correspond with each unit $U(j, k)$. Depending on the application, different activation weights $c_j^k$ and activation costs $\gamma_j^k$ can also be used for each $U(j, k)$.

The proposed MIP is feasible for every input vector $x^0$, since it computes the activation in the subsequent layers. Constraints (75) and (76) are indicator constraints, which are known to generate very hard optimization problems because of their weak continuous relaxation [40]. Several optimization solvers can directly handle such constraints, and the tightness of the provided bounds is crucial for their effectiveness. In [87], a bound-tightening strategy to reduce the computational times is proposed, and the largest DNN tested with this approach is a 5-layer DNN with 20+20+10+10+10 internal units.
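As a concrete illustration of model (73)-(79), the sketch below builds the MIP for a trained ReLU network with Gurobi's Python API, using addGenConstrIndicator for the indicator constraints (75)-(76). It assumes that the variable bounds lb/ub are supplied (e.g., by interval propagation) and, for simplicity, reuses ub as the slack bounds; it does not implement the bound-tightening strategy of [87].

```python
import gurobipy as gp
from gurobipy import GRB

def build_relu_mip(W, b, lb, ub, c, gamma):
    """Sketch of model (73)-(79) for a trained ReLU network.

    W[k], b[k] are the fixed weights/biases feeding layer k+1 (W[k] has shape
    n_{k+1} x n_k); lb[k][j], ub[k][j] bound unit j of layer k; c and gamma are
    the objective coefficients of (73)."""
    K = len(W)                                   # layers after the input
    n = [W[0].shape[1]] + [Wk.shape[0] for Wk in W]
    m = gp.Model("relu_dnn")
    x = {(k, j): m.addVar(lb=lb[k][j], ub=ub[k][j], name=f"x_{k}_{j}")
         for k in range(K + 1) for j in range(n[k])}
    s = {(k, j): m.addVar(lb=0.0, ub=ub[k][j], name=f"s_{k}_{j}")   # slack bounds: assumption
         for k in range(1, K + 1) for j in range(n[k])}
    z = {(k, j): m.addVar(vtype=GRB.BINARY, name=f"z_{k}_{j}")
         for k in range(1, K + 1) for j in range(n[k])}
    for k in range(1, K + 1):
        for j in range(n[k]):
            # (74): pre-activation split into output x and slack s
            m.addConstr(gp.quicksum(W[k - 1][j, i] * x[k - 1, i]
                                    for i in range(n[k - 1])) + b[k - 1][j]
                        == x[k, j] - s[k, j])
            # (75)-(76): indicator constraints selecting the active side
            m.addGenConstrIndicator(z[k, j], True, x[k, j] <= 0.0)
            m.addGenConstrIndicator(z[k, j], False, s[k, j] <= 0.0)
    m.setObjective(gp.quicksum(c[k][j] * x[k, j]
                               for k in range(K + 1) for j in range(n[k])) +
                   gp.quicksum(gamma[k][j] * z[k, j]
                               for k in range(1, K + 1) for j in range(n[k])),
                   GRB.MINIMIZE)
    return m, x, s, z
```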

Problem (73)–(79) can model several tasks in DL. These include

• Pooling operations: The average and the maximum operators

$\mathrm{Avg}(x^k) = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^k$,  $\mathrm{Max}(x^k) = \max(x_1^k, \dots, x_{n_k}^k)$,

used for example in Convolutional Neural Networks, can be incorporated in the hidden layers. In the case of max pooling operations, additional indicator constraints are required.

• Maximizing the unit activation: By maximizing the objective function (73), one can find input examples $x^0$ that maximize the activation of the units. This may be of interest in applications such as the visualization of image features.

• Building crafted adversarial examples: Given an input vector $x^0$ labeled as $l$ by the DNN, the search for perturbations of $x^0$ that are classified as $l' \neq l$ (adversarial examples) can be conducted by adding conditions on the activation of the final layer $K$ and minimizing the perturbation.

• Training: In this case, the weights and biases are decision variables. The resulting bilinear terms in (74) and the considerable number of decision variables in the formulation limit the applicability of (73)-(79) for DNN training. In addition, searching for proven optimal solutions is likely to lead to overfitting.

Another attempt at modelling DNNs via MIPs is provided by [125], in the context of Binarized Neural Networks (BNNs). BNNs are characterized by having binary weights $\{-1, +1\}$ and by using the sign function for neuron activation [69]. Given that the established search strategies for adversarial examples (e.g., the fast gradient sign method, the projected gradient descent (PGD) attack) rely on gradient information, the discrete and non-differentiable architecture of BNNs makes them quite robust to such attacks. In [125], a MIP is proposed for finding adversarial examples in BNNs by maximizing the difference between the activation of the targeted label $l'$ and the predicted label $l$ of the input $x^0$ in the final layer (namely, $\max\, x_{l'}^K - x_l^K$). Contrary to [87], the MIP of [125] does not impose limitations on the search of adversarial examples, apart from the perturbation quantity. In terms of optimality criterion however, searching for the proven largest misclassified example is different from finding a targeted adversarial example. Furthermore, while there is interest in minimally perturbed adversarial examples, suboptimal solutions corresponding to adversarial examples (i.e., $x_{l'}^K \geq x_l^K$) may have a perturbation smaller than that of the optimal solution.

Besides [87], other MIP frameworks have been proposed to model certain properties of neural networks in a bounded input domain. In [59], the problem of computing maximum perturbation bounds for DNNs is formulated as a MIP, where indicator constraints and disjunctive constraints are modeled using constraints with big-M coefficients [100]. The maximum perturbation bound is a threshold such that the perturbed input may be classified correctly with a high probability. A restrictive misclassification condition is added when formulating the MIP. Hence, the infeasibility of the MIP does not certify the absence of adversarial examples. In addition to the ReLU activation, the $\tan^{-1}$ function is also considered by introducing quadratic constraints, and several heuristics are proposed to solve the resulting problem. In [180], a model to formally measure the vulnerability to adversarial examples is proposed (the concept of vulnerability of NNs is discussed in Sections 5.2.1 and 5.2.2). A tight formulation for the resulting non-linearities and a novel presolve technique are introduced to limit the number of binary variables and improve the numerical conditioning. However, the misclassification condition is not explicitly defined but is rather left in the form "different from" and not explicitly modeled using equality/inequality constraints. In [172], the aim is to count or bound the number of linear regions that a piecewise linear classifier represented by a DNN can attain. Assuming that the input space is bounded and polyhedral, the DNN is modeled as a MIP. The contributions of adopting a MIP framework in this context are limited, especially in comparison with the results achieved in [151].

MIP frameworks can also be used to formulate the verification problem for neural networks as a satisfiability problem. In [121], a satisfiability modulo theory solver is proposed based on an extension of the simplex method to accommodate the ReLU activation functions. In [50], a branch-and-bound framework for verifying piecewise-linear neural networks is introduced. For a recent survey on the approaches for automated verification of NNs, the reader is referred to [135].


5.2 Adversarial Learning

Despite the wide interest in deep learning, the integration of neural networks into safety and security related applications necessitates thorough evaluation and research. A large number of contributions in the literature showed the presence of perturbed examples, also called adversarial examples, causing classification errors [36, 177]. Malicious attackers can thus exploit security flaws in a general classifier. In case the attacker has perfect knowledge of the classifier's architecture (i.e., the result of the training phase), a white-box attack can be performed. Black-box attacks are instead performed without full information on the classifier. The attention on adversarial examples is also motivated by the transferability of the attacks to different trained models [130, 181]. Adversarial learning then emerges as a framework to devise vulnerability attacks for classification models [142].

From a mathematical perspective, such security issues have been formerly expressed via min-max approaches where the learner's and the attacker's loss functions are antagonistic [73, 94, 131]. Non-antagonistic losses are formulated as a Stackelberg equilibrium problem involving a bi-level optimization formulation [48], or in a Nash equilibrium approach [47]. These theoretical frameworks rely on the assumption of expressing the actual problem constraints in a game-theory setting, which is often not a viable option for real-life applications. Attacks for linear and non-linear classifiers with a differentiable discriminant function are investigated in [35]. The evasion of the classifier is considered on the test set. The aim of the attacker is to find an example $x'$ close to a malicious sample $x$, according to a metric $d$, such that the discriminant function $g$ of the classifier is minimized, i.e.,

$\min_{x' \in \chi} g(x')$  (80)
s.t. $d(x', x) \leq \delta$.  (81)

An estimation $g'$ of the discriminant function may also be adopted in place of $g$ in (80). The non-linear optimization model can be solved via gradient descent or quadratic techniques, such as Newton's method, BFGS, or L-BFGS.

As a particular case of adversarial learning, [177] discusses the vulnerability of neural networks to adversarial examples. Such inputs are obtained by small perturbations of the original training dataset and yield a selected adversary output. In the context of image classification, imperceptible modifications of the input images can lead to severe misclassifications. The linearity of the neural network can increase the vulnerability [113].

As discussed in the next section, evasion attacks can be conducted in a targeted or untargeted fashion [53]. In the targeted setup, the attacker aims to achieve a classification with a chosen class target, while the untargeted misclassification is not constrained to a specific output class.

5.2.1 Targeted attacks

Given a neural network classifier $f : \chi \subseteq \mathbb{R}^{n_0} \to \Upsilon = \{1, \dots, n_K\}$ and a target label $l \in \Upsilon$, a relevant problem in the framework of targeted attacks is that of finding a minimum perturbation $r$ of a given input $x^0$, such that $f(x^0 + r) = l$. Clearly, if the target $l$ coincides with the label $l_x$ that classifies $x^0$ according to $f$, the problem has the trivial solution $r = 0$ and no misclassification takes place.

In [177], the minimum adversarial perturbation problem for targeted attacks is formulated as a box-constrained problem

$\min_{r \in \mathbb{R}^{n_0}} \|r\|_2$  (82)
s.t. $f(x^0 + r) = l$  (83)
$x^0 + r \in [0, 1]^{n_0}$.  (84)

Condition (84) ensures that the perturbed example $x^0 + r$ belongs to the set $\chi$ of admissible inputs, in the case of normalized images with pixel values ranging from 0 to 1. Assuming $l \neq l_x$, the difficulty of solving Problem (82)-(84) to optimality depends on the complexity of the classifier $f$. Denoting by $L_f : \chi \times \{1, \dots, n_K\} \to \mathbb{R}^+$ the loss function for training $f$, [177] approximates the problem with a box-constrained L-BFGS

$\min_{r \in \mathbb{R}^{n_0}} c|r| + L_f(x^0 + r, l)$  (85)
$x^0 + r \in [0, 1]^{n_0}$.  (86)

The approximation is exact for convex loss functions, and can be solved via a line search algorithm on $c > 0$. In [101], $c$ is fixed such that the perturbation is minimized on a sufficiently large subset $\psi$, and the mean prediction error rate of $f(x_i + r_i)$, $i \in \psi$, is greater than a threshold $\tau$. In [53], the $L_2$ distance metric of formulation (85)-(86) is generalized to the $L_p$ norm with $p \in \{0, 2, \infty\}$, and several choices of functions $\mathcal{F}$ satisfying $f(x^0 + r) = l$ if and only if $\mathcal{F}(x^0 + r) \leq 0$ are introduced. The equivalent formulation is then

$\min_{r \in \mathbb{R}^{n_0}} \|r\|_p + \gamma \mathcal{F}(x^0 + r, l)$  (87)
$x^0 + r \in [0, 1]^{n_0}$,  (88)


where $\gamma$ is a constant that can be determined by binary search such that the solution $r^*$ satisfies the condition $\mathcal{F}(x^0 + r^*) \leq 0$. The authors propose strategies for applying optimization algorithms (such as Adam [127]) that do not support the box constraints (88) natively. Novel classes of attacks are found for the considered metrics.
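In the spirit of the penalized formulations (85)-(88), a minimal PyTorch sketch of a targeted attack optimizes $\|r\|_2 + \gamma \cdot \text{loss}$ with Adam and keeps $x^0 + r$ in the box by clamping. The cross-entropy loss is used here as a simple stand-in for the functions $\mathcal{F}$ of [53]; the model interface, step counts, and constants are assumptions.

```python
import torch

def targeted_attack(model, x0, target, gamma=1.0, steps=200, lr=0.01):
    """Minimize ||r||_2 + gamma * loss(f(x0 + r), target) with Adam, keeping
    x0 + r inside the [0, 1] box by clamping (a simple stand-in for (88))."""
    r = torch.zeros_like(x0, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv = torch.clamp(x0 + r, 0.0, 1.0)
        logits = model(x_adv.unsqueeze(0))      # model: assumed batched classifier
        loss = r.norm(p=2) + gamma * loss_fn(logits, torch.tensor([target]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.clamp(x0 + r, 0.0, 1.0).detach()
```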

5.2.2 Untargeted attacks

In untargeted attacks, one searches for adversarial examples $x'$ close to the original input $x^0$ for which $l_{x'} \neq l_{x^0}$, without targeting a specific label for $x'$. Given that the only aim is misclassification, untargeted attacks are deemed less powerful than their targeted counterpart, and have received less attention in the literature.

A mathematical formulation to find the minimum adversarial distortion for untargeted attacks is proposed in [180]. Assuming that the classifier $f$ is expressed by the set of functions $f_i$ associated with each label in $\{1, \dots, n_K\}$, and given a distance metric $d$, a perturbation $r$ for an untargeted attack is found by solving

$\min_{r} d(r)$  (89)
s.t. $\arg\max_{i \in \Upsilon} f_i(x^0 + r) \neq l_{x^0}$  (90)
$x^0 + r \in \chi$.  (91)

This formulation can easily accommodate targeted attacks in a set $T$ by replacing (90) with $\arg\max_i f_i(x^0 + r) \in T$. The most commonly adopted metrics in the literature are the $L_1$, $L_2$, and $L_\infty$ norms which, as shown in [180], can all be expressed with continuous variables. The $L_2$ norm makes the objective function of the outer-level optimization problem quadratic.

Problem (89)-(91) can also be expressed as the bi-level optimization problem

$\min_{r, z} d(r)$  (92)
s.t. $z - l_x \leq -\epsilon + My$  (93)
$z - l_x \geq \epsilon - (1 - y)M$  (94)
$z \in \arg\max_{i \in \Upsilon} f_i(x^0 + r)$  (95)
$x^0 + r \in \chi$.  (96)

Constraints (93)-(94) express the misclassification condition $z \neq l_x$ using constraints with big-M coefficients. The complexity of the inner-level optimization problem depends on the activation functions. Given that the upper-level feasibility set $\chi$ is continuous and the lower-level variable $i$ ranges over a discrete set, the problem is in fact a continuous discrete linear bilevel programming problem [184], which is arguably the hardest class of bilevel linear programming formulations. Reformulations or approximations have been proposed in [74, 83, 102, 169, 193].

Recalling the definition given above, an untargeted attack finds an adversarial example that is "close enough" to the original input $x$. A trivial way to obtain an untargeted adversarial attack is then to solve Problem (82)-(84) for every $l \neq l_x$ and then select the solution $x'$ closest to $x$, according to a given metric. This approach would however go beyond the purpose of untargeted evasion, since it would perform a targeted attack for every possible incorrect label $l \neq l_x$.

We introduce an alternative mathematical formulation for finding untargeted adversarial examples which generalizes condition (83). Given an untargeted example $x' = x^0 + r$ with $l_{x'} \neq l_x$ and the scoring functions $f_i$, $i \in \Upsilon$, the condition is equivalent to $f^*(x) \neq \arg\max_{i \in \Upsilon} f_i(x')$, which can then be expressed as

$\exists\, i \in \Upsilon$ s.t. $f_i(x + r) > f_{l_x}(x)$.  (97)

Condition (97) is an existence condition, which can be formalized by adding the functions $\sigma_i(r) = \mathrm{ReLU}(f_i(x + r) - f_{l_x}(x))$, $i \in \Upsilon$, a composition of the scoring functions $f_i$ and the ReLU function, and requiring

$\sum_{i \in \Upsilon} \sigma_i(r) > \epsilon$,  (98)

where $\epsilon > 0$ enforces that at least one $\sigma_i$ function has to be activated. Therefore, untargeted adversarial examples can be found by modifying formulation (82)-(84), adding the $m$ functions $\sigma_i(r)$ and replacing condition (83) with the linear condition (98). The complexity of this approach depends on the scoring functions $f_i$. The extra ReLU functions can be expressed in a mixed integer problem as reviewed in Section 5.1.

5.2.3 Models resistant to adversarial attacks

Another interesting line of research motivated by adversarial learning deals with adversarial training, which consists of techniques to make a neural network robust to adversarial attacks. A widely known defense technique is to augment the training data with adversarial examples; this however does not offer robustness guarantees on novel kinds of attacks.

The problem of measuring the robustness of a neural network is considered in [22], where three robustness metrics are proposed. The pointwise robustness evaluates whether the classification of $x^0$ is robust to "small" perturbations, the adversarial frequency measures the portion of training inputs for which $f$ fails to be $(x^0, \epsilon)$-robust, and the adversarial severity measures the extent by which $f$ fails to be robust at $x^0$, conditioned on $f$ not being $(x^0, \epsilon)$-robust. Formally, $f$ is said to be $(x^0, \epsilon)$-robust if

$l_{x'} = l_{x^0}$,  $\forall x' : \|x' - x^0\|_\infty \leq \epsilon$.  (99)

The pointwise robustness $\rho(f, x^0)$ is the minimum $\epsilon$ for which $f$ fails to be $(x^0, \epsilon)$-robust:

$\rho(f, x^0) = \inf\{\epsilon \geq 0 : f \text{ is not } (x^0, \epsilon)\text{-robust}\}$.  (100)

As detailed in [22], $\rho$ is computed by expressing (100) as a constraint satisfiability problem. The concept of pointwise robustness is more general than the perturbation bound defined in [59] since no restrictions are imposed on the activation of each layer.

The adversarial training of neural networks by a robust optimization (RO) approach is investigated in [146]. In the RO framework, a decision maker searches for solution policies that avoid a worst-case scenario [24]. In this setting, the goal is to train a neural network to be resistant to all the attacks belonging to a certain class of perturbed inputs. Particularly, adversarial robustness with a saddle point (min-max) formulation is studied in [146], which is obtained by augmenting the Empirical Risk Minimization paradigm. Given $D$ the data distribution of the training samples $\xi \in x$ and the corresponding labels $l \in \Upsilon$, let $\theta \in \mathbb{R}^p$ be the set of model parameters to be learned. Let $L(\theta, \xi, l)$ be the loss function considered in the training phase (e.g., the cross-entropy loss) and $S$ be the set of allowed perturbations (e.g., an $L_\infty$ ball). The aim is to minimize the worst adversarial loss on the set of inputs perturbed by $S$:

$\min_\theta \mathbb{E}_{(\xi, l)} \left[ \max_{r \in S} L(\theta, \xi + r, l) \right]$  (101)

The saddle point problem (101) is viewed as the composition of an inner maximization and an outer minimization problem. The inner problem corresponds to attacking a trained neural network by means of the perturbations $S$. The outer problem deals with the training of the classifier in a robust manner. The importance of formulation (101) stems both from the formalization of adversarial training and from the quantification of the robustness given by the objective function value on the chosen class of perturbations. The value $\rho(\theta) = \mathbb{E}_{(x,y)}\left[\max_{r \in S} L(\theta, x + r, y)\right]$ quantifies the magnitude of the loss.
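A common way to approximate the saddle point (101) is to solve the inner maximization with projected gradient ascent (sign steps projected onto an $L_\infty$ ball) and the outer minimization with stochastic gradient steps. The PyTorch sketch below follows this scheme under assumed names and hyperparameters; it is an illustration, not the exact procedure of [146].

```python
import torch

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=10):
    """Approximate the inner maximization of (101) over an L-infinity ball."""
    loss_fn = torch.nn.CrossEntropyLoss()
    r = torch.zeros_like(x)
    for _ in range(steps):
        r.requires_grad_(True)
        loss = loss_fn(model(x + r), y)
        grad, = torch.autograd.grad(loss, r)
        with torch.no_grad():
            r = torch.clamp(r + alpha * grad.sign(), -eps, eps)
    return r.detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.3):
    """One outer-minimization step on the adversarially perturbed batch."""
    loss_fn = torch.nn.CrossEntropyLoss()
    r = pgd_attack(model, x, y, eps=eps)
    optimizer.zero_grad()
    loss = loss_fn(model(x + r), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```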


A similar robust approach is investigated in [173]. A perturbation set $S_i$ is adopted for each training example $\xi_i$. The aim is then to optimize

$\min_\theta \sum_{i=1}^{m} \max_{r_i \in S_i} L(\theta, \xi_i + r_i, l_i)$  (102)

As detailed in [173], an alternating ascent and descent steps procedure can be used to solve (102), with the loss function approximated by a first-order Taylor expansion around the training points. The methodology can also handle another loss function, the so-called Manifold Tangent Classifier (MTC) [165]. The MTC adopts the following loss function

$J(\theta, \xi, l) = L(\theta, \xi, l) + \beta \sum_{u \in B_\xi} (\langle u, \nabla_\xi f(\xi) \rangle)^2$,  (103)

where $B_\xi$ is a basis of the hyperplane that is tangent to the data manifold at $\xi$, $\beta$ is a weight parameter, and $f(\xi)$ is the label for $\xi$ classified by $f$.

Adversarial training is modeled in [137] as

$\alpha L(\theta, \xi, l) + (1 - \alpha) L(\theta, \tilde{\xi}, l)$  (104)

where $L(\theta, \xi, l)$ is the loss function used to generate adversarial examples, $\xi$ is the example with label $l$, $\tilde{\xi}$ is a perturbation of $\xi$, and $\alpha$ is a constant used to control the strength of the adversary. In the fast gradient sign method, the perturbation $\tilde{\xi}$ for example $\xi$ with label $l$ is computed in the direction of the gradient

$\tilde{\xi} = \xi + \epsilon \cdot \mathrm{sign}(\nabla_\xi L(\theta, \xi, l))$.

In order to improve the robustness of the neural network, a distortion term $\sum_{i=1}^{L-1} \beta_i \cdot \sum_{j=1}^{h_i} d_{ij}$ can be added to (104), where $L - 1$ is the number of normalization layers of the trained network, $h_i$ is the number of features in the $i$-th layer, $d_{ij}$ is the distortion of the value of the $j$-th normalized feature in the $i$-th layer, and $\beta_i$ is a balancing constant.

5.3 Data poisoning

Another class of attacks is that of Data Poisoning. In this context, the attacker hides corrupted, altered, or noisy data in the training dataset. This causes failures in the accuracy of the training process for classification purposes. Data Poisoning was first studied by [37] to degrade the accuracy of SVMs. In [176], the upper bounds on the efficacy of a class of causative data poisoning attacks are studied. The causative attacks [21] proceed as follows:


• $n_0$ data points are drawn from the data-generating distribution $p^*$, producing a clean dataset $x_C$,

• the attacker adds malicious examples $x_P$ to the training dataset,

• the defender learns a model $\theta$ from the full dataset $x = x_C \cup x_P$, reporting a test loss $L(\theta)$.

As discussed in [176], data sanitation defenses to limit the increase of the test loss include two steps: (i) data cleaning (e.g., removing outliers which are likely to be poisoned examples), producing a feasible set $\mathcal{F} \subset \chi \times \Upsilon$, and (ii) minimizing a margin-based loss on the cleaned dataset. The learned model is then $\theta = \arg\min_{\theta \in \Theta} L(\theta; (x_C \cup x_P) \cap \mathcal{F})$.

Data poisoning can be viewed as a game between the attacker and the defender, where the game is based on the conflicting objectives of the two players on the test loss. At each step, the attacker is required to maximize the test loss over $(\xi, l) \in \mathcal{F}$. In the task of binary classification for SVM, the maximization of the hinge loss for each $y \in \{+1, -1\}$ is expressed by the quadratic programming problem

$\min_{\xi \in \mathbb{R}^d} y \sum_{i=1}^{d} \theta_i \xi_i$  (105)
s.t. $\|\xi - \mu_{+y}\|_2^2 \leq r_y^2$  (106)
$|\langle \xi - \mu_{+y}, \mu_{+y} - \mu_{-y} \rangle| \leq s_y$.  (107)

Constraint (106) removes the data points that are outside a sphere with centroid $\mu_y$ and radius $r_y$ (sphere defense). Constraint (107) eliminates the points that are farther than a distance tolerance $s_y$ from the line between the centroids $\mu_{+y} = \mathbb{E}[\xi | y = +1]$ and $\mu_{-y} = \mathbb{E}[\xi | y = -1]$ (slab defense).
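A small NumPy sketch of the two defenses: a point survives data sanitation if it satisfies both the sphere constraint (106) and the slab constraint (107). The centroid, radius, and tolerance arguments are assumptions about how these statistics would be estimated in practice.

```python
import numpy as np

def passes_sphere_defense(xi, mu_y, r_y):
    """Constraint (106): keep points inside a ball of radius r_y around the
    centroid of the point's class."""
    return np.linalg.norm(xi - mu_y) ** 2 <= r_y ** 2

def passes_slab_defense(xi, mu_pos, mu_neg, s_y, y):
    """Constraint (107): keep points within tolerance s_y along the direction
    between the two class centroids."""
    mu_y = mu_pos if y == +1 else mu_neg
    return abs(np.dot(xi - mu_y, mu_pos - mu_neg)) <= s_y

def sanitize(X, Y, mu_pos, mu_neg, r, s):
    """Feasible set F: boolean mask of training points surviving both defenses.
    r and s are dictionaries keyed by label (+1, -1)."""
    keep = []
    for xi, y in zip(X, Y):
        mu_y = mu_pos if y == +1 else mu_neg
        keep.append(passes_sphere_defense(xi, mu_y, r[y]) and
                    passes_slab_defense(xi, mu_pos, mu_neg, s[y], y))
    return np.array(keep)
```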

Poisoning attacks can also be performed in a semi-online or online fashion, where training data is processed in a streaming manner and not in batches. Compared to the offline case, the attacker has knowledge of the order in which the training data is obtained. In the semi-online context, the attacker can modify part of the training data stream so as to maximize the classification loss, and the evaluation of the objective (loss) is done only at the end of the training. Instead, in the online modality, the classifier is updated and evaluated during the training process. In [189], a white-box attacker's behavior in online learning for a linear classifier $w^T x$ with binary labels $y \in \{-1, +1\}$ is formulated. The data stream $S$ arrives in $T$ instants ($S = S_1, \dots, S_T$) and the classification weights are updated using an online gradient descent algorithm [198] such that

$w_{t+1} = w_t - \eta_t \left( \nabla L(w_t, (x_t, y_t)) + \nabla \Omega(w_t) \right)$,


where $\Omega$ is a regularizer, $\eta_t$ is the step length of the iterate update, and $L$ is a convex loss function. The semi-online attacker can be formulated as

$\max_{S \in \mathcal{F}_T} g(w_T)$  (108)
s.t. $|S \setminus S_{\mathrm{train}}| \leq K$,  (109)
$w_t = w_0 - \sum_{\tau=0}^{t-1} \eta_\tau \left( \nabla L(w_\tau, S_\tau) + \nabla \Omega(w_\tau) \right)$,  $1 \leq t \leq T$  (110)

where $\mathcal{F}_T$ is the cleaned dataset at time $T$, $g$ is the attacker's objective (e.g., the classification error on the test set), $|\cdot|$ denotes the cardinality of a set, $S_{\mathrm{train}}$ is an input data stream, and $K$ is a defined upper bound on the number of changed examples in $S_{\mathrm{train}}$.

With respect to the offline case, the estimated weight vector $w_t$ is a complex function of the data stream $S$, which makes the gradient computation more challenging, and the KKT conditions do not hold. The optimization problem is simplified by considering a convex surrogate for the objective function, given by the logistic loss. In addition, the expectation is conducted over a separate validation dataset and a label inversion procedure is implemented to cope with the multiple local maxima of the classifier function. The fully-online case can also be addressed by replacing the objective with $\sum_{t=1}^{T} f(w_t)$.

5.4 Activation ensembles

Another important research direction in neural network architectures investigates the possibility of adopting multiple activation functions inside the training process. Some examples in this framework are given by the maxout units [97], returning the maximum of multiple linear affine functions, and the network-in-network paradigm [138] where the classical ReLU function is replaced by a fully connected network. In [10], adaptive piecewise linear activation functions are learned during training. Specifically, for each unit $i$ and unit input $x$, the activation $h_i(x)$ is considered as

$h_i(x) = \max(0, x) + \sum_{s=1}^{S} a_i^s \max(0, -x + b_i^s)$,

where the number of hinges $S$ is a hyperparameter to be fixed in advance, while the variables $a_i^s$, $b_i^s$ have to be learned. For large enough $S$, $h_i(x)$ can approximate a class of continuous piecewise-linear functions [10].


In a more general perspective, Ensemble Layers are proposed in [106] to consider multiple activation functions in a neural network. The idea is to embed a family of activation functions $H_1, \dots, H_m$ and let the network itself choose the magnitude of the activations during the training. To promote a relatively equal contribution to learning, the activation functions need to be scaled to the interval $[0, 1]$. To measure the impact of the activation in the neural network, each function $H_j$ is associated with a continuous variable $\alpha_j$. The resulting activation for neuron $i$ is then given by the weighted sum of the scaled functions

$h_i^j(z) = \frac{H_j(z) - \min_{\xi \in x}(H_j(z_{\xi,i}))}{\max_{\xi \in x}(H_j(z_{\xi,i})) - \min_{\xi \in x}(H_j(z_{\xi,i})) + \epsilon}$  (111)

$H_i(z) = \sum_{j=1}^{m} \alpha_i^j h_i^j(z)$  (112)

where $z_{\xi,i}$ is the $i$-th output associated with training example $\xi$ and $\epsilon$ is a small tolerance. To achieve the desired scaling, it is sufficient to let $\xi$ vary over a minibatch in $x$ [106]. Equation (112) can be modified so as to allow the network to leave the activations in their original state, which is achieved by adding the parameters $\eta$ and $\delta$ in $H_i(z) = \sum_{j=1}^{m} \alpha_i^j (\eta_j h_i^j(z) + \delta_j)$.

The magnitude of the weights $\alpha_j$ is then limited in a projection subproblem: for each neuron, the network should choose an activation function and therefore all the weights should sum to 1. If $\hat{\alpha}_j$ are the weight values obtained by gradient descent while training, then the projected weights are found by solving

$\min_{\alpha} \sum_{j=1}^{m} \frac{1}{2} (\alpha_j - \hat{\alpha}_j)^2$  (113)
s.t. $\sum_{j=1}^{m} \alpha_j = 1$  (114)
$\alpha_j \geq 0$,  $j = 1, \dots, m$.  (115)
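Problem (113)-(115) is the Euclidean projection onto the probability simplex, for which a standard sort-and-threshold routine gives the exact solution; a NumPy sketch follows.

```python
import numpy as np

def project_to_simplex(alpha_hat):
    """Solve (113)-(115): Euclidean projection of alpha_hat onto the simplex
    {alpha >= 0, sum(alpha) = 1} via the standard sort-and-threshold rule."""
    a = np.asarray(alpha_hat, dtype=float)
    u = np.sort(a)[::-1]                       # sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(a)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(a + tau, 0.0)

# Example: weights from gradient descent projected back onto the simplex.
# project_to_simplex([0.7, 0.5, -0.1])  ->  array([0.6, 0.4, 0. ])
```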

6 Emerging paradigms

6.1 Machine Teaching

An important aspect of machine learning is the size of the training set needed to successfully learn a model. As such, the Teaching Dimension problem identifies the minimum size of a training set to correctly teach a model [95, 174]. The teaching dimension of linear learners, such as ridge regression, SVM, and logistic regression, has been recently presented in [139]. With the intent to generalize the teaching dimension problem to a variety of teaching tasks, [196] and [197] provide the Machine Teaching framework. Machine Teaching is essentially an inverse problem to Machine Learning. While in a learning task the training dataset $x$ is given and the model parameters $\theta = \theta^*$ have to be determined, the role of a teacher is to let a learner approximately learn a model $\theta^*$ by providing a proper set of training examples (also called the teaching dataset in this context). A Machine Teaching task requires the selection of three components:

• A Teaching Risk associated with the approximation error of the learner.

• A Teaching Cost expressing the convenience of the teaching dataset, from the perspective of the teacher, weighted by a regularization term $\eta$.

• A learner L.

Formally, machine teaching can be cast as a bilevel optimization problem

$\min_{x, \theta} TR(\theta) + \eta\, TC(x)$  (116)
s.t. $\theta = L(x)$,  (117)

where the upper optimization is the teacher's problem and the lower optimization $L(x)$ is the learner's machine learning problem. $TR(\theta)$ and $TC(x)$ are the teaching risk and the teaching cost, respectively. The teacher is aware of the learning algorithm, which could be a classifier (such as those of Section 3) or a deep neural network. Machine teaching encompasses a wide variety of applications, such as data poisoning attacks, computer tutoring systems, and two-player games where the teacher is an encoder of the teaching set and the learner identifies a model $\theta$.

The optimization problem is NP-hard due to its combinatorial and bilevel nature. The teacher is typically optimizing over a discrete space of teaching sets, hence for some problem instances the submodularity properties of the problem may be of interest. For problems with a small teaching set, it is possible to formulate the teaching problem as a mixed integer nonlinear program. The computation of the optimal training set remains, in general, an open problem, and is especially challenging in the case where the learning algorithm does not have a closed-form solution with respect to the training set [196].

Alternatively, the single level formulation of the problem minimizes the teaching cost and allows for either approximate or exact teaching

$\min_{x, \theta} TC(x)$  (118)
s.t. $TR(\theta) \leq \epsilon$  (119)
$\theta = L(x)$.  (120)

Given a teaching budget $B$, the resulting formulation is then

$\min_{x, \theta} TR(\theta)$  (121)
s.t. $TC(x) \leq B$  (122)
$\theta = L(x)$.  (123)

For the teaching dimension problem, the teaching cost is the cardinality of the teaching dataset, namely its 0-norm. If the empirical loss $L$ guides the learning process and $\lambda$ is the regularization weight, then the teaching dimension problem can be formulated as

$\min_{x, \theta} \eta \|x\|_0$  (124)
s.t. $\|\theta - \theta^*\| \leq \epsilon$  (125)
$\theta = \arg\min_{\theta \in \Theta} \sum_{\xi \in x} L(\xi, \theta) + \lambda \|\theta\|^2$.  (126)
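For learners with a closed-form solution, such as ridge regression, a tiny brute-force search over subsets of a candidate pool illustrates formulation (124)-(126); the sketch below is only a conceptual stand-in for the mixed integer nonlinear programs mentioned above, and all names are ours.

```python
import itertools
import numpy as np

def ridge_learner(X, y, lam=0.1):
    """Learner L(x): closed-form ridge regression on the teaching set."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def teaching_set(pool_X, pool_y, theta_star, eps=1e-2, lam=0.1):
    """Smallest subset of the candidate pool (NumPy arrays) whose learned ridge
    model is within eps of theta_star."""
    n = len(pool_X)
    for size in range(1, n + 1):
        for idx in itertools.combinations(range(n), size):
            theta = ridge_learner(pool_X[list(idx)], pool_y[list(idx)], lam)
            if np.linalg.norm(theta - theta_star) <= eps:
                return list(idx), theta
    return None, None
```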

We note that the order of the training items matters in the sequential learning context [157], while it is not relevant for batch learners. Sequential teaching can be considered in online and adaptive frameworks such as stochastic gradient descent algorithms and reinforcement learners.

The machine teaching framework can also address the case of multiple learners $\lambda$ taught by the same teacher. The challenge is to devise a common teaching set for all learners. Several mathematical formulations can be devised, depending on the optimization criterion chosen by the teacher, which include

• Minimax risk, which optimizes for the worst learner

$\min_{x, \theta_\lambda} \max_{\lambda} TR(\theta_\lambda) + \eta\, TC(x)$  (127)
s.t. $\theta_\lambda = L_\lambda(x)$.  (128)


• Bayes risk, which optimizes for the average learner

$\min_{x, \theta_\lambda} \int_{\lambda} f(\lambda)\, TR(\theta_\lambda)\, d\lambda + \eta\, TC(x)$  (129)
s.t. $\theta_\lambda = L_\lambda(x)$,  (130)

where $f(\lambda)$ is a known probability distribution of the learner.

Machine teaching approaches tailored to specific learners have also been explored in the literature. In [195], a method is proposed for Bayesian learners, while [158] focuses on Generalized Context Model (GCM) learners. In [150], the bilevel optimization of machine teaching is explored to devise optimal data poisoning attacks for a broad family of learners (i.e., SVM, logistic regression, linear regression). The attacker seeks the minimum training set poisoning that attacks the learned model. By using the KKT conditions of the learner's problem, the bilevel formulation is turned into a single level optimization problem and solved using a gradient approach.

6.2 Empirical Model Learning

Empirical model learning (EML) aims to integrate machine learning models in combinatorial optimization in order to support decision-making in high-complexity systems through prescriptive analytics. This goes beyond the traditional what-if approaches where a predictive model (e.g., a simulation model) is used to estimate the parameters of an optimization model. A general framework for an EML approach is provided in [141] and requires the following:

• A vector $x$ of decision variables with $x_i$ feasible over the domain $D_i$.

• A mathematical encoding $h$ of the Machine Learning model.

• A vector $z$ of observables obtained from $h$.

• Logical predicates $g_j(x, z)$ such as mathematical programming inequalities or combinatorial restrictions in constraint programming.

• A cost function $f(x, z)$.

EML then solves the following optimization problem

$\min f(x, z)$  (131)
s.t. $g_j(x, z)$  $\forall j \in J$  (132)
$z = h(x)$  (133)
$x_i \in D_i$  $\forall x_i \in x$.  (134)


The combinatorial structure of the problem is defined by (131), (132), and (134), while (133) embeds the empirical machine learning model in the combinatorial problem. Embedding techniques for neural networks and decision trees are presented in [141] using four combinatorial optimization approaches: mixed integer non-linear programming, constraint programming, SAT Modulo Theories, and local search.

7 Useful resources

This section provides a summary of resources of interest for research in machine learning. Table 2 lists the datasets used by the papers that are discussed in this review article. The details that are provided include the name, a brief description, the number of instances, the default task that can be accomplished, and the year of appearance. For each dataset, Table 3 also provides a bibliographic reference, a url, and the list of papers in this review article that use it. MNIST and CIFAR-10 are by far the most commonly used datasets. However, as pointed out by several researchers, classical machine learning tools can achieve 97.5% accuracy on MNIST and therefore more complex datasets should be used as benchmarks. Image datasets similar to MNIST but more complex include Fashion-MNIST and EMNIST. A large repository is also [77], which currently maintains 442 datasets. Another important list of resources are the libraries and frameworks to perform machine learning and deep learning tasks. These are provided in Table 4. MXNet is known to be one of the most scalable frameworks.

8 Conclusion

Mathematical programming constitutes a fundamental aspect of many machine learning models where the training of these models is a large scale optimization problem. This paper surveyed a wide range of machine learning models, namely regression, classification, clustering, and deep learning, as well as the new emerging paradigms of machine teaching and empirical model learning. The important mathematical optimization models for training these machine learning models are presented and discussed. Exploiting the large scale optimization formulations and devising model specific solution approaches is an important line of research, particularly benefiting from the maturity of commercial optimization software. The nonlinearity of the models, the associated uncertainty of the data, as well as the scale of the problems however represent some of the very important and compelling challenges to the mathematical optimization community. Furthermore, bilevel formulations play a big role in adversarial learning tasks [104], including adversarial training, data poisoning, and neural network robustness.

While this survey does not discuss numerical optimization techniques since they were recently surveyed in [43, 71, 190], we note the fundamental role of the stochastic gradient algorithm [166] and the alternating direction method of multipliers [44] in large scale machine learning. We also highlight the potential impact of machine learning on advancing the solution approaches of mathematical programming [25].

References

[1] CIFAR-10 and CIFAR-100 database. https://www.cs.toronto.edu/~kriz/cifar.html.
[2] EMNIST database. https://www.nist.gov/itl/iad/image-group/emnist-dataset.
[3] Imagenet database. http://www.image-net.org/.
[4] Imagenet large scale visual recognition challenge 2012. http://image-net.org/challenges/LSVRC/2012/browse-synsets.
[5] Labeled faces in the wild database. http://vis-www.cs.umass.edu/lfw/.
[6] MNIST database. http://yann.lecun.com/exdb/mnist/.
[7] Deep Learning for Java. Open-source, distributed, scientific computing for the JVM, 2017. https://deeplearning4j.org/.
[8] Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, 2017. https://pytorch.org/.
[9] Theano website, 2017. http://deeplearning.net/software/theano/.
[10] Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
[11] Daniel Aloise, Pierre Hansen, and Leo Liberti. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming, 131(1):195–220, Feb 2012.


[12] Edoardo Amaldi and Stefano Coniglio. A distance-based point-reassignment heuristic for the k-hyperplane clustering problem. European Journal of Operational Research, 227(1):22–29, 2013.
[13] Massih Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems, pages 28–36, 2009.
[14] M. Azad and M. Moshkov. Minimization of decision tree depth for multi-label decision tables. In 2014 IEEE International Conference on Granular Computing (GrC), pages 7–12, Oct 2014.
[15] M. Azad and M. Moshkov. Classification and optimization of decision trees for inconsistent decision tables represented as mvd tables. In 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 31–38, Sept 2015.
[16] Mohammad Azad and Mikhail Moshkov. Minimization of decision tree average depth for decision tables with many-valued decisions. Procedia Computer Science, 35:368–377, 2014. Knowledge-Based and Intelligent Information & Engineering Systems 18th Annual Conference, KES-2014 Gdynia, Poland, September 2014 Proceedings.
[17] Mohammad Azad and Mikhail Moshkov. Multi-stage optimization of decision and inhibitory trees for decision tables with many-valued decisions. European Journal of Operational Research, 263(3):910–921, 2017.
[18] Adil M Bagirov and John Yearwood. A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems. European Journal of Operational Research, 170(2):578–596, 2006.
[19] T Bailey and AK Jain. A note on distance-weighted k-nearest neighbor rules. IEEE Transactions on Systems, Man, and Cybernetics, (4):311–313, 1978.
[20] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Enhanced Higgs boson to τ+τ− search with deep learning. Physical Review Letters, 114(11):111801, 2015.
[21] Marco Barreno, Blaine Nelson, Anthony D Joseph, and J Doug Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.


[22] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya Nori, and Antonio Criminisi. Measuring neural net robustness with constraints. In Advances in Neural Information Processing Systems, pages 2613–2621, 2016.
[23] Peter N Belhumeur, Joao P Hespanha, and David J Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. In European Conference on Computer Vision, pages 43–58. Springer, 1996.
[24] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization, volume 28. Princeton University Press, 2009.
[25] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128, 2018.
[26] Kristin P Bennett. Decision tree construction via linear programming. Center for Parallel Optimization, Computer Sciences Department, University of Wisconsin, 1992.
[27] Kristin P Bennett and J Blue. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214:24, 1996.
[28] Dimitris Bertsimas and Martin S. Copenhaver. Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3):931–942, 2018.
[29] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
[30] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481, 2014.
[31] Dimitris Bertsimas and Angela King. OR forum–An algorithmic approach to linear regression. Operations Research, 64(1):2–16, 2016.
[32] Dimitris Bertsimas, Angela King, Rahul Mazumder, et al. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
[33] Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse classification and phase transitions: A discrete optimization perspective. arXiv preprint arXiv:1710.01352, 2017.


[34] Dimitris Bertsimas and Romy Shioda. Classification and regression via integer optimization. Operations Research, 55(2):252–271, 2007.
[35] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In ECML/PKDD, 2013.
[36] Battista Biggio, Giorgio Fumera, and Fabio Roli. Multiple classifier systems under attack. In International Workshop on Multiple Classifier Systems, pages 74–83. Springer, 2010.
[37] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, pages 1467–1474. Omnipress, 2012.
[38] Jock A. Blackard. Forest covertype dataset, 1999. https://archive.ics.uci.edu/ml/datasets/covertype.
[39] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999.
[40] Pierre Bonami, Andrea Lodi, Andrea Tramontani, and Sven Wiese. On mathematical programming with indicator constraints. Mathematical Programming, 151(1):191–223, Jun 2015.
[41] Pierre Bonami, Andrea Lodi, and Giulia Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
[42] Radu Ioan Bot and Nicole Lorenz. Optimization problems in statistical learning: Duality and optimality conditions. European Journal of Operational Research, 213(2):395–404, 2011.
[43] Leon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[44] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[45] PS Bradley and OL Mangasarian. Massive data discrimination via linear support vector machines. Optimization Methods and Software, 13(1):1–10, 2000.
[46] Leo Breiman. Classification and regression trees. 1984.
[47] Michael Bruckner, Christian Kanzow, and Tobias Scheffer. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 13(Sep):2617–2654, 2012.
[48] Michael Bruckner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–555. ACM, 2011.
[49] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gael Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
[50] Rudy Bunel, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. A unified view of piecewise linear neural network verification. CoRR, abs/1711.00455, 2017.
[51] Sonia Cafieri, Alberto Costa, and Pierre Hansen. Reformulation of a model for hierarchical divisive graph modularity maximization. Annals of Operations Research, 222(1):213–226, 2014.
[52] Sonia Cafieri, Pierre Hansen, and Leo Liberti. Improving heuristics for network modularity maximization using an exact algorithm. Discrete Applied Mathematics, 163:65–72, 2014. Matheuristics 2010.
[53] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.


[54] Emilio Carrizosa, Nenad Mladenović, and Raca Todosijević. Variable neighborhood search for minimum sum-of-squares clustering on networks. European Journal of Operational Research, 230(2):356–363, 2013.
[55] Emilio Carrizosa and Dolores Romero Morales. Supervised classification and mathematical optimization. Computers & Operations Research, 40(1):150–165, 2013.
[56] Antoni B Chan, Nuno Vasconcelos, and Gert RG Lanckriet. Direct convex relaxations of sparse svm. In Proceedings of the 24th International Conference on Machine Learning, pages 145–153. ACM, 2007.
[57] Antonio Augusto Chaves and Luiz Antonio Nogueira Lorena. Clustering search algorithm for the capacitated centered clustering problem. Computers & Operations Research, 37(3):552–558, 2010.
[58] Xiaobo Chen, Jian Yang, David Zhang, and Jun Liang. Complete large margin linear discriminant analysis using mathematical programming approach. Pattern Recognition, 46(6):1579–1594, 2013.
[59] Chih-Hong Cheng, Georg Nuhrenberg, and Harald Ruess. Maximum resilience of artificial neural networks. In Deepak D'Souza and K. Narayan Kumar, editors, Automated Technology for Verification and Analysis, pages 251–268, Cham, 2017. Springer International Publishing.
[60] Igor Chikalov, Shahid Hussain, and Mikhail Moshkov. Bi-criteria optimization of decision trees with applications to data analysis. European Journal of Operational Research, 266(2):689–701, 2018.
[61] Francois Chollet et al. Keras. https://keras.io, 2015.
[62] Alexandra Chouldechova and Trevor Hastie. Generalized additive model selection. arXiv preprint arXiv:1506.03850, 2015.
[63] Adam Coates. STL-10. http://cs.stanford.edu/~acoates/stl10.
[64] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.


[65] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik. EMNIST: an extension of MNIST to handwritten letters. CoRR, abs/1702.05373, 2017.
[66] Gordon V. Cormack and Thomas R. Lynam. TREC 2007 public corpus, 2005. http://plg.uwaterloo.ca/~gvcormac/treccorpus07.
[67] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
[68] Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
[69] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[70] Louis Anthony Cox, Qiu Yuping, and Warren Kuehner. Heuristic least-cost computation of discrete classification functions with uncertain argument values. Annals of Operations Research, 21(1):1–29, 1989.
[71] F. E. Curtis and K. Scheinberg. Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning. ArXiv e-prints, June 2017.
[72] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[73] Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.
[74] Stephan Dempe, Vyacheslav Kalashnikov, and Roger Z. Rios-Mercado. Discrete bilevel programming: Application to a natural gas cash-out problem. European Journal of Operational Research, 166(2):469–488, 2005.
[75] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009.
[76] Dua Dheeru and Efi Karra Taniskidou. Isolet dataset, 2017. https://archive.ics.uci.edu/ml/datasets/isolet.


[77] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[78] Gonzalo I Diaz, Achille Fokoue-Nkoutche, Giacomo Nannicini, and Horst Samulowitz. An effective algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and Development, 61(4):9–1, 2017.
[79] Chris Ding and Xiaofeng He. K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning, page 29. ACM, 2004.
[80] Ulrich Dorndorf and Erwin Pesch. Fast clustering algorithms. ORSA Journal on Computing, 6(2):141–153, 1994.
[81] Stephan Dreiseitl and Lucila Ohno-Machado. Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics, 35(5-6):352–359, 2002.
[82] Rudiger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. CoRR, abs/1705.01320, 2017.
[83] Diana Fanghanel and Stephan Dempe. Bilevel programming with discrete lower level problems. Optimization, 58(8):1029–1047, 2009.
[84] Mark A Fanty and Ronald Cole. Spoken letter recognition. In NIPS, page 220, 1990.
[85] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
[86] Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato, and Pietro Perona. Caltech-101. http://www.vision.caltech.edu/Image_Datasets/Caltech101/.
[87] Matteo Fischetti and Jason Jo. Deep neural networks and mixed integer linear optimization. Constraints, pages 1–14, 2018.
[88] R.A. Fisher. Iris data set. http://archive.ics.uci.edu/ml/datasets/Iris.
[89] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[90] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

[91] Fu Jie Huang and Yann LeCun. The NORB dataset, v1.0. https://cs.nyu.edu/~ylclab/data/norb-v1.0/.

[92] Keinosuke Fukunaga. Introduction to statistical pattern recognition. Elsevier, 2013.

[93] Bissan Ghaddar and Joe Naoum-Sawaya. High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265(3):993–1004, 2018.

[94] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pages 353–360. ACM, 2006.

[95] S.A. Goldman and M.J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

[96] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.

[97] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[98] Google AI Google Brain. Tensorflow models. https://github.com/tensorflow/models.

[99] Google AI Google Brain. Tensorflow: An open source machine learning framework for everyone, 2015. https://www.tensorflow.org/.

[100] Ignacio E Grossmann. Review of nonlinear mixed-integer and disjunctive programming techniques. Optimization and Engineering, 3(3):227–252, 2002.

[101] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.

[102] Zeynep H Gumus and Christodoulos A Floudas. Global optimization of mixed-integer bilevel programming problems. Computational Management Science, 2(3):181–212, 2005.

[103] Oktay Gunluk, Jayant Kalagnanam, Matt Menickelly, and Katya Scheinberg. Optimal decision trees for categorical data via integer programming. Optimization Online, 6404, 2018.

[104] Jihun Hamm and Yung-Kyun Noh. K-beam subgradient descent for minimax optimization. CoRR, abs/1805.11640, 2018.

[105] Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Mathematical Programming, 79(1-3):191–215, 1997.

[106] M. Harmon and D. Klabjan. Activation Ensembles for Deep Neural Networks. ArXiv e-prints, February 2017.

[107] Trevor J Hastie and Robert Tibshirani. Generalized additive models. Statistical Science, pages 297–310, 1986.

[108] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[109] Ralf Herbrich. Learning kernel classifiers: theory and algorithms. MIT Press, 2001.

[110] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[111] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: detection, alignment, and recognition, 2008.

[112] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.

[113] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[114] IBM. Adversarial robustness toolbox (ART v0.3.0), 2017. https://github.com/IBM/adversarial-robustness-toolbox.

[115] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[116] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning, volume 112. Springer, 2013.

[117] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153, 2009.

[118] Ian Jolliffe. Principal component analysis. In International Encyclopedia of Statistical Science, pages 1094–1096. Springer, 2011.

[119] Napsu Karmitsa, Adil M Bagirov, and Sona Taheri. New diagonal bundle method for clustering problems in large data sets. European Journal of Operational Research, 263(2):367–379, 2017.

[120] A. Karpathy. ConvNetJS: Deep learning in your browser, 2014. http://cs.stanford.edu/people/karpathy/convnetjs.

[121] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117. Springer, 2017.

[122] Shuichi Kawano, Hironori Fujisawa, Toyoyuki Takada, and Toshihiko Shiroishi. Sparse principal component regression with adaptive loading. Computational Statistics & Data Analysis, 89:192–203, 2015.

[123] Shuichi Kawano, Hironori Fujisawa, Toyoyuki Takada, and Toshihiko Shiroishi. Sparse principal component regression for generalized linear models. Computational Statistics & Data Analysis, 124:180–196, 2018.

[124] Elias B Khalil, Bistra Dilkina, George L Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[125] Elias B Khalil, Amrita Gupta, and Bistra Dilkina. Combinatorial attacks on binarized neural networks. arXiv preprint arXiv:1810.03538, 2018.

[126] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra N Dilkina. Learning to branch in mixed integer programming. In AAAI, pages 724–731, 2016.

[127] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[128] M. Kraus, S. Feuerriegel, and A. Oztekin. Deep learning in business analytics and operations research: Models, applications and managerial implications. ArXiv e-prints, June 2018.

[129] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[130] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016. Published as a conference paper at ICLR 2017.

[131] Gert RG Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3(Dec):555–582, 2002.

[132] Quoc V Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE, 2013.

[133] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[134] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–104. IEEE, 2004.

[135] Francesco Leofante, Nina Narodytska, Luca Pulina, and Armando Tacchella. Automated verification of neural networks: Advances, challenges and perspectives. arXiv preprint arXiv:1805.09938, 2018.

[136] Mark Lewis, Haibo Wang, and Gary Kochenberger. Exact solutions to the capacitated clustering problem: A comparison of two models. Annals of Data Science, 1(1):15–23, Mar 2014.

[137] Shuangtao Li, Yuanke Chen, Yanlin Peng, and Lin Bai. Learning more robust features with adversarial training. arXiv preprint arXiv:1804.07757, 2018.

[138] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[139] Ji Liu and Xiaojin Zhu. The teaching dimension of linear learners. The Journal of Machine Learning Research, 17(1):5631–5655, 2016.

[140] Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. TOP, 25(2):207–236, Jul 2017.

[141] Michele Lombardi, Michela Milano, and Andrea Bartolini. Empirical decision model learning. Artificial Intelligence, 244:343–367, 2017.

[142] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 641–647. ACM, 2005.

[143] Andrew Maas. Large movie review dataset, 2011. http://ai.stanford.edu/~amaas/data/sentiment/.

[144] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies - volume 1, pages 142–150. Association for Computational Linguistics, 2011.

[145] Wolfgang Maass, Georg Schnitger, and Eduardo D Sontag. A comparison of the computational power of sigmoid and boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning, pages 127–151. Springer, 1994.

[146] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[147] Feng Mai, Michael J Fry, and Jeffrey W Ohlmann. Model-based capacitated clustering with posterior regularization. European Journal of Operational Research, 2018.

[148] Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. UCI Spambase data set, 1997. https://archive.ics.uci.edu/ml/datasets/spambase.

[149] Massih-Reza Amini and Cyril Goutte. Reuters RCV1 RCV2 multilingual, multiview text categorization test collection data set, 2009. https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection.

[150] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877, 2015.

[151] Guido Montufar. Notes on the number of linear regions of deep neural networks. 2017.

[152] Marcos Negreiros and Augusto Palhano. The capacitated centred clustering problem. Computers & Operations Research, 33(6):1639–1663, 2006.

[153] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.

[154] Maria-Irina Nicolae, Mathieu Sinn, Minh Ngoc Tran, Ambrish Rawat, Martin Wistuba, Valentina Zantedeschi, Nathalie Baracaldo, Bryant Chen, Heiko Ludwig, Ian Molloy, and Ben Edwards. Adversarial robustness toolbox v0.3.0. CoRR, 1807.01069, 2018.

[155] University of Washington and Carnegie Mellon University. Apache MXNet: A flexible and efficient library for deep learning. https://mxnet.apache.org/.

[156] Wisconsin State Climatology Office. Madison climate. http://www.aos.wisc.edu/~sco/clim-history/7cities/madison.html.

[157] Xuezhou Zhang, Hrag Gorune Ohannessian, Ayon Sen, Scott Alfeld, and Xiaojin Zhu. Optimal teaching for online perceptrons. University of Wisconsin - Madison.

[158] Kaustubh R Patil, Xiaojin Zhu, Lukasz Kopec, and Bradley C Love. Optimal teaching for limited-capacity human learners. In Advances in neural information processing systems, pages 2465–2473, 2014.

[159] Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Wine quality dataset, 2009. https://archive.ics.uci.edu/ml/datasets/wine+quality.

[160] Payne and Meisel. An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, C-26(9):905–916, Sept 1977.

[161] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[162] MR Rao. Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66(335):622–626, 1971.

[163] Kashif Rasul and Han Xiao. Imagenet large scale visual recognition challenge 2012. https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/.

[164] Robert Reris and J Paul Brooks. Principal component analysis and optimization: a tutorial. 2015.

[165] Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294–2302, 2011.

[166] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[167] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[168] Burcu Saglam, F Sibel Salman, Serpil Sayın, and Metin Turkay. A mixed-integer programming approach to the clustering problem with an application in customer segmentation. European Journal of Operational Research, 173(3):866–879, 2006.

[169] Georges K Saharidis and Marianthi G Ierapetritou. Resolution method for mixed integer bi-level linear problems based on decomposition technique. Journal of Global Optimization, 44(1):29–51, 2009.

[170] Everton Santi, Daniel Aloise, and Simon J Blanchard. A model for clustering data from heterogeneous dissimilarities. European Journal of Operational Research, 253(3):659–672, 2016.

[171] Stephan Scheuerer and Rolf Wendolsky. A scatter search heuristic for the capacitated clustering problem. European Journal of Operational Research, 169(2):533–547, 2006. Feature Cluster on Scatter Search Methods for Optimization.

[172] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.

[173] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 2018.

[174] Ayumi Shinohara and Satoru Miyano. Teachability in computational learning. New Generation Computing, 8(4):337–347, Feb 1991.

[175] Raymond J Solomonoff. An inductive inference machine. In IRE Convention Record, Section on Information Theory, volume 2, pages 56–62, 1957.

[176] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pages 3517–3529, 2017.

[177] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[178] Pakize Taylan, G-W Weber, and Amir Beck. New approaches to regression by generalized additive models and continuous optimization for modern applications in finance, science and technology. Optimization, 56(5-6):675–698, 2007.

[179] Sunil Tiwari, H.M. Wee, and Yosef Daryanto. Big data analytics in supply chain management between 2010 and 2016: Insights to industries. Computers & Industrial Engineering, 115:319–330, 2018.

[180] Vincent Tjeng and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356, 2017.

[181] Florian Tramer, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

[182] Vladimir Vapnik. Statistical learning theory, volume 3. Wiley, New York, 1998.

[183] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.

[184] L. Vicente, G. Savard, and J. Judice. Discrete linear bilevel programming problem. Journal of Optimization Theory and Applications, 89(3):597–614, Jun 1996.

[185] Roman Václavík, Antonín Novák, Přemysl Šůcha, and Zdeněk Hanzálek. Accelerating the branch-and-price algorithm using machine learning. European Journal of Operational Research, 2018.

[186] Gang Wang, Angappa Gunasekaran, Eric W.T. Ngai, and Thanos Papadopoulos. Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176:98–110, 2016.

[187] Hua Wang, Chris Ding, and Heng Huang. Multi-label linear discriminant analysis. In European Conference on Computer Vision, pages 126–139. Springer, 2010.

[188] Li Wang, Ji Zhu, and Hui Zou. The doubly regularized support vector machine. Statistica Sinica, 16(2):589, 2006.

[189] Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.

[190] Stephen J Wright. Optimization algorithms for data analysis. IAS/Park City Mathematics Series, to appear, 2017.

[191] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

[192] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. The Street View House Numbers (SVHN) dataset. http://ufldl.stanford.edu/housenumbers/.

[193] Junlong Zhang and Osman Y Ozaltın. A branch-and-cut algorithm for discrete bilevel linear programs. Optim. Online, 2017.

[194] Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.

[195] Xiaojin Zhu. Machine teaching for bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, pages 1905–1913, 2013.

[196] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087, 2015.

[197] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N Rafferty. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.

[198] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[199] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

Name | Description | Size | Default task | Year
MNIST | Handwritten digits | 60,000 | Classification | 1998
CIFAR-10 | Color images | 60,000 | Classification | 2009
CIFAR-100 | As CIFAR-10, but 100 classes | 60,000 | Classification | 2009
ImageNet | Labeled images | 14,197,122 | Object and scene recognition | 2009
ISOLET | Spoken letter names | 7,797 | Classification | 1994
STL-10 | Color images | 8,500 labeled, 10,000 unlabeled | Classification/Unsupervised learning | 2011
Higgs | Collision events (25 features) | 80,000,000 | Classification | 2015
Wild | Face images | >13,000 | Classification | 2008
QuocNet | Images | 37,000 | Classification | 2011
EMNIST | Handwritten characters and digits | 2,255,710 (6 groups) | Classification | 2017
Iris | Iris plants | 150 | Classification, pattern recognition | 1936
Caltech-101 | Images of objects | 9,146 | Classification | 2004
NORB | Uniform-colored toys | 50 | 3D object recognition | 2004
ImageNet Challenge 2012 | Photographs | 150,000 | Visual recognition | 2012
Fashion-MNIST | Images | 60,000 | Classification | 2017
SVHN | Digit color images | 630,000 | Classification | 2011
ACAS data set | DNN properties | 500 | DNN verification | 2017
CollisionDetection | DNN properties | 188 | DNN verification | 2017
PCAMNIST | DNN properties | 27 | DNN verification | 2018
TREC Corpus | Messages | 75,419 | Spam classification | 2005
UCI Spambase Data Set | E-mails | 4,601 | Spam classification | 1999
Reuters Corpus | Text documents | 111,740 | Document classification | 2009
Forest Cover Type | Cartographic variables | 581,012 | Classification | 1999
IMDB | Movie reviews | 50,000 | Sentiment classification | 2011
Wine Quality | Wine samples | 4,898 | Classification or regression | 2009
Wisconsin climatology | Climate data | - | Classification or regression | -

Table 2: Datasets - Description.

Name | Reference | Repository | Used in
MNIST | [133] | [6] | [59], [172], [180], [106], [22], [177], [113], [53], [146], [78], [125], [87], [69], [94], [73], [35], [101], [173], [165], [137], [37]
CIFAR-10 | [129] | [1] | [10], [22], [53], [146], [69], [113], [173], [165], [137], [176], [138], [97], [189]
CIFAR-100 | [129] | [1] | [10], [106], [138], [97]
ImageNet | [75] | [3] | [177], [53], [130]
ISOLET | [84] | [76] | [106]
STL-10 | [64] | [63] | [106]
Higgs | [20] | Large Hadron Collider | [10]
Wild | [111] | [5] | -
QuocNet | [132] | - | [177]
EMNIST | [65] | [2] | -
Iris | [89] | [88] | -
Caltech-101 | [85] | [86] | [117]
NORB | [134] | [91] | [117]
ImageNet Challenge 2012 | [4] | [167] | [108]
Fashion-MNIST | [163] | [191] | [125], [189]
SVHN | [192] | [153] | [69], [138], [97]
ACAS data set | - | [121] | [121], [50]
CollisionDetection | - | [82] | [50]
PCAMNIST | - | [50] | [50]
TREC Corpus | - | [66] | [36], [73]
UCI Spambase Data Set | [77] | [148] | [73], [94], [47], [189], [150]
Reuters Corpus | [13] | [149] | [165]
Forest Cover Type | [39] | [38] | [165]
IMDB | [144] | [143] | [176]
Wine quality | [68] | [159] | [150]
Wisconsin climatology | - | [156] | [150] (a)

(a) [150] considered the annual number of frozen days for Lake Mendota in Midwest USA from 1900 to 2000. The link provided in their paper is not accessible anymore.

Table 3: Datasets - References.

Name | Supported libraries/interfaces | Purpose | Year | Reference
ConvNetJS | Javascript library, online | Training NN | 2014 | [120]
Tensorflow | APIs (Python, Java, C, GO) | Training, evaluating ML/DL models | 2015 | [99], [98]
ART | Python library | Adversarial learning | 2017 | [114], [154]
scikit-learn | Python library | Machine learning | 2011 | [49], [161]
Keras | Python library | Deep learning | 2015 | [61]
Theano | Python library | Machine learning | 2016 | [9]
MXNet | APIs (C++, Python, Julia, R, Scala, Perl) | Deep learning | 2015 | [155]
Pytorch | Python library | Deep learning | 2017 | [8]
Deep Learning for Java | Java APIs | Deep learning | 2014 | [7]

Table 4: Frameworks for Machine Learning tasks.
