Large-Scale Optimization for Machine Learning
TRANSCRIPT
What is this course about?
• Useful optimization tools for machine learning
• Focus is on the algorithms and the analysis
• This is NOT an ML course: don't expect to learn ML in detail, though some ML examples will be explained in detail
• This is NOT a classical optimization course: we won't cover many classical optimization results, though we do spend a few weeks on the basics
Prerequisites
• Probability and Statistics: expected value, variance, statistical independence, conditional probability, maximum likelihood estimation, regression, etc.
• Linear Algebra and Mathematical Analysis: sets, functions, limits, liminf, limsup, derivative, gradient, subspace, linear dependence, inner product, eigenvalue, singular value, norms, etc.
• Programming skills: matrix/vector operations in Matlab/Python/C++; "for", "while", "repeat-until" loops
What is optimization?
• Decision variable (discrete/continuous)
• Objective/cost function
• Feasible region
• Existence of a solution?
• Checking if a candidate x is optimal?
• Finding the optimal solution numerically
Why do we care?
• Many engineering problems require optimization
• Optimization is key to efficient implementation of ML algorithms
Problem Example: Regression

Area  Crime Rate  Age  RAD  PTRATIO  Bedrooms  …  Price (K)
 600        1.05   12  2.4     10.1         1  …       500
1000        2.34   10  2.5     20.1         1  …       800
1200        1.45    3  3.1      9.7         3  …      1500
1500        1.56   30  1.7      7.2         2  …      1200
   …           …    …    …        …         …  …         …
2700        1.01   20  0.9      4.3         4  …      5000

• Rows are samples/data points; the columns Area through Bedrooms are features/independent variables; Price is the target/dependent variable/label.
• Can we use this dataset (the training data) to predict the price of this house?

1400        2.20    3  3.1      7.6         2  …      ????

Training data → Learning algorithm → Prediction
Learning Algorithms
• Various methods in ML: decision trees, deep learning, Bayes, empirical Bayes, linear regression, logistic regression, …
• Many methods boil down to the same steps:
  • Pick a model
  • Minimize the loss / maximize the likelihood
Algorithm Example: Linear Regression

Area  Crime Rate  Age  RAD  PTRATIO  Bedrooms  …  Price (K)
 600        1.05   12  2.4     10.1         1  …       500
1000        2.34   10  2.5     20.1         1  …       800
1200        1.45    3  3.1      9.7         3  …      1500
1500        1.56   30  1.7      7.2         2  …      1200
   …           …    …    …        …         …  …         …
2700        1.01   20  0.9      4.3         4  …      5000

Can we use this dataset to predict the price of this house?

1400        2.20    3  3.1      7.6         2  …      ????

• Model: linear predictor (offset included)
• Loss: L2 difference
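The recipe above (linear predictor with an offset, L2 loss) can be sketched in code. To keep a closed-form solution, the sketch below uses only one feature (Area); this single-feature restriction is a simplification for illustration, not the full multi-feature model.

```python
# Least-squares sketch: fit price ≈ w * area + b by minimizing the
# L2 loss  sum_i (w * x_i + b - y_i)^2  (offset b included).
# Single-feature simplification of the slide's multi-feature model.

def fit_linear(xs, ys):
    """Closed-form least squares for one feature plus an intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    w = cov / var            # slope
    b = y_mean - w * x_mean  # offset
    return w, b

areas  = [600, 1000, 1200, 1500, 2700]   # feature (Area column)
prices = [500,  800, 1500, 1200, 5000]   # target (Price column)
w, b = fit_linear(areas, prices)
print(w * 1400 + b)  # predicted price for the 1400-sqft query house
```

With all features included, the same idea becomes the normal equations of multivariate least squares, typically solved with a linear-algebra library rather than by hand.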
Another Example: Classification

Radius  Texture  Area  Compactness  Symmetry  …  Rec/non-Rec
   1.1      2.3   3.5          2.4       1.4  …            1
   0.7      1.2   2.5          1.4       3.2  …            0
   1.7      2.4   1.5          3.3       1.3  …            1
     …        …     …            …         …  …            …
   0.2      3.4   0.7          4.3       2.0  …            1
   0.2      2.7   0.9          2.3       1.0  …         ????
Algorithm Example: Logistic Regression

Radius  Texture  Area  Compactness  Symmetry  …  Rec/non-Rec
   1.1      2.3   3.5          2.4       1.4  …            1
   0.7      1.2   2.5          1.4       3.2  …            0
   1.7      2.4   1.5          3.3       1.3  …            1
     …        …     …            …         …  …            …
   0.2      3.4   0.7          4.3       2.0  …            1

• Model: logistic
• Estimator: maximum likelihood
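The logistic model and its maximum-likelihood fit can be sketched with plain gradient ascent on the log-likelihood. The one-feature toy data below (radius only, with made-up values) is an illustrative assumption, not the table above.

```python
import math

# Logistic model: P(y = 1 | x) = sigmoid(w * x + b).
# Fit by maximizing the log-likelihood with plain gradient ascent.
# One-feature toy data, made up for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)  # gradient of the log-likelihood
            gw += err * x
            gb += err
        w += lr * gw
        b += lr * gb
    return w, b

xs = [0.2, 0.5, 0.9, 1.5, 1.9, 2.3]  # radius
ys = [0,   0,   0,   1,   1,   1]    # 1 = recurrence
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 2.0 + b))  # probability of recurrence for radius 2.0
```

Note that plain gradient ascent is one of many possible solvers here; later lectures treat such first-order methods and their analysis in detail.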
Optimization in ML
• Many more examples (K-means, SVM, deep learning, …)
• Efficient algorithms: CPU and memory requirements, parallelizability, robustness, etc.
• Other issues: non-convexity, sparsity, large values of n/d, online implementation, implicit bias, overfitting, etc.
• But first, we need to review a bit of optimization (a targeted review!)
• Even before this, let's review a bit of linear algebra and mathematical analysis
Notations
• Sets: ℝ (real numbers), ℂ (complex numbers)
• Inf and sup:
  • The supremum of a set S is the smallest scalar u such that x ≤ u for all x ∈ S
  • The infimum of a set S is the largest scalar ℓ such that x ≥ ℓ for all x ∈ S
  • Example: inf{sin n : n ≥ 1} = ?
• Functions
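The example inf{sin n : n ≥ 1} has answer −1: the fractional parts of n/(2π) are dense in [0, 1), so sin n comes arbitrarily close to −1, yet never attains it (sin n = −1 would require n = 3π/2 + 2πk, which is irrational). A quick numerical check:

```python
import math

# inf{ sin n : n >= 1 } = -1: sin n gets arbitrarily close to -1
# as n grows, but never reaches it, so the infimum is not a minimum.
m = min(math.sin(n) for n in range(1, 100_000))
print(m)  # very close to, but strictly greater than, -1
```

This is also a handy example of why inf and min differ: the set {sin n : n ≥ 1} has an infimum but no minimum.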
Vectors and Subspaces
• Linear combination
• Subspace and linear independence:
  • A set is called a subspace if it is closed under linear combinations
  • A set of vectors is called linearly independent if no nontrivial linear combination of them equals zero
• Inner product
• Orthogonality
• Cauchy-Schwarz inequality: |⟨x, y⟩| ≤ ‖x‖ ‖y‖
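The Cauchy-Schwarz inequality |⟨x, y⟩| ≤ ‖x‖‖y‖ is easy to check numerically for the standard inner product; equality holds exactly when the vectors are linearly dependent:

```python
import math
import random

# Numeric check of Cauchy-Schwarz: |<x, y>| <= ||x|| * ||y||
# for the standard inner product <x, y> = sum_i x_i * y_i.

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12

# Equality when the vectors are linearly dependent (y = 2x):
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(abs(inner(x, y)), norm(x) * norm(y))  # both ≈ 28
```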
Matrices
• Matrix addition
• Matrix product
• Square matrix
• Inner product: ⟨A, B⟩ = trace(AᵀB)
• Spectral radius: ρ(A) = maxᵢ |λᵢ(A)|
• Eigenvalue decomposition of symmetric matrices: A = UΛUᵀ with U orthogonal
• Positive (semi-)definite matrices
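For a symmetric matrix the spectral radius equals the largest eigenvalue magnitude, which power iteration can estimate. This is a sketch under the assumption that a single eigenvalue dominates in magnitude:

```python
import math

# Power iteration sketch: estimate rho(A) = max_i |lambda_i(A)| for a
# symmetric matrix, assuming one eigenvalue dominates in magnitude.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def spectral_radius(A, iters=200):
    x = [1.0] * len(A)
    for _ in range(iters):
        y = matvec(A, x)
        s = math.sqrt(sum(v * v for v in y))  # normalize each step
        x = [v / s for v in y]
    y = matvec(A, x)
    return abs(sum(xi * yi for xi, yi in zip(x, y)))  # |Rayleigh quotient|

A = [[2.0, 1.0],
     [1.0, 2.0]]          # symmetric; eigenvalues 1 and 3
print(spectral_radius(A))  # ≈ 3
```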
Matrices
• Singular values: σᵢ(A) = √λᵢ(AᵀA)
• Singular value decomposition of A: A = UΣVᵀ
• Norms:
  • Frobenius norm: ‖A‖_F = √(Σᵢ σᵢ²)
  • Nuclear norm: ‖A‖_* = Σᵢ σᵢ
  • Matrix 2-norm: ‖A‖₂ = maxᵢ σᵢ
• Useful inequalities: ‖A‖₂ ≤ ‖A‖_F ≤ ‖A‖_*
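The three norms and the inequalities relating them are easy to check on a matrix whose singular values are known in advance; for a diagonal matrix they are just the absolute values of the diagonal entries, which keeps the sketch dependency-free:

```python
import math

# The three matrix norms, computed from singular values.
# For diag(3, 4) the singular values are simply 3 and 4.
sigmas = [3.0, 4.0]

fro     = math.sqrt(sum(s * s for s in sigmas))  # Frobenius norm
nuclear = sum(sigmas)                            # nuclear norm
two     = max(sigmas)                            # matrix 2-norm

print(two, fro, nuclear)      # 4.0 5.0 7.0
assert two <= fro <= nuclear  # ||A||_2 <= ||A||_F <= ||A||_*
```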