Large-Scale Optimization for Machine Learning
TRANSCRIPT
What is this course about?
• Useful optimization tools for machine learning
• Focus is on the algorithms and the analysis
• This is NOT an ML course: don't expect to learn ML in detail, though some ML examples will be explained in detail
• This is NOT a classical optimization course: we won't cover many classical optimization results, though we do spend a few weeks on the basics
Prerequisites
• Probability and Statistics: expected value, variance, statistical independence, conditional probability, maximum likelihood estimation, regression, etc.
• Linear Algebra and Mathematical Analysis: sets, functions, limits, liminf, limsup, derivative, gradient, subspace, linear dependence, inner product, eigenvalue, singular value, norms, etc.
• Programming skills: matrix/vector operations in Matlab/Python/C++; "for", "while", "repeat-until" loops
What is optimization?
• Decision variable (discrete/continuous)
• Objective/cost function
• Feasible region
• Existence of a solution?
• Checking if a candidate x is optimal?
• Finding the optimal solution numerically
Why do we care?
• Many engineering problems require optimization
• Optimization is key to efficient implementation of ML algorithms
Problem Example: Regression

Area  Crime Rate  Age  RAD  PTRATIO  Bedrooms  …  Price (K)
 600        1.05   12  2.4     10.1         1  …       500
1000        2.34   10  2.5     20.1         1  …       800
1200        1.45    3  3.1      9.7         3  …      1500
1500        1.56   30  1.7      7.2         2  …      1200
   …           …    …    …        …         …  …         …
2700        1.01   20  0.9      4.3         4  …      5000

• Rows are samples/data points; the columns Area through Bedrooms are features/independent variables; Price is the target/dependent variable/label.
• Can we use this dataset (the training data) to predict the price of this house?

1400        2.20    3  3.1      7.6         2  …      ????

Training data → Learning algorithm → Prediction
Learning Algorithms
• Various methods in ML: decision trees, deep learning, Bayes, empirical Bayes, linear regression, logistic regression, …
• Many methods boil down to the same steps:
  • Pick a model
  • Minimize the loss / maximize the likelihood
Algorithm Example: Linear Regression

Area  Crime Rate  Age  RAD  PTRATIO  Bedrooms  …  Price (K)
 600        1.05   12  2.4     10.1         1  …       500
1000        2.34   10  2.5     20.1         1  …       800
1200        1.45    3  3.1      9.7         3  …      1500
1500        1.56   30  1.7      7.2         2  …      1200
   …           …    …    …        …         …  …         …
2700        1.01   20  0.9      4.3         4  …      5000

Can we use this dataset to predict the price of this house?

1400        2.20    3  3.1      7.6         2  …      ????

• Model: linear predictor (offset included)
• Loss: L2 difference
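The recipe above (linear predictor with an offset, L2 loss) can be sketched in code. To keep a closed-form solution, the sketch below uses only one feature (Area); this single-feature restriction is a simplification for illustration, not the full multi-feature model.

```python
# Least-squares sketch: fit price ≈ w * area + b by minimizing the
# L2 loss  sum_i (w * x_i + b - y_i)^2  (offset b included).
# Single-feature simplification of the slide's multi-feature model.

def fit_linear(xs, ys):
    """Closed-form least squares for one feature plus an intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    w = cov / var            # slope
    b = y_mean - w * x_mean  # offset
    return w, b

areas  = [600, 1000, 1200, 1500, 2700]   # feature (Area column)
prices = [500,  800, 1500, 1200, 5000]   # target (Price column)
w, b = fit_linear(areas, prices)
print(w * 1400 + b)  # predicted price for the 1400-sqft query house
```

With all features included, the same idea becomes the normal equations of multivariate least squares, typically solved with a linear-algebra library rather than by hand.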
Another Example: Classification

Radius  Texture  Area  Compactness  Symmetry  …  Rec/non-Rec
   1.1      2.3   3.5          2.4       1.4  …            1
   0.7      1.2   2.5          1.4       3.2  …            0
   1.7      2.4   1.5          3.3       1.3  …            1
     …        …     …            …         …  …            …
   0.2      3.4   0.7          4.3       2.0  …            1
   0.2      2.7   0.9          2.3       1.0  …         ????
Algorithm Example: Logistic Regression

Radius  Texture  Area  Compactness  Symmetry  …  Rec/non-Rec
   1.1      2.3   3.5          2.4       1.4  …            1
   0.7      1.2   2.5          1.4       3.2  …            0
   1.7      2.4   1.5          3.3       1.3  …            1
     …        …     …            …         …  …            …
   0.2      3.4   0.7          4.3       2.0  …            1

• Model: logistic
• Estimator: maximum likelihood
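The logistic model and its maximum-likelihood fit can be sketched with plain gradient ascent on the log-likelihood. The one-feature toy data below (radius only, with made-up values) is an illustrative assumption, not the table above.

```python
import math

# Logistic model: P(y = 1 | x) = sigmoid(w * x + b).
# Fit by maximizing the log-likelihood with plain gradient ascent.
# One-feature toy data, made up for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)  # gradient of the log-likelihood
            gw += err * x
            gb += err
        w += lr * gw
        b += lr * gb
    return w, b

xs = [0.2, 0.5, 0.9, 1.5, 1.9, 2.3]  # radius
ys = [0,   0,   0,   1,   1,   1]    # 1 = recurrence
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 2.0 + b))  # probability of recurrence for radius 2.0
```

Note that plain gradient ascent is one of many possible solvers here; later lectures treat such first-order methods and their analysis in detail.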
Optimization in ML
• Many more examples (K-means, SVM, deep learning, …)
• Efficient algorithms: CPU and memory requirements, parallelizability, robustness, etc.
• Other issues: non-convexity, sparsity, large values of n/d, online implementation, implicit bias, overfitting, etc.
• But first, we need to review a bit of optimization (a targeted review!)
• Even before this, let's review a bit of linear algebra and mathematical analysis
Notations
• Sets: ℝ (real numbers), ℂ (complex numbers)
• Inf and sup:
  • The supremum of a set S is the smallest scalar u such that x ≤ u for all x ∈ S
  • The infimum of a set S is the largest scalar ℓ such that x ≥ ℓ for all x ∈ S
  • Example: inf{sin n : n ≥ 1} = ?
• Functions
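The example inf{sin n : n ≥ 1} has answer −1: the fractional parts of n/(2π) are dense in [0, 1), so sin n comes arbitrarily close to −1, yet never attains it (sin n = −1 would require n = 3π/2 + 2πk, which is irrational). A quick numerical check:

```python
import math

# inf{ sin n : n >= 1 } = -1: sin n gets arbitrarily close to -1
# as n grows, but never reaches it, so the infimum is not a minimum.
m = min(math.sin(n) for n in range(1, 100_000))
print(m)  # very close to, but strictly greater than, -1
```

This is also a handy example of why inf and min differ: the set {sin n : n ≥ 1} has an infimum but no minimum.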
Vectors and Subspaces
• Linear combination
• Subspace and linear independence:
  • A set is called a subspace if it is closed under linear combinations
  • A set of vectors is called linearly independent if no nontrivial linear combination of them equals zero
• Inner product
• Orthogonality
• Cauchy-Schwarz inequality: |⟨x, y⟩| ≤ ‖x‖ ‖y‖
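The Cauchy-Schwarz inequality |⟨x, y⟩| ≤ ‖x‖‖y‖ is easy to check numerically for the standard inner product; equality holds exactly when the vectors are linearly dependent:

```python
import math
import random

# Numeric check of Cauchy-Schwarz: |<x, y>| <= ||x|| * ||y||
# for the standard inner product <x, y> = sum_i x_i * y_i.

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12

# Equality when the vectors are linearly dependent (y = 2x):
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(abs(inner(x, y)), norm(x) * norm(y))  # both ≈ 28
```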
Matrices
• Matrix addition
• Matrix product
• Square matrix
• Inner product: ⟨A, B⟩ = trace(AᵀB)
• Spectral radius: ρ(A) = maxᵢ |λᵢ(A)|
• Eigenvalue decomposition of symmetric matrices: A = UΛUᵀ with U orthogonal
• Positive (semi-)definite matrices
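For a symmetric matrix the spectral radius equals the largest eigenvalue magnitude, which power iteration can estimate. This is a sketch under the assumption that a single eigenvalue dominates in magnitude:

```python
import math

# Power iteration sketch: estimate rho(A) = max_i |lambda_i(A)| for a
# symmetric matrix, assuming one eigenvalue dominates in magnitude.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def spectral_radius(A, iters=200):
    x = [1.0] * len(A)
    for _ in range(iters):
        y = matvec(A, x)
        s = math.sqrt(sum(v * v for v in y))  # normalize each step
        x = [v / s for v in y]
    y = matvec(A, x)
    return abs(sum(xi * yi for xi, yi in zip(x, y)))  # |Rayleigh quotient|

A = [[2.0, 1.0],
     [1.0, 2.0]]          # symmetric; eigenvalues 1 and 3
print(spectral_radius(A))  # ≈ 3
```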
Matrices
• Singular values: σᵢ(A) = √λᵢ(AᵀA)
• Singular value decomposition of A: A = UΣVᵀ
• Norms:
  • Frobenius norm: ‖A‖_F = √(Σᵢ σᵢ²)
  • Nuclear norm: ‖A‖_* = Σᵢ σᵢ
  • Matrix 2-norm: ‖A‖₂ = maxᵢ σᵢ
• Useful inequalities: ‖A‖₂ ≤ ‖A‖_F ≤ ‖A‖_*
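The three norms and the inequalities relating them are easy to check on a matrix whose singular values are known in advance; for a diagonal matrix they are just the absolute values of the diagonal entries, which keeps the sketch dependency-free:

```python
import math

# The three matrix norms, computed from singular values.
# For diag(3, 4) the singular values are simply 3 and 4.
sigmas = [3.0, 4.0]

fro     = math.sqrt(sum(s * s for s in sigmas))  # Frobenius norm
nuclear = sum(sigmas)                            # nuclear norm
two     = max(sigmas)                            # matrix 2-norm

print(two, fro, nuclear)      # 4.0 5.0 7.0
assert two <= fro <= nuclear  # ||A||_2 <= ||A||_F <= ||A||_*
```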