CS 229br: Advanced Topics in the theory of machine learning
Boaz Barak
Join the Slack team and sign in on your devices!
Ankur Moitra: MIT 18.408
Yamini Bansal: Official TF
Dimitris Kalimeris: Unofficial TF
Gal Kaplun: Unofficial TF
Preetum Nakkiran: Unofficial TF
[Figure: pipeline from the world to data to a learned model $f \in \mathcal{F}$ and its benchmark score (0-100); labels: Optimization, Generalization, OOD / domain shift / robustness / fairness, …]
God's view:
$f = \operatorname{argmax}_f \mathbb{E}[\mathrm{perf}(f) \mid \mathrm{data}]$, i.e.,
$f = \operatorname{argmax}_f \mathbb{E}_{w \sim \mathrm{World} \mid \mathrm{data}}[\mathrm{perf}(f, w)]$
Perfect A, Perfect B → Approx A, Approx B
This seminar
• Taste of research results, questions, experiments, and more
• Goal: Get to state-of-the-art research:
  • Background and language
  • Reading and reproducing papers
  • Trying out extensions
• Very experimental and "rough around the edges"
• A lot of learning on your own and from each other
• Hope: Very interactive, in lectures and on Slack
CS 229br: Survey of very recent research directions, emphasis on experiments
MIT 18.408: Deeper coverage of foundations, emphasis on theory
Student expectations: Not set in stone, but will include:
• Pre-reading before lectures
• Scribe notes / blog posts (adding proofs and details)
• Applied problem sets (replicating papers or mini versions thereof, exploring)
  Note: Lectures will not teach practical skills; we rely on students to pick them up using suggested tutorials, other resources, and each other. Unofficial TFs are happy to answer questions!
• Some theoretical problem sets
• Might have you grade each other's work
• Projects: self-chosen and directed
Grading: We'll figure out some grade; hope that's not your loss function ☺
HW0 on Slack
Rest of today's lecture:
Blitz through classical learning theory
• Special cases, mostly one dimension
• "Proofs by picture"
Convexity
[Figure: a convex function $f$ with points $x$ and $y$ marked]
Three equivalent views of convexity of $f$:
1. The line between $(x, f(x))$ and $(y, f(y))$ lies above $f$.
2. The tangent line at $x$ (slope $f'(x)$) lies below $f$.
3. $f''(x) > 0$ for all $x$.

CLAIM: $f''(x) < 0$ implies the tangent line at $x$ lies above $f$.
Proof: $f(x + \delta) = f(x) + \delta f'(x) + \frac{\delta^2}{2} f''(x) + O(\delta^3)$, so $f$ lies below the tangent line for small $\delta$.
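A quick numerical sanity check of views 1 and 2, a sketch assuming numpy and using $f(x) = x^2$ as the convex function:

```python
import numpy as np

f = lambda z: z**2        # convex: f''(z) = 2 > 0
fprime = lambda z: 2 * z

x, y = -1.0, 2.0
ts = np.linspace(0.0, 1.0, 101)

# View 1: the chord between (x, f(x)) and (y, f(y)) lies above f.
chord = (1 - ts) * f(x) + ts * f(y)
assert np.all(chord >= f((1 - ts) * x + ts * y))

# View 2: the tangent line at x lies below f everywhere.
zs = np.linspace(-3.0, 3.0, 201)
tangent = f(x) + fprime(x) * (zs - x)
assert np.all(tangent <= f(zs))

print("chord above f, tangent below f: OK")
```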
CLAIM: If $f$ goes above the line between $(x, f(x))$ and $(y, f(y))$, then there exists $z$ with the tangent line at $z$ above $f$.
Proof by picture.
Gradient Descent
[Figure: one step of gradient descent from $x_t$ to $x_{t+1}$ on $f$]

$x_{t+1} = x_t - \eta f'(x_t)$

$f(x_t + \delta) \approx f(x_t) + \delta f'(x_t) + \frac{\delta^2}{2} f''(x_t)$

$f(x_{t+1}) \approx f(x_t) - \eta f'(x_t)^2 + \frac{\eta^2 f'(x_t)^2}{2} f''(x_t) = f(x_t) - \eta f'(x_t)^2 \left(1 - \frac{\eta f''(x_t)}{2}\right)$

• If $\eta < 2/f''(x_t)$ then we make progress.
• If $\eta \sim 1/f''(x_t)$ then the loss drops by $\sim \eta f'(x_t)^2$.

Dimension $d$: $f'(x) \to \nabla f(x) \in \mathbb{R}^d$ and $f''(x) \to H_f(x) = \nabla^2 f(x) \in \mathbb{R}^{d \times d}$ (p.s.d.); with $\eta \sim 1/\lambda_{\max}$ the loss drops by $\sim \frac{1}{\lambda_{\max}} \|\nabla f(x_t)\|^2$.
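A minimal sketch of the step-size thresholds on a quadratic, where $f''$ is the constant $c$ (the function and constants are illustrative assumptions):

```python
import numpy as np

c = 4.0                            # curvature: f''(x) = c everywhere
f = lambda x: 0.5 * c * x**2
fprime = lambda x: c * x

def gd(eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * fprime(x)    # x_{t+1} = x_t - eta * f'(x_t)
    return f(x)

print(gd(1.0 / c))   # eta ~ 1/f'': fast progress
print(gd(1.9 / c))   # eta < 2/f'': still converges (oscillating)
print(gd(2.1 / c))   # eta > 2/f'': diverges
```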
Stochastic Gradient Descent
[Figure: one step of stochastic gradient descent from $x_t$ to $x_{t+1}$ on $f$]

$x_{t+1} = x_t - \eta \tilde{f}'(x_t)$

Assume $\tilde{f}'(x) = f'(x) + \xi$ with $\xi$ independent across steps, mean 0, variance $\sigma^2$; i.e. $\mathbb{E}[\tilde{f}'(x)] = f'(x)$ and $\mathrm{Var}[\tilde{f}'(x)] = \sigma^2$.

$f(x_t + \delta) \approx f(x_t) + \delta f'(x_t) + \frac{\delta^2}{2} f''(x_t)$

$\mathbb{E}\, f(x_{t+1}) \approx f(x_t) - \eta f'(x_t)^2 \left(1 - \frac{\eta f''(x_t)}{2}\right) + \frac{\eta^2 \sigma^2 f''(x_t)}{2}$

• If $\eta < 2/f''(x_t)$ and (*) $\eta \sigma^2 \ll f'(x_t)^2 / f''(x_t)$ then we make progress.
• If $\eta \sim 1/f''(x_t)$ and (*) holds then the loss drops by $\sim \eta f'(x_t)^2$.

In machine learning: $f(x) = \frac{1}{n} \sum_{i=1}^n L_i(x)$ and $\tilde{f}'(x_t) = L_i'(x_t)$ for $i \sim [n]$.
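A sketch of this finite-sum setting: $L_i'(x)$ for a uniform $i \sim [n]$ is an unbiased estimate of $f'(x)$ (the 1-D least-squares data here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)                     # toy inputs
y = 3.0 * a + 0.1 * rng.normal(size=n)     # toy targets; minimizer is near x = 3

# f(x) = (1/n) * sum_i L_i(x), with L_i(x) = (a_i * x - y_i)^2
grad_i = lambda x, i: 2 * a[i] * (a[i] * x - y[i])   # L_i'(x)

x, eta = 0.0, 0.02
for t in range(5000):
    i = rng.integers(n)                    # i ~ [n]
    x -= eta * grad_i(x, i)                # x_{t+1} = x_t - eta * L_i'(x_t)

print(x)   # hovers around 3, up to a noise ball controlled by eta
```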
Supervised learning

Distribution $(X, Y)$ over $\mathcal{X} \times \{\pm 1\}$.
A learning algorithm $A$ gets data $S = (x_i, y_i)_{i=1..n}$ and outputs $f \in \mathcal{F}$.

$L(f) = \Pr[f(x) \neq y]$ (population 0-1 loss);  $\hat{L}_S(f) = \frac{1}{n} \sum_{i=1}^n 1[f(x_i) \neq y_i]$ (empirical 0-1 loss)

Generalization gap: $L(f) - \hat{L}_S(f)$ for $f = A(S)$.

Empirical Risk Minimization (ERM): $A(S) = \operatorname{argmin}_{f \in \mathcal{F}} \hat{L}_S(f)$
Bias-Variance Tradeoff

Same setup (ERM over $\mathcal{F}$). Assume $\mathcal{F}_K = \{f_1, \ldots, f_K\}$ and $\hat{L}(f_i) = L(f_i) + N(0, 1/n)$.

[Figure: loss vs. $K$ (log scale); bias decreases with $K$ while the variance term grows roughly like $\sqrt{\log K / n}$]

Can prove: $\mathrm{GAP} \le \sqrt{\frac{\log |\mathcal{F}|}{n}}$
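A sketch simulating this finite-class picture directly: draw population losses for $K$ hypotheses, draw empirical losses on $n$ samples, run ERM, and compare the observed gap with $\sqrt{\log K / n}$ (all distributions here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 2**10

L = rng.uniform(0.1, 0.5, size=K)        # population 0-1 losses L(f_1..f_K)
L_hat = rng.binomial(n, L) / n           # empirical losses: L(f_i) + noise of std ~ 1/sqrt(n)

j = np.argmin(L_hat)                     # ERM: pick the empirically best hypothesis
print("gap:  ", L[j] - L_hat[j])         # positive: ERM's winner looks better than it is
print("bound:", np.sqrt(np.log(K) / n))  # sqrt(log|F| / n)
```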
Generalization bounds

Can prove: $\mathrm{GAP} \le \sqrt{\frac{\log |\mathcal{F}|}{n}}$, and $\log |\mathcal{F}|$ can be replaced with a dimensionality / capacity measure $C$ of $\mathcal{F}$:

VC dimension: max $n$ s.t. $\mathcal{F}$ can fit every labeling $y \in \{\pm 1\}^n$ on some samples $x_1, \ldots, x_n$.

Rademacher complexity: max $n$ s.t. $\mathcal{F}$ can fit random labels $y \sim \{\pm 1\}^n$ on random $x_1, \ldots, x_n \sim \mathcal{D}$.

PAC-Bayes: $C = I(A(S); S)$ where $S$ is the training set.

Margin bounds: max $n$ s.t. the linear classifier satisfies $\langle w, x_i \rangle y_i > \|w\| \cdot \|x_i\| / \sqrt{n}$ for correct predictions.

General form: If $C(f) \ll n$ then $\hat{L}(f) - L(f) \approx 0$.
Challenge: For many modern deep nets and natural $C$, $C(f) \gg n$.
Intuition: Can't "overfit": if you do well on $n$ samples, you must do well on the population.
Hope: Find a better $C$?
[Zhang-Bengio-Hardt-Recht-Vinyals, "Understanding deep learning requires rethinking generalization", ICLR 2017]
Performance on CIFAR-10 and on noisy CIFAR-10:*

Classifier       CIFAR-10 (Train / Test)   Noisy CIFAR-10 (Train / Test)
Wide ResNet^1    100% / 92%                100% / 10%
ResNet-18^2      100% / 87%                100% / 10%
Myrtle^3         100% / 86%                100% / 10%

^1 Zagoruyko-Komodakis '16   ^2 He-Zhang-Ren-Sun '15   ^3 Page '19
"Double descent"
Belkin, Hsu, Ma, Mandal, PNAS 2019.
Nakkiran-Kaplun-Bansal-Yang-B-Sutskever, ICLR 2020
Example:
Data: $(x_i, f^*(x_i) + \xi)$ where $f^*$ is a degree-$d^*$ polynomial.
Algorithm: SGD to minimize $\sum_i (f(x_i) - y_i)^2$ over $f \in \mathcal{F}_d$ = degree-$d$ polynomials.

[Cartoon: train and test loss/error vs. degree $d = C(\mathcal{F}_d)$, with $d^*$ marked; test error dips, peaks near the interpolation threshold, then descends again]
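A sketch of the cartoon above, with minimum-norm least squares standing in for SGD (for linear regression, SGD from zero initialization converges to the min-norm interpolating solution); the data, noise level, and degrees are assumptions, and the curve is noisy for a single seed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_star = 20, 3
w_star = rng.normal(size=d_star + 1)          # true degree-3 polynomial

def feats(x, d):                              # features 1, x, ..., x^d
    return np.vander(x, d + 1, increasing=True)

x_tr = rng.uniform(-1, 1, size=n)
x_te = rng.uniform(-1, 1, size=2000)
y_tr = feats(x_tr, d_star) @ w_star + 0.1 * rng.normal(size=n)
y_te = feats(x_te, d_star) @ w_star

for d in [1, 3, 5, 10, 15, 19, 25, 50, 200]:
    # lstsq returns the minimum-norm solution once the system is underdetermined (d + 1 > n)
    w, *_ = np.linalg.lstsq(feats(x_tr, d), y_tr, rcond=None)
    test_mse = np.mean((feats(x_te, d) @ w - y_te) ** 2)
    print(f"degree {d:3d}: test MSE {test_mse:.4f}")
# Typical shape: error drops until d ~ d*, blows up near the interpolation
# threshold d ~ n, then comes back down for d >> n ("double descent").
```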
Fourier Transform

Every continuous $f: [0,1] \to \mathbb{R}$ can be arbitrarily well approximated (i.e., $\int_0^1 |f - g| < \epsilon$) by $g$ of the form
$g(x) = \sum_j \alpha_j e^{2\pi i \beta_j x}$

Tasks become simpler after the transformation $x \mapsto \varphi(x)$ with $\varphi_j(x) = e^{2\pi i \beta_j x}$, an embedding $\varphi: \mathbb{R} \to \mathbb{C}^d$:
for some natural data the representation $\varphi$ is "nice" (e.g. sparse, linearly separable, …), i.e. $f \approx$ linear in $\varphi_1(x), \ldots, \varphi_d(x)$.

Example feature map in two dimensions: $\varphi(x, y) = (xy, x^2, y^2)$.
(Figure: Xavier Bourret Sicotte, https://xavierbourretsicotte.github.io/Kernel_feature_map.html)
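A sketch of "tasks become simpler after $x \mapsto \varphi(x)$": a square wave is far from linear in $x$, but is nearly linear in sin/cos Fourier features (the real form of $e^{2\pi i j x}$); the target and number of features are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
f_star = lambda x: np.sign(np.sin(2 * np.pi * 3 * x))   # square wave: not linear in x

def phi(x, m=10):
    # phi_j(x) = (cos(2 pi j x), sin(2 pi j x)) for j = 0..m-1
    jx = 2 * np.pi * np.outer(x, np.arange(m))
    return np.concatenate([np.cos(jx), np.sin(jx)], axis=1)

x_tr = rng.uniform(0, 1, 500)
w, *_ = np.linalg.lstsq(phi(x_tr), f_star(x_tr), rcond=None)   # linear fit in phi-space

x_te = rng.uniform(0, 1, 500)
acc = np.mean(np.sign(phi(x_te) @ w) == f_star(x_te))
print(acc)   # close to 1: the task is (nearly) linearly solvable after the embedding
```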
More approximation

Every continuous $f: [0,1] \to \mathbb{R}$ can be arbitrarily well approximated by $g$ of the form
$g(x) = \sum_j \alpha_j\, \mathrm{ReLU}(\beta_j x + \gamma_j)$

Proof by picture: $\mathrm{ReLU}(x) = \max(x, 0)$.
Let $g_a(x) = \mathrm{ReLU}((x - a)/\delta)$ and $g_{a+\delta}(x) = \mathrm{ReLU}((x - a - \delta)/\delta)$.
Then $h_a = g_a - g_{a+\delta}$ is a step rising from 0 to 1 over $[a, a + \delta]$,
and $1_{[a,b]} = h_a - h_b$ is a bump on $[a, b]$; summing scaled bumps approximates $f$.
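A sketch of the proof by picture, assuming numpy and an arbitrary continuous target: differences of ReLU ramps make steps $h_a$, differences of steps make bumps $1_{[a,b]}$, and a sum of scaled bumps approximates $f$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
delta = 0.01

def step(x, a):                 # h_a: 0 for x <= a, 1 for x >= a + delta (two ReLUs)
    return relu((x - a) / delta) - relu((x - a - delta) / delta)

def bump(x, a, b):              # ~indicator of [a, b] (four ReLUs)
    return step(x, a) - step(x, b)

f = lambda x: np.sin(2 * np.pi * x)      # assumed continuous target on [0, 1]

x = np.linspace(0.0, 1.0, 2001)
g = sum(f(a) * bump(x, a, a + delta) for a in np.arange(0.0, 1.0, delta))
print(np.max(np.abs(f(x) - g)))  # O(delta): shrinks as the grid gets finer
```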
Higher dimension

$g(x) = \max(\langle \alpha, x \rangle + \beta, 0)$: $d + 1$ parameters.
[Figure: e.g. $\alpha = (1, 0)$ gives a ridge $g(x, y)$ that depends only on $x$]
How many ReLUs?
Every $f: [0,1]^d \to \mathbb{R}$ can be approximated as a sum of $g_{\alpha,\beta}(x) = \max(\langle \alpha, x \rangle + \beta, 0)$.
Can discretize $\mathbb{R}^{d+1}$ to $O(1)^d$ points ⇒ need at most $O(1)^d = 2^{O(d)}$ ReLUs.
(For formal proofs, see Cybenko '89, Hornik '91, Barron '93; MIT 18.408.)
THM: For some $f$'s, $2^{\Omega(d)}$ ReLUs are necessary.

[Figure: the cube $[0,1]^d$ with the grid $\{1/4, 3/4\}^d$ marked]

Random $f: [0,1]^d \to \mathbb{R}$: choose $f(x) \in \{0, 1\}$ independently at random for $x \in \{1/4, 3/4\}^d$, and extend linearly.

LEMMA: For every fixed $g: [0,1]^d \to \mathbb{R}$, $\Pr[\int |f - g| < 0.1] \le 2^{-\Omega(2^d)}$.
LEMMA ⇒ THEOREM: Need $2^{\Omega(2^d)}$ distinct $g$'s ⇒ need $\approx 2^d$ distinct ReLUs.

PROOF OF LEMMA:
For $x \in \{1/4, 3/4\}^d$ let $Z_x = 1$ iff $\int |f(y) - g(y)|\, dy > 0.2$ over $y \in x + [-1/4, +1/4]^d$.
Each $Z_x$ is independent with expectation at least $1/2$,
⇒ $\Pr\big[\tfrac{1}{2^d} \sum_x Z_x < 1/4\big] \le 2^{-0.1 \cdot 2^d}$,
and $\tfrac{1}{2^d} \sum_x Z_x \ge 1/4$ ⇒ $\int |f - g| \ge 1/20$.
Bottom line

For every* function $f: \mathbb{R}^d \to \mathbb{R}$ there are many choices of $\varphi: \mathbb{R}^d \to \mathbb{R}^D$ s.t. $f \approx \mathrm{Linear} \circ \varphi$.
(Or $\varphi$'s such that $f \approx \sigma \circ \mathrm{Linear} \circ \varphi$ for some non-linear $\sigma: \mathbb{R} \to \mathbb{R}$.)

Want to find $\varphi$ that is useful for many interesting functions $f$. Useful:
• $\varphi$ is not too big
• Efficiently compute $\varphi(x)$ or $\langle \varphi(x), \varphi(y) \rangle$
• $\langle \varphi(x), \varphi(y) \rangle$ large ⟺ $x$ and $y$ are "semantically similar"
• For interesting $f$'s, the coefficients of $\mathrm{Linear}$ are structured.
• …
Neural Net vs. Kernel

[Figure: spectrum of feature maps $\varphi$, from "Definitely NN" to "Definitely Kernel":]
• $\varphi$ trained end-to-end for the task at hand
• $\varphi$ pretrained with the same data
• $\varphi$ pretrained on different data
• $\varphi$ pretrained on synthetic data
• $\varphi$ hand-crafted based on neural architectures
• $\varphi$ hand-crafted before the 1990s
Kernel methods (intuitively)

Distance measure $K(x, x')$.
Input: $(x_1, y_1), \ldots, (x_n, y_n)$ s.t. $f^*(x_i) \approx y_i$.
To approximate $f^*(x)$: output $y_i$ for the $x_i$ closest to $x$,
or output $\sum_i \alpha_i y_i$ with $\alpha_i$ depending on $K(x, x_i)$.
Hilbert space

Linear space: $v + u$, $c \cdot v$.
Dot product: $\langle u, v + c \cdot w \rangle = \langle u, v \rangle + c \langle u, w \rangle$, $\langle u, v \rangle = \langle v, u \rangle$, $\langle v, v \rangle \ge 0$.
Can solve linear equations of the form $\{\langle v_i, x \rangle = c_i\}$ knowing only the $\langle v_i, v_j \rangle$.
Can also do least-squares minimization $\min_x \sum_i (\langle v_i, x \rangle - c_i)^2$ knowing only the $\langle v_i, v_j \rangle$.

DEF: A symmetric matrix $K \in \mathbb{R}^{n \times n}$ is p.s.d. if $v^\top K v \ge 0$ for all $v \in \mathbb{R}^n$; equivalently $\lambda_1(K), \ldots, \lambda_n(K) \ge 0$.
CLAIM: $K$ is p.s.d. iff* $u^\top K v$ is an inner product.
PROOF (⇒): $u^\top K v = \langle \varphi(u), \varphi(v) \rangle$ where $\varphi(u_1, \ldots, u_n) = (\sqrt{\lambda_1} u_1, \ldots, \sqrt{\lambda_n} u_n)$, expressed in the eigenbasis.
Kernel methods

Goal: Solve linear equations, least-squares minimization, etc. under $\varphi$.
Observation: It is enough to compute $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$.
Let $\hat{x} = \varphi(x)$. Given $(\hat{x}_i, y_i)_{i=1..n}$ we want $\hat{w} \in \mathbb{R}^D$ s.t. $\langle \hat{x}_i, \hat{w} \rangle \approx y_i$ for all $i$;
we can compute $\hat{w} = \sum_i \alpha_i \hat{x}_i$ using only the $K(x_i, x_j)$.
To predict on a new $x$ using $\hat{w}$, compute $\langle \hat{x}, \hat{w} \rangle = \sum_i \alpha_i K(x, x_i)$.
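A minimal sketch of this observation, assuming a Gaussian kernel and a small ridge term for numerical stability: fit $\hat{w} = \sum_i \alpha_i \hat{x}_i$ using only the Gram matrix, and predict via $\langle \hat{w}, \hat{x} \rangle = \sum_i \alpha_i K(x_i, x)$:

```python
import numpy as np

rng = np.random.default_rng(4)

def K(X, Z, gamma=10.0):
    # Gaussian (RBF) kernel K(x, z) = exp(-gamma * (x - z)^2) = <phi(x), phi(z)>
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

f_star = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, 100)
y_tr = f_star(x_tr) + 0.1 * rng.normal(size=100)

# Least squares in feature space via the Gram matrix: (K + lam*I) alpha = y
lam = 1e-3
alpha = np.linalg.solve(K(x_tr, x_tr) + lam * np.eye(len(x_tr)), y_tr)

x_te = np.linspace(0, 1, 200)
y_pred = K(x_te, x_tr) @ alpha           # <w_hat, phi(x)> = sum_i alpha_i K(x_i, x)
print(np.mean((y_pred - f_star(x_te)) ** 2))   # small test MSE
```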