
CS 229br: Advanced Topics in the theory of machine learning

Boaz Barak

Join the Slack team and sign in on your devices!

Ankur Moitra (MIT 18.408)

Yamini Bansal (Official TF)

Dimitris Kalimeris (Unofficial TF)

Gal Kaplun (Unofficial TF)

Preetum Nakkiran (Unofficial TF)

What is learning?

[Slide diagram: learning a function 𝑓 from the world.]

What is deep learning?

[Slide diagram: data from the world, a function class β„±, a loss β„’: β„± β†’ ℝ, an optimizer OPT producing 𝑓 ∈ β„±, and a benchmark score from 0 to 100. Surrounding topics: Optimization, Generalization, OOD / domain shift, robustness, fairness, …]

God’s view:

𝑓 = argmax_𝑓 𝔼[ (𝑓) ∣ data ]

𝑓 = argmax_𝑓 𝔼[ (𝑓 ↔ 𝑀) ],  𝑀 ∼ π‘Š ∣ data

Perfect A Perfect B β‰  Approx A Approx B

This seminar

β€’ Taste of research results, questions, experiments, and more

β€’ Goal: Get to state-of-the-art research:

β€’ Background and language

β€’ Reading and reproducing papers

β€’ Trying out extensions

β€’ Very experimental and β€œrough around the edges”

β€’ A lot of learning on your own and from each other

β€’ Hope: Very interactive – in lectures and on Slack

CS 229br: Survey of very recent research directions, emphasis on experiments

MIT 18.408: Deeper coverage of foundations, emphasis on theory

Student expectations

Not set in stone, but will include:

β€’ Pre-reading before lectures

β€’ Scribe notes / blog posts (adding proofs and details)

β€’ Applied problem sets (replicating papers or mini versions thereof, exploring)

Note: Lectures will not teach practical skills – we rely on students to pick them up using suggested tutorials, other resources, and each other.

Unofficial TFs happy to answer questions!

β€’ Some theoretical problem sets.

β€’ Might have you grade each other’s work

β€’ Projects – self chosen and directed.

Grading: We’ll figure out some grade – hope that’s not your loss function ☺

HW0 on Slack

Rest of today’s lecture:

Blitz through classical learning theory

β€’ Special cases, mostly one dimension

β€’ β€œProofs by picture”

Part I: Convexity & Optimization

Convexity

π‘₯ 𝑦

𝑓1. Line between x, 𝑓 π‘₯ and (𝑦, 𝑓 𝑦 ) above 𝑓.

2. Tangent line at π‘₯ (slope 𝑓′(π‘₯) ) below 𝑓.

3. 𝑓′′ π‘₯ > 0 for all π‘₯

CLAIM: 𝑓′′ π‘₯ < 0 implies tangent line at π‘₯ above 𝑓

Proof: 𝑓 π‘₯ + 𝛿 = 𝑓 π‘₯ + 𝛿𝑓′ π‘₯ + 𝛿2𝑓′′ π‘₯ + 𝑂(𝛿3) implies 𝑓 below line


CLAIM: If 𝑓 lies above the line between (π‘₯, 𝑓(π‘₯)) and (𝑦, 𝑓(𝑦)), then there exists 𝑧 with the tangent line at 𝑧 above 𝑓.

Proof by picture:

Gradient Descent

π‘₯𝑑 𝑦

𝑓

π‘₯𝑑+1

π‘₯𝑑+1 = π‘₯𝑑 βˆ’ πœ‚π‘“β€²(π‘₯𝑑)

𝑓 π‘₯𝑑 + 𝛿 β‰ˆ 𝑓 π‘₯𝑑 + 𝛿𝑓′ π‘₯𝑑 +𝛿2

2𝑓′′ π‘₯𝑑

𝑓 π‘₯𝑑+1 β‰ˆ 𝑓 π‘₯𝑑 βˆ’ πœ‚π‘“β€² π‘₯𝑑2 +

πœ‚2𝑓′ π‘₯𝑑2

2𝑓′′ π‘₯𝑑

β€’ If πœ‚ < 2/𝑓′′(π‘₯𝑑) then make progress

β€’ If πœ‚ ∼ 2/𝑓′′(π‘₯𝑑) then drop by ∼ πœ‚π‘“β€² π‘₯𝑑2

Dimension 𝑑:

𝑓′ π‘₯ β†’ βˆ‡π‘“ π‘₯ ∈ ℝ𝑑

𝑓′′ π‘₯ β†’ 𝐻𝑓 x = βˆ‡2𝑓 π‘₯ ∈ ℝ𝑑×𝑑 (psd)

πœ‚ ∼ 1/πœ†π‘‘ drop by βˆΌπœ†1

πœ†π‘‘βˆ‡ 2

= 𝑓 π‘₯𝑑 βˆ’ πœ‚π‘“β€² π‘₯𝑑2(1 βˆ’

πœ‚π‘“β€²β€² π‘₯𝑑

2)
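A minimal numerical sketch of this update rule (my own toy objective 𝑓(π‘₯) = π‘₯Β², chosen only to illustrate the πœ‚ < 2/𝑓′′ threshold):

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Iterate x_{t+1} = x_t - eta * f'(x_t) and return the trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return np.array(xs)

# Toy objective f(x) = x^2, so f'(x) = 2x and f''(x) = 2 everywhere.
grad = lambda x: 2 * x
for eta in [0.1, 0.9, 1.1]:   # threshold is 2/f'' = 1: below it we converge, above it we diverge
    xs = gradient_descent(grad, x0=1.0, eta=eta, steps=20)
    print(f"eta={eta}: final x = {xs[-1]:.3g}")
```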

Stochastic Gradient Descent

π‘₯𝑑 𝑦

𝑓

π‘₯𝑑+1

π‘₯𝑑+1 = π‘₯𝑑 βˆ’ πœ‚π‘“β€²(π‘₯𝑑)

𝑓 π‘₯𝑑 + 𝛿 β‰ˆ 𝑓 π‘₯𝑑 + 𝛿𝑓′ π‘₯𝑑 +𝛿2

2𝑓′′ π‘₯𝑑

𝑓 π‘₯𝑑+1 β‰ˆ 𝑓 π‘₯𝑑 βˆ’ πœ‚π‘“β€² π‘₯𝑑2 1 βˆ’

πœ‚π‘“β€²β€² π‘₯𝑑

2+ πœ‚2𝜎2𝑓′′(π‘₯𝑑)

β€’ If πœ‚ < 2/𝑓′′(π‘₯𝑑) and (*) πœ‚πœŽ2 β‰ͺ 𝑓′ π‘₯𝑑2/𝑓′′(π‘₯) then make progress

β€’ If πœ‚ ∼ 2/𝑓′′(π‘₯𝑑) and (*) then drop by ∼ πœ‚π‘“β€² π‘₯𝑑2

𝔼 𝑓′ π‘₯ = 𝑓′(π‘₯𝑑), 𝑉 𝑓′ π‘₯ = 𝜎2

Assume 𝑓′ π‘₯ = 𝑓′ π‘₯ + 𝑁

Mean 0Variance 𝜎2

Independent

In Machine Learning: 𝑓(π‘₯) = (1/𝑛) Ξ£_{𝑖=1}^𝑛 𝐿_𝑖(π‘₯), and the stochastic gradient is 𝐿_𝑖′(π‘₯) for a random 𝑖 ∼ [𝑛].

Part II: Generalization

Distribution (𝑋, π‘Œ) over 𝒳 Γ— {Β±1}

Supervised learning: a learning algorithm 𝐴 takes data 𝑆 = {(π‘₯_𝑖, 𝑦_𝑖)}_{𝑖=1..𝑛} and outputs 𝑓 ∈ β„±.

Population 0-1 loss: β„’(𝑓) = Pr[𝑓(𝑋) β‰  π‘Œ].   Empirical 0-1 loss: β„’Μ‚_𝑆(𝑓) = (1/𝑛) Ξ£_{𝑖=1}^𝑛 1[𝑓(π‘₯_𝑖) β‰  𝑦_𝑖].

Generalization gap: β„’(𝑓) βˆ’ β„’Μ‚_𝑆(𝑓) for 𝑓 = 𝐴(𝑆)

Empirical Risk Minimization (ERM): 𝐴(𝑆) = argmin_{π‘“βˆˆβ„±} β„’Μ‚_𝑆(𝑓)

Bias-Variance Tradeoff

Same setup: ERM 𝐴(𝑆) = argmin_{π‘“βˆˆβ„±} β„’Μ‚_𝑆(𝑓) on 𝑆 = {(π‘₯_𝑖, 𝑦_𝑖)}_{𝑖=1..𝑛}, with population 0-1 loss β„’(𝑓) and empirical 0-1 loss β„’Μ‚_𝑆(𝑓) as above.

Assume β„±_𝐾 = {𝑓_1, …, 𝑓_𝐾} and β„’Μ‚(𝑓_𝑖) = β„’(𝑓_𝑖) + 𝑁(0, 1/𝑛).

[Plot: loss vs. 𝐾 (log scale), decomposed into bias and variance; variance ∼ √(log 𝐾 / 𝑛)?, bias ∼ (log 𝐾)^{βˆ’π›Ό}??]

Can prove: GAP ≀ 𝑂(√(log|β„±| / 𝑛))
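For a finite class this bound follows from Hoeffding's inequality plus a union bound; a sketch of that standard argument (the confidence parameter 𝛿 is introduced here just for concreteness):

```latex
% Hoeffding for a fixed f (the 0-1 loss is bounded in [0,1]):
\[
\Pr\big[\,|\hat{\mathcal{L}}_S(f)-\mathcal{L}(f)|>\varepsilon\,\big]\;\le\;2e^{-2n\varepsilon^2}.
\]
% Union bound over all f in a finite class F:
\[
\Pr\big[\,\exists f\in\mathcal{F}:\ |\hat{\mathcal{L}}_S(f)-\mathcal{L}(f)|>\varepsilon\,\big]
\;\le\;2|\mathcal{F}|\,e^{-2n\varepsilon^2}.
\]
% Setting eps = sqrt(log(2|F|/delta)/(2n)) makes the right-hand side delta, so with
% probability 1 - delta, simultaneously for every f (in particular for f = A(S)):
\[
\mathcal{L}(f)-\hat{\mathcal{L}}_S(f)\;\le\;O\!\Big(\sqrt{\tfrac{\log|\mathcal{F}|}{n}}\Big).
\]
```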

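A quick simulation of the toy model above (𝐾 classifiers whose empirical losses equal their population losses plus independent 𝑁(0, 1/𝑛) noise); the specific loss range and sample size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

for K in [10, 100, 10_000, 1_000_000]:
    pop = rng.uniform(0.1, 0.5, size=K)            # population losses L(f_i)
    emp = pop + rng.normal(0, 1 / np.sqrt(n), K)   # empirical losses: L(f_i) + N(0, 1/n)
    erm = emp.argmin()                             # ERM picks the empirically best classifier
    gap = pop[erm] - emp[erm]                      # generalization gap of the ERM choice
    print(f"K={K:>9}  gap={gap:.3f}  sqrt(log K / n)={np.sqrt(np.log(K) / n):.3f}")
```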

Generalization bounds

Can prove: GAP ≀ 𝑂(√(log|β„±| / 𝑛)), where log|β„±| can be replaced with a dimensionality / capacity measure of β„±:

VC dimension:

max 𝑑 s.t. β„± can fit every labeling 𝑦 ∈ {Β±1}^𝑑 of some samples π‘₯_1, …, π‘₯_𝑑

Rademacher complexity:

max 𝑑 s.t. β„± can fit random labels 𝑦 ∼ {Β±1}^𝑑 on random samples π‘₯_1, …, π‘₯_𝑑 ∼ 𝑋

PAC-Bayes:

𝑑 = 𝐼(𝐴(𝑆); 𝑆), where 𝑆 is the training set

Margin bounds:

max 𝑑 s.t. the linear classifier satisfies βŸ¨π‘€, π‘₯_π‘–βŸ© 𝑦_𝑖 > ‖𝑀‖ β‹… β€–π‘₯_𝑖‖ / βˆšπ‘‘ for correct predictions

General form: If 𝐢(𝑓) β‰ͺ 𝑛 then β„’Μ‚(𝑓) βˆ’ β„’(𝑓) β‰ˆ 0.

Challenge: For many modern deep nets and natural capacity measures 𝐢, we have 𝐢(𝑓) ≫ 𝑛.

Intuition: Can’t β€œoverfit” – if you do well on 𝑛 samples, you must do well on the population.

Hope: Find a better 𝐢?

ICLR 2017


Performance on CIFAR-10 and on noisy CIFAR-10:*

Classifier       | CIFAR-10 train | CIFAR-10 test | Noisy train | Noisy test
Wide ResNet [1]  | 100%           | 92%           | 100%        | 10%
ResNet 18 [2]    | 100%           | 87%           | 100%        | 10%
Myrtle [3]       | 100%           | 86%           | 100%        | 10%

[1] Zagoruyko-Komodakis ’16   [2] He-Zhang-Ren-Sun ’15   [3] Page ’19

Bansal-Kaplun-B, ICLR 2021

β€œDouble descent”

Belkin, Hsu, Ma, Mandal, PNAS 2019.

Nakkiran-Kaplun-Bansal-Yang-B-Sutskever, ICLR 2020

Example:

Data: (π‘₯_𝑖, 𝑓*(π‘₯_𝑖) + 𝑁) where 𝑓* is a degree-𝑑* polynomial

Algorithm: SGD to minimize Ξ£_𝑖 (𝑓(π‘₯_𝑖) βˆ’ 𝑦_𝑖)Β² over 𝑓 ∈ β„±_𝑑 = degree-𝑑 polynomials

Cartoon:

[Plot: train and test loss/error vs. degree 𝑑 = 𝐢(β„±), with 𝑑* and 𝑛 marked on the axis.]

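A minimal sketch of this experiment (all specifics below are my own toy choices, and minimum-norm least squares stands in for SGD; the qualitative train/test curves are what matters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_star, sigma = 20, 3, 0.5

f_star = lambda x: np.polynomial.chebyshev.chebval(x, np.ones(d_star + 1))  # a degree-d* polynomial
x_train = rng.uniform(-1, 1, n)
y_train = f_star(x_train) + sigma * rng.normal(size=n)   # (x_i, f*(x_i) + N)
x_test = rng.uniform(-1, 1, 2000)
y_test = f_star(x_test)

for d in [1, 3, 10, 19, 40, 200]:
    Phi_tr = np.polynomial.chebyshev.chebvander(x_train, d)   # degree-d polynomial features
    Phi_te = np.polynomial.chebyshev.chebvander(x_test, d)
    w = np.linalg.pinv(Phi_tr) @ y_train                      # minimum-norm least-squares fit
    tr = np.mean((Phi_tr @ w - y_train) ** 2)
    te = np.mean((Phi_te @ w - y_test) ** 2)
    print(f"degree {d:>3}: train MSE {tr:7.3f}, test MSE {te:7.3f}")
```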

Part III: Approximation and Representation

[Slide: β€œHello” examples; after a Fourier transform, a linear classifier suffices.]

Fourier Transform

Every continuous 𝑓: [0,1] β†’ ℝ can be approximated arbitrarily well (βˆ«β‚€ΒΉ |𝑓 βˆ’ 𝑔| < πœ–) by 𝑔 of the form 𝑔(π‘₯) = Ξ£_𝑗 𝛼_𝑗 e^(βˆ’2πœ‹π‘– 𝛽_𝑗 π‘₯).

For some natural data, the representation πœ‘ is β€œnice” (e.g. sparse, linearly separable, …).

Tasks become simpler after the transformation π‘₯ β†’ πœ‘(π‘₯), i.e., 𝑓 β‰ˆ linear in πœ‘_1(π‘₯), …, πœ‘_𝑁(π‘₯), where πœ‘_𝑗(π‘₯) = e^(βˆ’2πœ‹π‘– 𝛽_𝑗 π‘₯) and πœ‘: ℝ β†’ ℝ^𝑁 is an embedding.

Example feature map: πœ‘(π‘₯, 𝑦) = (π‘₯𝑦, π‘₯Β², 𝑦²)

Xavier Bourret Sicotte, https://xavierbourretsicotte.github.io/Kernel_feature_map.html
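A sketch of the statement that 𝑓 becomes (approximately) linear in Fourier features: fit a linear model in πœ‘_𝑗(π‘₯) = e^(βˆ’2πœ‹π‘– 𝑗 π‘₯) by least squares. The target function, grid, and number of frequencies are my own choices:

```python
import numpy as np

f = lambda x: np.abs(x - 0.3) + np.sin(6 * np.pi * x)   # a continuous function on [0, 1]

x = np.linspace(0, 1, 500)
N = 20
freqs = np.arange(-N, N + 1)
Phi = np.exp(-2j * np.pi * np.outer(x, freqs))          # features phi_j(x) = exp(-2*pi*i*j*x)

# Linear fit in feature space: g(x) = sum_j alpha_j * phi_j(x)
alpha, *_ = np.linalg.lstsq(Phi, f(x).astype(complex), rcond=None)
g = (Phi @ alpha).real

print("mean |f - g| on [0, 1]:", np.mean(np.abs(f(x) - g)))
```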

More approximation

Every continuous 𝑓: [0,1] β†’ ℝ can be approximated arbitrarily well by 𝑔 of the form 𝑔(π‘₯) = Ξ£_𝑖 𝛼_𝑖 ReLU(𝛽_𝑖 π‘₯ + 𝛾_𝑖), where ReLU(π‘₯) = max(π‘₯, 0).

Proof by picture:

𝑔_π‘Ž(π‘₯) = ReLU((π‘₯ βˆ’ π‘Ž)/𝛿),   𝑔_{π‘Ž+𝛿}(π‘₯) = ReLU((π‘₯ βˆ’ π‘Ž βˆ’ 𝛿)/𝛿)

β„Ž_π‘Ž = 𝑔_π‘Ž βˆ’ 𝑔_{π‘Ž+𝛿}   (a step: 0 for π‘₯ ≀ π‘Ž, 1 for π‘₯ β‰₯ π‘Ž + 𝛿)

𝐼_{π‘Ž,𝑏} = β„Ž_π‘Ž βˆ’ β„Ž_𝑏   (an approximate indicator of [π‘Ž, 𝑏])

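A sketch of the β€œproof by picture”: build the steps β„Ž_π‘Ž and indicators 𝐼_{π‘Ž,𝑏} out of ReLUs and sum them to approximate a 1-D function (the target, knot spacing, and 𝛿 are my own choices):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def step(x, a, delta):
    """h_a: 0 for x <= a, 1 for x >= a + delta, linear ramp in between."""
    return relu((x - a) / delta) - relu((x - a - delta) / delta)

def indicator(x, a, b, delta):
    """I_{a,b} = h_a - h_b: approximately 1 on [a, b] and 0 outside."""
    return step(x, a, delta) - step(x, b, delta)

# Approximate f on [0, 1] by sum_k f(a_k) * I_{a_k, a_{k+1}}.
f = lambda x: np.sin(2 * np.pi * x) + x
delta = 1e-3
knots = np.linspace(0, 1, 101)
x = np.linspace(0, 1, 2000)
g = sum(f(a) * indicator(x, a, b, delta) for a, b in zip(knots[:-1], knots[1:]))

print("max |f - g|:", np.max(np.abs(f(x) - g)))
```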

Higher dimension

A 1-D bump 𝐼(π‘₯) extends along a direction, e.g. 𝐽_{(1,0)}(π‘₯, 𝑦) = 𝐼(π‘₯).

π‘Ÿ(π‘₯) = max(βŸ¨π›Ό, π‘₯⟩ + 𝛽, 0) has 𝑑 + 1 parameters.


How many ReLUs?

Every 𝑓: [0,1]^𝑑 β†’ ℝ can be approximated as a sum of terms π‘Ÿ_{𝛼,𝛽}(π‘₯) = max(βŸ¨π›Ό, π‘₯⟩ + 𝛽, 0).

Can discretize ℝ^{𝑑+1} to 𝑂(1)^𝑑 points – need at most 𝑂(1)^𝑑 = 2^{𝑂(𝑑)} ReLUs.

(For formal proofs, see Cybenko β€˜89, Hornik β€˜91, Barron ’93; MIT 18.408)

THM: For some 𝑓’s, 2^{𝑐⋅𝑑} ReLUs are necessary.

Random 𝑓: [0,1]^𝑑 β†’ ℝ: choose 𝑓(π‘₯) ∈ {0,1} independently at random for π‘₯ ∈ {1/4, 3/4}^𝑑, and extend linearly.

LEMMA: For every fixed 𝑔: [0,1]^𝑑 β†’ ℝ, Pr[ ∫|𝑓 βˆ’ 𝑔| < 0.1 ] ≀ 2^{βˆ’πœ–β‹…2^𝑑}


LEMMA β‡’ THEOREM: Need 2^{πœ–β‹…2^𝑑} distinct 𝑔’s β‡’ need β‰ˆ 2^𝑑 distinct ReLUs.
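One way to spell out this counting step, using the discretization from the previous slide (the constants are schematic):

```latex
% With each ReLU's d+1 parameters discretized to O(1)^d points, a sum of k ReLUs can
% realize at most
\[
\big(O(1)^{d}\big)^{k} \;=\; 2^{O(d k)}
\]
% distinct functions g. By the LEMMA and a union bound, a random f is 0.1-close to one
% of them with probability at most  2^{O(dk)} \cdot 2^{-\epsilon\cdot 2^{d}},  which is
% less than 1 unless
\[
k \;\ge\; \Omega\!\big(2^{d}/d\big) \;\ge\; 2^{c\cdot d}.
\]
```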

PROOF OF LEMMA:

For π‘₯ ∈ 1/4,3/4 𝑑 let 𝑍π‘₯ = 1 iff ∫ 𝑔 𝑦 βˆ’ 𝑓 𝑦 𝑑𝑦 > 0.2 over y ∈ βˆ’1/4,+1/4 𝑑 + π‘₯

Each 𝑍π‘₯ is independent with expectation at least 1/2

β‡’ Pr[1

2𝑑σ𝑍π‘₯ < 1/4 ] ≀ 2βˆ’0.1β‹…2

𝑑 1

2𝑑σ𝑍π‘₯ β‰₯ 1/4 β‡’ ∫ 𝑔 βˆ’ 𝑓 β‰₯ 1/20

Bottom line

For every* function 𝑓: ℝ^𝑑 β†’ ℝ there are many choices of πœ‘: ℝ^𝑑 β†’ ℝ^𝑁 s.t. 𝑓 β‰ˆ Linear_𝑓 ∘ πœ‘

Want to find πœ‘ that is useful for many interesting functions 𝑓

useful:

β€’ 𝑁 is not too big

β€’ Can efficiently compute πœ‘(π‘₯) or βŸ¨πœ‘(π‘₯), πœ‘(𝑦)⟩

β€’ βŸ¨πœ‘(π‘₯), πœ‘(𝑦)⟩ large ⇔ π‘₯ and 𝑦 are β€œsemantically similar”

β€’ For interesting 𝑓’s, the coefficients of Linear_𝑓 are structured.

β€’ …

* Or 𝑓’s such that πœ“ ∘ 𝑓 is interesting for some non-linear πœ“: ℝ β†’ ℝ

Part IV: Kernels

Neural Net: π‘₯ β†’ πœ‘ β†’ 𝐿 (linear).   Kernel: π‘₯ β†’ πœ‘ β†’ 𝐿 (linear).

A spectrum from β€œdefinitely NN” to β€œdefinitely kernel”:

β€’ πœ‘ trained end-to-end for the task at hand

β€’ πœ‘ pretrained with the same data

β€’ πœ‘ pretrained on different data

β€’ πœ‘ pretrained on synthetic data

β€’ πœ‘ hand-crafted based on neural architectures

β€’ πœ‘ hand-crafted (as before the 1990s)

Kernel methods (intuitively)

Distance measure 𝐾(π‘₯, π‘₯β€²)

Input: (π‘₯_1, 𝑦_1), …, (π‘₯_𝑛, 𝑦_𝑛) s.t. 𝑓*(π‘₯_𝑖) β‰ˆ 𝑦_𝑖

To approximate 𝑓*(π‘₯), output 𝑦_𝑖 for the π‘₯_𝑖 closest to π‘₯, or output Ξ£ 𝛼_𝑖 𝑦_𝑖 with 𝛼_𝑖 depending on 𝐾(π‘₯, π‘₯_𝑖)

Hilbert Space

Linear space: 𝑣 + 𝑒, 𝑐 β‹… 𝑣

Dot product: βŸ¨π‘’, 𝑣 + 𝑐 β‹… π‘€βŸ© = βŸ¨π‘’, π‘£βŸ© + π‘βŸ¨π‘’, π‘€βŸ©,   βŸ¨π‘’, π‘£βŸ© = βŸ¨π‘£, π‘’βŸ©,   βŸ¨π‘£, π‘£βŸ© β‰₯ 0

Can solve linear equations of the form { βŸ¨π‘£_𝑖, π‘₯⟩ = 𝑏_𝑖 } knowing only the inner products βŸ¨π‘£_𝑖, 𝑣_π‘—βŸ©.

Can also do least-squares minimization min_π‘₯ Ξ£_𝑖 (βŸ¨π‘£_𝑖, π‘₯⟩ βˆ’ 𝑏_𝑖)Β² knowing only βŸ¨π‘£_𝑖, 𝑣_π‘—βŸ©.

DEF: A symmetric matrix 𝐾 ∈ ℝ^{𝑛×𝑛} is p.s.d. if 𝑣ᡀ𝐾𝑣 β‰₯ 0 for all 𝑣 ∈ ℝ^𝑛. Equivalently, πœ†_1(𝐾), …, πœ†_𝑛(𝐾) β‰₯ 0.

CLAIM: 𝐾 is p.s.d. iff* 𝑒ᡀ𝐾𝑣 is an inner product.

PROOF (β‡’): 𝑒ᡀ𝐾𝑣 = βŸ¨πœ“(𝑒), πœ“(𝑣)⟩ where πœ“(𝑒_1, …, 𝑒_𝑛) = (βˆšπœ†_1 𝑒_1, …, βˆšπœ†_𝑛 𝑒_𝑛), expressed in the eigenbasis of 𝐾.
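A numeric sanity check of this claim: factor a psd matrix through its eigendecomposition, exactly as in the proof sketch (the random matrix below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T                                   # any matrix of this form is symmetric psd

lam, U = np.linalg.eigh(K)                    # K = U diag(lam) U^T with lam >= 0
psi = np.diag(np.sqrt(lam.clip(0))) @ U.T     # psi(v) = sqrt(diag(lam)) U^T v; columns are psi(e_i)

print(np.allclose(K, psi.T @ psi))            # K_ij = <psi(e_i), psi(e_j)>  -> True
```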

Kernel Methods

Goal: Solve linear equations, least-squares minimization, etc. in the feature space of πœ‘.

Observation: It is enough to compute 𝐾(π‘₯, π‘₯β€²) = βŸ¨πœ‘(π‘₯), πœ‘(π‘₯β€²)⟩.

Let π‘₯Μ‚ = πœ‘(π‘₯). Given {(π‘₯̂_𝑖, 𝑦_𝑖)}_{𝑖=1..𝑛}, we want 𝑀̂ s.t. βŸ¨π‘₯̂_𝑖, π‘€Μ‚βŸ© β‰ˆ 𝑦_𝑖 for all 𝑖; we can compute 𝑀̂ = Ξ£ 𝛼_𝑖 π‘₯̂_𝑖 using only 𝐾(π‘₯_𝑖, π‘₯_𝑗).

To predict on a new π‘₯ using 𝑀̂, we only need βŸ¨π‘₯Μ‚, π‘€Μ‚βŸ© = Ξ£ 𝛼_𝑖 𝐾(π‘₯, π‘₯_𝑖).
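A sketch of this pipeline in code: fit Ξ± using only the kernel matrix 𝐾(π‘₯_𝑖, π‘₯_𝑗), then predict with Ξ£ 𝛼_𝑖 𝐾(π‘₯, π‘₯_𝑖). The RBF kernel, the small ridge term, and the data are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def K(x, z, gamma=10.0):
    """RBF kernel k(x, z) = exp(-gamma * (x - z)^2), computed pairwise."""
    return np.exp(-gamma * (x[:, None] - z[None, :]) ** 2)

# Data (x_i, y_i) with y_i ~ f*(x_i) for some unknown f*.
x_train = rng.uniform(-1, 1, 40)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=40)

# Least squares in feature space via the kernel matrix only (small ridge for stability).
alpha = np.linalg.solve(K(x_train, x_train) + 1e-3 * np.eye(40), y_train)

x_new = np.linspace(-1, 1, 5)
pred = K(x_new, x_train) @ alpha          # <w_hat, phi(x)> computed from kernel values only
print(np.c_[np.sin(3 * x_new), pred])     # compare to the noiseless f*
```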
