Predictive models from interpolation
Daniel Hsu
Computer Science Department & Data Science Institute, Columbia University
Simons Institute for the Theory of Computing, June 3, 2019
Spoilers
"A model with zero training error is overfit to the training data and will typically generalize poorly."
– Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning
[Image: cover of Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2nd ed., Springer Series in Statistics.]
We'll give empirical and theoretical evidence against this conventional wisdom, at least in "modern" settings of machine learning.
Outline
1. Observations that counter the conventional wisdom
2. Risk bounds for prediction rules that interpolate
3. Speculation about some neural networks
Supervised learning
Learning algorithm
Training data (labeled examples): (x₁, y₁), …, (xₙ, yₙ) from X × Y  (IID from P)
Prediction function f̂ : X → Y
Test point x′ ∈ X
Predicted label f̂(x′) ∈ Y
Training (e.g., SGD): w ← w − η ∇R̂(w)
Risk: ℛ(f) := E[ℓ(f(x′), y′)], where (x′, y′) ∼ P
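The gradient update above can be sketched concretely. A minimal example, assuming a 1-D linear model f_w(x) = w·x, squared loss, and full-batch gradient steps (all illustrative choices, not from the talk):

```python
# Gradient descent on the empirical risk R_hat(w) = (1/n) sum_i (w*x_i - y_i)^2,
# i.e., the update w <- w - eta * grad R_hat(w) for a 1-D linear model.
def grad_step(w, data, eta):
    n = len(data)
    g = sum(2 * (w * x - y) * x for x, y in data) / n  # gradient of R_hat at w
    return w - eta * g

data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]  # noiseless y = 2x
w = 0.0
for _ in range(200):
    w = grad_step(w, data, eta=0.1)
# w is now essentially the empirical risk minimizer w = 2
```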
Modern machine learning algorithms
• Choose a (parameterized) function class ℱ ⊂ Y^X
  • E.g., linear functions, polynomials, neural networks with a certain architecture
• Use an optimization algorithm to (attempt to) minimize the empirical risk
  R̂(f) := (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
  (a.k.a. training error).
• But how "big" or "complex" should this function class be?
  (Degree of polynomial, size of neural network architecture, …)
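As a toy illustration of empirical risk minimization, one could take the class of 1-D threshold classifiers f_t(x) = 1{x > t} as a stand-in for the parameterized classes above (the data below is made up):

```python
# Empirical risk (0-1 loss) of the threshold classifier f_t(x) = 1{x > t}
def empirical_risk(t, data):
    return sum(1 for x, y in data if int(x > t) != y) / len(data)

data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
candidates = [x for x, _ in data]   # thresholds at the data points suffice
t_hat = min(candidates, key=lambda t: empirical_risk(t, data))
# t_hat achieves zero training error on this tiny sample
```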
Overfitting
[Figure: true risk and empirical risk vs. model complexity (the classical U-shaped overfitting curve).]
Generalization theory
• Generalization theory explains how overfitting can be avoided.
• Most basic form:
  E max_{f∈ℱ} [ℛ(f) − R̂(f)] ≲ √(Complexity(ℱ) / n)
• Complexity of ℱ can be measured in many ways:
  • Combinatorial parameters (e.g., Vapnik–Chervonenkis dimension)
  • Log-covering numbers in an Lₚ metric
  • Rademacher complexity (supremum of a Rademacher process)
  • Functional / parameter norms (e.g., reproducing kernel Hilbert space norm)
  • …
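Rademacher complexity has a direct Monte-Carlo estimate on a fixed sample. A sketch for the illustrative class of 1-D thresholds f_t(x) = 1{x > t}, where the supremum only runs over the finitely many labelings the class realizes on the sample:

```python
import random

# Estimate E_sigma [ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ]
# for the class F = { x -> 1{x > t} } on a fixed sample xs.
def rademacher_estimate(xs, trials=2000, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    # Distinct behaviors of the class on xs: the "suffix" sets {i : x_i > t}
    cuts = [min(xs) - 1] + sorted(xs)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice([-1, 1]) for _ in range(n)]
        sup = max(sum(s for s, x in zip(sigma, xs) if x > t) / n for t in cuts)
        total += max(sup, 0.0)   # t above all points gives the empty set, value 0
    return total / trials

est = rademacher_estimate([i / 10 for i in range(10)])
# small estimate => the class cannot fit random signs well on this sample
```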
Observations from the field(Belkin, Ma, & Mandal, 2018)
Neural nets & kernel machines:
• Can fit any training data.
• Can generalize even when the training data has a substantial amount of label noise.
Overfitting or perfect fitting?
• Training produces a function f̂ that perfectly fits noisy training data.
• f̂ is likely a very complex function!
• Yet, the test error of f̂ is non-trivial: e.g., noise rate + 5%.

Existing generalization bounds appear uninformative for function classes that can interpolate noisy data.
• f̂ is chosen from a class rich enough to express all possible ways to label Ω(n) training examples.
• Instead, must exploit properties of how f̂ is chosen.
Existing theory about local interpolation
Nearest neighbor (Cover & Hart, 1967)
• Predict with the label of the nearest training example
• Interpolates training data
• Risk → 2 · ℛ(f*) (sort of)
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998)
• Special kind of smoothing kernel regression (like Shepard's method)
• Interpolates training data
• Consistent, but no convergence rates
• Weight function: K(x − xᵢ) = 1 / ‖x − xᵢ‖ᵈ
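The two rules can be sketched in one dimension (toy data; the Hilbert-kernel exponent is d = 1 here):

```python
# 1-nearest-neighbor prediction and Hilbert-kernel (inverse-distance) smoothing.
# Both interpolate: at a training point they return its label exactly.
def nearest_neighbor(x, data):
    return min(data, key=lambda p: abs(p[0] - x))[1]

def hilbert_kernel(x, data, d=1):
    for xi, yi in data:
        if x == xi:          # the weight is infinite at a training point
            return yi
    w = [abs(x - xi) ** -d for xi, _ in data]
    return sum(wi * yi for wi, (_, yi) in zip(w, data)) / sum(w)

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
```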
Our goals
• Further counter the "conventional wisdom" re: interpolation.
  Show interpolation methods can be consistent (or almost consistent) for classification & regression problems:
  • Simplicial interpolation
  • Weighted & interpolated nearest neighbor
• Identify some useful properties of good interpolation methods
• Suggest connections to practical methods
Simplicial interpolation
Basic idea
• Construct an estimate η̂ of the regression function
  η(x) = E[y′ ∣ x′ = x]
• The regression function η is the minimizer of risk for the squared loss ℓ(ŷ, y) = (ŷ − y)².
• For binary classification, Y = {0, 1}:
  • η(x) = Pr(y′ = 1 ∣ x′ = x)
  • Optimal classifier: f*(x) = 1{η(x) > 1/2}
  • We'll construct the plug-in classifier f̂(x) = 1{η̂(x) > 1/2} based on η̂
Consistency and convergence rates
Questions of interest:
• What is the (expected) risk of f̂ as n → ∞? Is it near optimal (ℛ(f*))?
• At what rate (as a function of n) does E[ℛ(f̂)] approach ℛ(f*)?
Interpolation via multivariate triangulation
• IID training examples (x₁, y₁), …, (xₙ, yₙ) ∈ ℝᵈ × [0, 1]
• Partition C := conv(x₁, …, xₙ) into simplices with the xᵢ as vertices via Delaunay triangulation.
• Define η̂(x) on each simplex by affine interpolation of the vertices' labels.
• The result is piecewise linear on C. (Punt on what happens outside of C.)
• For classification (y ∈ {0, 1}), let f̂ be the plug-in classifier based on η̂.
What happens on a single simplex
• Simplex on x₁, …, x_{d+1} with corresponding labels y₁, …, y_{d+1}
• Test point x in the simplex, with barycentric coordinates (w₁, …, w_{d+1}).
• Linear interpolation at x (i.e., least squares fit, evaluated at x):
  η̂(x) = Σᵢ₌₁^{d+1} wᵢ yᵢ
Key idea: aggregates information from all vertices to make prediction.(C.f. nearest neighbor rule.)
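On a 2-D simplex (a triangle) the barycentric coordinates have a closed form, so the affine interpolant above is a few lines of code. A sketch with hypothetical vertices and labels:

```python
# Barycentric coordinates of a point pt relative to a triangle tri, and the
# affine interpolant eta_hat(pt) = sum_i w_i * y_i from the slide.
def barycentric(pt, tri):
    (a1, b1), (a2, b2), (a3, b3) = tri
    det = (b2 - b3) * (a1 - a3) + (a3 - a2) * (b1 - b3)
    w1 = ((b2 - b3) * (pt[0] - a3) + (a3 - a2) * (pt[1] - b3)) / det
    w2 = ((b3 - b1) * (pt[0] - a3) + (a1 - a3) * (pt[1] - b3)) / det
    return (w1, w2, 1.0 - w1 - w2)

def interpolate(pt, tri, labels):
    return sum(w * y for w, y in zip(barycentric(pt, tri), labels))

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [0.0, 0.0, 1.0]   # one "noisy" vertex, as in the label-noise example
```

A plug-in classifier thresholds this at 1/2, so it predicts 1 only where the noisy vertex's barycentric weight exceeds 1/2.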
Comparison to nearest neighbor rule
• Suppose η(x) = Pr(y = 1 ∣ x) < 1/2 for all points in a simplex
  • The optimal prediction f*(x) is 0 for all points in the simplex.
• Suppose y₁ = ⋯ = y_d = 0, but y_{d+1} = 1 (due to "label noise")
[Figure: two panels, "Nearest neighbor rule" and "Simplicial interpolation", each showing a triangle with vertex labels 0, 0, 1.]
f̂(x) = 1 here. The effect is exponentially more pronounced in high dimensions!
Asymptotic risk (binary classification)
Theorem: Assume the distribution of x′ is uniform on some convex set, and η is bounded away from 1/2. Then simplicial interpolation's plug-in classifier f̂ satisfies
  limsupₙ E[ℛ(f̂)] ≤ (1 + e^{−Ω(d)}) · ℛ(f*)
• Near-consistency in high dimension
• C.f. the nearest neighbor classifier: limsupₙ E[ℛ(f̂ₙₙ)] ≈ 2 · ℛ(f*)
• "Blessing" of dimensionality (with a caveat about the convergence rate).
• Also have analysis for regression, and for classification without the condition on η
[Belkin, H., Mitra, '18]
Weighted & interpolated NN
Weighted & interpolated NN scheme
• For a given test point x, let x₍₁₎, …, x₍ₖ₎ be the k nearest neighbors in the training data, and let y₍₁₎, …, y₍ₖ₎ be the corresponding labels.
• Define
  η̂(x) = Σᵢ₌₁ᵏ w(x, x₍ᵢ₎) y₍ᵢ₎ / Σᵢ₌₁ᵏ w(x, x₍ᵢ₎)
  where w(x, x₍ᵢ₎) = ‖x − x₍ᵢ₎‖^(−δ), δ > 0.
• Interpolation: η̂(x) → yᵢ as x → xᵢ
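A 1-D sketch of this scheme (toy data; Euclidean distance, ties broken by sort order):

```python
# Weighted & interpolated nearest-neighbor regression: average the k nearest
# labels with singular weights ||x - x_(i)||^(-delta), delta > 0.
def winn(x, data, k=3, delta=1.0):
    nbrs = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    for xi, yi in nbrs:
        if xi == x:
            return yi              # interpolation: exact at training points
    w = [abs(x - xi) ** -delta for xi, _ in nbrs]
    return sum(wi * yi for wi, (_, yi) in zip(w, nbrs)) / sum(w)

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
```

The singular weight makes the estimate approach the nearest label as x approaches a training point, while the k-neighbor truncation is the localization discussed on the next slide.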
Comparison to Hilbert kernel estimate
Weighted & interpolated NN:
  η̂(x) = Σᵢ₌₁ᵏ w(x, x₍ᵢ₎) y₍ᵢ₎ / Σᵢ₌₁ᵏ w(x, x₍ᵢ₎),  with w(x, x₍ᵢ₎) = ‖x − x₍ᵢ₎‖^(−δ)
  Our analysis allows 0 < δ < d/2.

Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
  η̂(x) = Σᵢ₌₁ⁿ w(x, xᵢ) yᵢ / Σᵢ₌₁ⁿ w(x, xᵢ),  with w(x, xᵢ) = ‖x − xᵢ‖^(−δ)
  MUST have δ = d for consistency.
Localization is essential to get non-asymptotic rate.
Convergence rates (regression)
Theorem: Assume the distribution of x′ is uniform on some compact set satisfying a regularity condition, and η is β-Hölder smooth.
For an appropriate setting of k, the weighted & interpolated NN estimate η̂ satisfies
  E[ℛ(η̂)] ≤ ℛ(η) + O(n^{−2β/(2β+d)})
• Consistency + optimal rates of convergence for an interpolating method.
• Also get consistency and rates for classification.
• Follow-up work by Belkin, Rakhlin, & Tsybakov '19: similar results for Nadaraya–Watson with a compact & singular kernel.
[Belkin, H., Mitra, '18]
What about the other NN?
Two layer fully-connected neural networks
[Belkin, H., Ma, Mandal, '18]
[Figure: experiments with a random first layer and with a trained first layer.]
"Double descent" risk curve
[Belkin, H., Ma, Mandal, '18]
Linear regression w/ weak features
[Figure: risk E[Error] and R(α) vs. the fraction of features used, α = p/N.]
[Belkin, H., Xu, '19; Xu, H., '19]
Gaussian design linear model with N features.
All features are "relevant" but equally weak.
Only use p of the features (1 ≤ p ≤ N).
Least squares (p ≤ n) or least norm (p ≥ n) fit.
Concurrent work by Hastie, Montanari, Rosset, Tibshirani '19 (also for some non-linear models!),and by Muthukumar, Vodrahalli, Sahai, '19; Bartlett, Long, Lugosi, Tsigler, '19.
What happens in the limit N → ∞?
N → ∞: an infinite-width neural net.
(Note: the number of data points n is held constant.)
  h_{n,∞} = lim_{N→∞} h_{n,N}
What is the classifier corresponding to this neural net?
A kernel machine!
  h_{n,∞} = argmin { ‖h‖_ℋ : h ∈ ℋ, h(xᵢ) = yᵢ for all i }
Minimum-norm interpolation in the Gaussian RKHS ℋ.
RFF approximates the kernel machine; the minimum RFF norm converges to the minimum ℋ norm.
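The minimum-norm interpolant has a closed form: h(x) = Σᵢ αᵢ K(x, xᵢ) with K α = y, where K is the kernel matrix on the training points. A small self-contained sketch with a Gaussian kernel (bandwidth and data made up), solving the system by Gaussian elimination:

```python
import math

def gauss_kernel(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting (enough for this tiny system)
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [v - f * u for v, u in zip(M[r], M[c])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Minimum-RKHS-norm interpolant: h(x) = sum_i alpha_i K(x, x_i), with K alpha = y
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
K = [[gauss_kernel(a, b) for b in xs] for a in xs]
alpha = solve(K, ys)
h = lambda x: sum(ai * gauss_kernel(x, xi) for ai, xi in zip(alpha, xs))
```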
[Figure: norm of the solution and test error vs. network width, each converging to the kernel machine's value.]
Is infinite width optimal?
h_{n,∞} appears to be near-optimal empirically.
Suppose yᵢ = h*(xᵢ) for all i, for some h* ∈ ℋ (Gaussian RKHS).
Theorem:
  |h*(x) − h_{n,∞}(x)| = O(e^{−B (n / log n)^{1/d}}) · ‖h*‖_ℋ
Compare to O(1/√n) for classical bias–variance analyses.
[Belkin, H., Ma, Mandal, '18]
Occam's razor
Occam's razor based on inductive bias:
Choose the smoothest function subject to interpolating the data.
Three ways to increase smoothness:
• Explicit: minimum functional norm solutions (RFF, ReLU features)
• Implicit: SGD / optimization (neural networks)
(All coincide for kernel machines.)
Conclusions and open problems
1. Interpolation is compatible with good statistical properties.
2. Need a good inductive bias, e.g.:
   • functions that do local averaging in high dimensions
   • low function-space norm

Open problems
• Formally characterize the inductive bias of interpolation with existing methods (e.g., neural nets, kernel machines, random forests)
  • Srebro: simplicial interpolation = GD on an infinite-width ReLU network (d = 1)
• Benefits of interpolation?
Acknowledgements
• Collaborators: Misha Belkin, Siyuan Ma, Soumik Mandal, Partha Mitra, Ji Xu
• Some slides borrowed from Misha!
• National Science Foundation
• Sloan Foundation
• Simons Institute for the Theory of Computing