A Fourier Analysis Perspective of Training Dynamics of ...
Frequency Principle
Yaoyu Zhang, Tao Luo
• Background
• Frequency Principle
• Implication of F-Principle
• Quantitative theory for F-Principle
Background: why does the DNN remain a mystery?
Example 1 (mystery): Deep Learning

Supervised Learning Problem: Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and $\mathcal{H} = \{f(\cdot;\Theta) \mid \Theta \in \mathbb{R}^m\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \dots, n$.

Training by gradient flow $\dot\Theta = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$, with loss
$$L(\Theta) = \frac{1}{2n}\sum_{i=1}^{n} \big(h(x_i;\Theta) - y_i\big)^2.$$
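The training rule on the slide can be sketched numerically: gradient descent (a discretization of the gradient flow $\dot\Theta = -\nabla_\Theta L$) on the least-squares loss. The two-parameter model and the linear target below are illustrative stand-ins, not from the talk.

```python
import numpy as np

# Minimal sketch of the setup: loss L(Theta) = sum_i (h(x_i)-y_i)^2 / (2n),
# minimized by gradient descent, Theta <- Theta - lr * grad L.
x = np.linspace(-1.0, 1.0, 20)
y = 0.5 * x + 0.3                      # target to recover (illustrative)
theta = np.zeros(2)                    # h(x; theta) = theta[0] + theta[1]*x

def loss_and_grad(theta):
    n = len(x)
    r = (theta[0] + theta[1] * x) - y  # residuals h(x_i) - y_i
    L = np.sum(r**2) / (2 * n)
    grad = np.array([np.sum(r) / n, np.sum(r * x) / n])
    return L, grad

lr = 0.5
for _ in range(500):
    L, g = loss_and_grad(theta)
    theta -= lr * g

print(theta)   # approaches (0.3, 0.5)
```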
Example 2 (well understood): polynomial interpolation

Same supervised learning problem: given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $x_i, y_i \in \mathbb{R}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \dots, n$.

$\mathcal{H}$: $h(x;\Theta) = \theta_1 + \theta_2 x + \dots + \theta_m x^{m-1}$ with $m = n$.

find: Newton's interpolation formula gives an explicit solution.

Q: Why do we consider polynomial interpolation well understood?
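The explicit "find" step here is Newton's interpolation formula; a minimal implementation via divided differences (the data points are arbitrary samples for illustration):

```python
import numpy as np

def newton_coeffs(x, y):
    """Divided-difference coefficients c such that
    p(t) = c[0] + c[1](t-x0) + c[2](t-x0)(t-x1) + ..."""
    c = np.array(y, dtype=float)
    n = len(x)
    for j in range(1, n):
        c[j:] = (c[j:] - c[j-1:-1]) / (x[j:] - x[:n-j])
    return c

def newton_eval(c, x, t):
    """Evaluate the Newton form by Horner's scheme."""
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = p * (t - x[k]) + c[k]
    return p

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.sin(np.pi * x)
c = newton_coeffs(x, y)
# The degree-(n-1) polynomial interpolates all n data points exactly.
print(newton_eval(c, x, x) - y)   # ~0 at every data point
```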
Example 3 (well understood): linear spline

Same supervised learning problem: given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ with $x_i, y_i \in \mathbb{R}$.

$\mathcal{H}$: piecewise linear functions.

find: explicit solution (connect neighboring data points).
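The linear-spline "find" step is explicit: connect neighboring data points. `np.interp` implements exactly this piecewise-linear interpolant (data points illustrative):

```python
import numpy as np

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([0.0, 1.0, 0.0, -1.0, 0.0])

print(np.interp(x, x, y))        # exact at the knots
print(np.interp(-0.75, x, y))    # midpoint between (-1,0) and (-0.5,1): 0.5
```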
Why is deep learning a mystery?

Given $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and $\mathcal{H} = \{f(\cdot;\Theta) \mid \Theta \in \mathbb{R}^m\}$, find $f \in \mathcal{H}$ such that $f(x_i) = y_i$ for $i = 1, \dots, n$.

Deep learning (a black box!):
• $\mathcal{D}$: high-dimensional real data (e.g., $d = 32 \times 32 \times 3$)
• $\mathcal{H}$: deep neural network (#para >> #data)
• find: gradient-based method with proper initialization

Conventional methods:
• $\mathcal{D}$: low-dimensional data ($d \le 3$)
• $\mathcal{H}$: spanned by simple basis functions (#para ≤ #data)
• find: explicit formula
Is deep learning alchemy?
https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html
Golden ages of neural networks

1960-1969:
• Data: simple (#data small)
• Model: single-layer NN (cannot solve XOR)
• Training: non-gradient based (non-differentiable activation)

1984-1996:
• Data: moderate (e.g., MNIST)
• Model: multi-layer NN (universal approximation)
• Training: gradient based (BP)

2010-now:
• Data: complex real data (e.g., ImageNet)
• Model: deep NN
• Training: gradient based (BP) with good initialization

The NN is still a black box!
Leo Breiman, 1995:

1. Why don't heavily parameterized neural networks overfit the data?
2. What is the effective number of parameters?
3. Why doesn't backpropagation head for a poor local minimum?
4. When should one stop the backpropagation and use the current parameters?

(Leo Breiman, Reflections After Refereeing Papers for NIPS)
Frequency Principle
Conventional view of generalization
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
-- John von Neumann (see Mayer et al., 2010)

A model that can fit anything likely overfits the data.
Overparameterized DNNs often generalize well.

CIFAR10: 60,000 images (50,000 for training) in 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
Problem simplification

$\mathcal{D}$: 1-d synthetic data with $x \in [-1, 1]$; $\mathcal{H}$: a DNN; find: gradient flow $\dot\Theta = -\nabla_\Theta L(\Theta)$, initialized by a special $\Theta_0$.

We only observe the output function $f(x, t) := f(x; \Theta(t))$.
Overparameterized DNNs still generalize well

#para (~1000) >> #data (5)

(Lei Wu, Zhanxing Zhu, Weinan E, 2017)
Evolution of $f(x, t)$: tanh-DNN with layer widths 200-100-100-50.
Through the lens of the Fourier transform $\hat h(\xi, t)$:

Frequency Principle (F-Principle): DNNs often fit target functions from low to high frequencies during training.

(Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018)
Synthetic curve with equal amplitude
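Such a target can be constructed so that every Fourier mode carries amplitude 1; the particular frequencies (1, 3, 5) below are illustrative, not the ones used in the talk.

```python
import numpy as np

# A synthetic curve with equal amplitude across its Fourier modes.
n = 256
x = np.arange(n) / n
freqs = [1, 3, 5]
f = sum(np.sin(2 * np.pi * k * x) for k in freqs)

spec = np.abs(np.fft.rfft(f)) / (n / 2)     # normalized amplitude spectrum
print([round(spec[k], 6) for k in freqs])   # [1.0, 1.0, 1.0]
```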
How does a DNN fit a 2-d image?

Target: an image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$, where $\mathbf{x}$ is the location of a pixel and $I(\mathbf{x})$ its grayscale value.
F-Principle

High-dimensional real data?

(Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019)
Frequency

Image frequency (NOT USED): the rate of change of intensity across neighboring pixels.

Response frequency: the frequency of a general input-output mapping $f$,
$$\hat f(\mathbf{k}) = \int f(\mathbf{x})\, e^{-i 2\pi \mathbf{k} \cdot \mathbf{x}}\, \mathrm{d}\mathbf{x}.$$
For MNIST, $f: \mathbb{R}^{784} \to \mathbb{R}^{10}$, so $\mathbf{k} \in \mathbb{R}^{784}$.

Image-frequency examples: zero frequency corresponds to a uniform color; a sharp edge is high frequency; an adversarial example (Goodfellow et al.) is also high frequency.
Examining the F-Principle for high-dimensional real problems

Nonuniform discrete Fourier transform (NUDFT) of the training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$:
$$\hat y_{\mathbf{k}} = \frac{1}{n}\sum_{i=1}^{n} y_i\, e^{-i 2\pi \mathbf{k}\cdot\mathbf{x}_i}, \qquad \hat h_{\mathbf{k}}(t) = \frac{1}{n}\sum_{i=1}^{n} h(\mathbf{x}_i, t)\, e^{-i 2\pi \mathbf{k}\cdot\mathbf{x}_i}.$$

Difficulty:
• Curse of dimensionality, i.e., the number of frequencies $\mathbf{k}$ grows exponentially with the problem dimension $d$.

Our approaches:
• Projection, i.e., choose $\mathbf{k} = k\mathbf{p}_1$ along a fixed direction $\mathbf{p}_1$
• Filtering
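The NUDFT above is a direct sum over the samples; a minimal sketch with synthetic data (the sample distribution, target frequency, and probed direction are illustrative):

```python
import numpy as np

# NUDFT of a dataset: y_hat(k) = (1/n) * sum_i y_i * exp(-i 2 pi k . x_i).
rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.uniform(-0.5, 0.5, size=(n, d))    # nonuniform sample locations
k_true = np.array([3.0, 0.0])
y = np.cos(2 * np.pi * X @ k_true)         # target with a single frequency

def nudft(X, y, k):
    return np.mean(y * np.exp(-1j * 2 * np.pi * X @ k))

# Projection approach: probe only k = k * p1 along a fixed direction p1.
p1 = np.array([1.0, 0.0])
amps = [abs(nudft(X, y, k * p1)) for k in range(6)]
print(amps)   # largest response near k = 3
```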
Projection approach (MNIST, CIFAR10)

Relative error: $\Delta_F(k) = |\hat h_k - \hat y_k| / |\hat y_k|$
Decompose the frequency domain by filtering: split $\hat y$ into its low-frequency part (frequencies below a cutoff $k_0$) and the complementary high-frequency part, and track how the DNN output converges to each part.
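A minimal sketch of such a decomposition, using a hard spectral mask (the talk's exact filter may differ; signal and cutoff are illustrative):

```python
import numpy as np

# Split f into low-frequency (modes <= k0) and high-frequency parts.
n, k0 = 256, 10
x = np.arange(n) / n
f = np.sin(2 * np.pi * 2 * x) + 0.3 * np.sin(2 * np.pi * 40 * x)

F = np.fft.rfft(f)
mask = np.arange(len(F)) <= k0
low = np.fft.irfft(np.where(mask, F, 0), n)   # keep only modes <= k0
high = f - low                                 # complementary part

print(np.allclose(low + high, f))   # True: the split is exact
```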
F-Principle in high-dimensional space: MNIST, CIFAR10
F-Principle

Implication of the F-Principle

(Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018; Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019)
Why don't heavily parameterized neural networks overfit the data?

F-Principle: DNNs prefer low frequencies.

Parity function: for $\mathbf{x} \in \{-1, 1\}^n$,
$$f(\mathbf{x}) = \prod_{j=1}^{n} x_j,$$
i.e., an even number of entries equal to $-1$ maps to $1$, an odd number to $-1$.

CIFAR10: test accuracy 72%, far above the 10% of random guessing. Parity: test accuracy ~50%, i.e., random guessing.
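Parity is a worst case for a low-frequency preference: over the Boolean cube $\{-1,1\}^n$, its Walsh-Fourier spectrum sits entirely on the top-order coefficient, which the following check makes explicit:

```python
import numpy as np
from itertools import combinations, product

n = 4
cube = np.array(list(product([-1, 1], repeat=n)))   # all 2^n inputs
parity = np.prod(cube, axis=1)                      # f(x) = prod_j x_j

coeffs = {}
for r in range(n + 1):
    for S in combinations(range(n), r):
        chi = np.prod(cube[:, list(S)], axis=1)     # Walsh function chi_S
        coeffs[S] = np.mean(parity * chi)           # <f, chi_S>

support = [S for S, c in coeffs.items() if abs(c) > 1e-12]
print(support)   # [(0, 1, 2, 3)]: only the full-order coefficient survives
```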
When should one stop the backpropagation and use the current parameters?
Studies elicited by the F-Principle

Theoretical study:
• Rahaman, N., Arpit, D., Baratin, A., Draxler, F., Lin, M., Hamprecht, F. A., Bengio, Y. & Courville, A. (2018), 'On the spectral bias of deep neural networks'
• Jin, P., Lu, L., Tang, Y. & Karniadakis, G. E. (2019), 'Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness'
• Basri, R., Jacobs, D., Kasten, Y. & Kritchman, S. (2019), 'The convergence rate of neural networks for learned functions of different frequencies'
• Zhen, H.-L., Lin, X., Tang, A. Z., Li, Z., Zhang, Q. & Kwong, S. (2018), 'Nonlinear collaborative scheme for deep neural networks'
• Wang, H., Wu, X., Yin, P. & Xing, E. P. (2019), 'High frequency component helps explain the generalization of convolutional neural networks'

Empirical study:
• Jagtap, A. D. & Karniadakis, G. E. (2019), 'Adaptive activation functions accelerate convergence in deep and physics-informed neural networks'
• Stamatescu, V. & McDonnell, M. D. (2018), 'Diagnosing convolutional neural networks using their spectral response'
• Rabinowitz, N. C. (2019), 'Meta-learners' learning dynamics are unlike learners'

Application:
• Wang, F., Müller, J., Eljarrat, A., Henninen, T., Rolf, E. & Koch, C. (2018), 'Solving inverse problems with multi-scale deep convolutional neural networks'
• Cai, W., Li, X. & Liu, L. (2019), 'PhaseDNN: a parallel phase shift deep neural network for adaptive wideband learning'
Quantitative theory for F-Principle
Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019
The NTK regime

Loss $L(\Theta) = \frac{1}{2}\sum_{i=1}^{n} \big(h(x_i;\Theta) - y_i\big)^2$, trained by gradient flow $\dot\Theta = -\nabla_\Theta L(\Theta)$.

• Then $\partial_t h(x;\Theta) = -\sum_{i=1}^{n} K_\Theta(x, x_i)\big(h(x_i;\Theta) - y_i\big)$, where $K_\Theta(x, x') = \nabla_\Theta h(x;\Theta) \cdot \nabla_\Theta h(x';\Theta)$.

• Neural Tangent Kernel (NTK) regime: $K_{\Theta(t)}(x, x') \approx K_{\Theta(0)}(x, x')$ for any $t$.

(Jacot et al., 2018)
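The output-dynamics identity above can be checked numerically: after one tiny gradient step, the change in the network output should match $-K(h - y)$ on the data. The tiny tanh network and data below are illustrative.

```python
import numpy as np

# Check: under gradient flow on L = (1/2) sum_i (h(x_i)-y_i)^2,
# dh(x)/dt = -sum_i K(x, x_i)(h(x_i)-y_i), K = grad_Theta h . grad_Theta h'.
rng = np.random.default_rng(0)
m = 50
w, r, l = rng.normal(size=m), rng.normal(size=m), rng.normal(size=m)

def h(x, w, r, l):
    return np.tanh(r * (x[:, None] + l)) @ w

def grad_h(x, w, r, l):
    """Jacobian of h at points x w.r.t. all parameters (w, r, l)."""
    z = r * (x[:, None] + l)
    s = 1 - np.tanh(z) ** 2                   # tanh'
    return np.hstack([np.tanh(z), w * s * (x[:, None] + l), w * s * r])

x = np.linspace(-1, 1, 5)
y = np.sin(np.pi * x)
J = grad_h(x, w, r, l)
K = J @ J.T                                    # empirical NTK on the data

res = h(x, w, r, l) - y
eps = 1e-6                                     # one tiny gradient step
g = J.T @ res                                  # grad_Theta L
w2, r2, l2 = w - eps * g[:m], r - eps * g[m:2*m], l - eps * g[2*m:]
dh = (h(x, w2, r2, l2) - h(x, w, r, l)) / eps

print(np.max(np.abs(dh + K @ res)))            # ~0: dynamics match -K(h-y)
```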
Problem simplification

Two-layer ReLU NN: $h(x;\Theta) = \sum_{i=1}^{m} w_i\, \sigma(r_i(x + l_i))$.

In the NTK regime, the training reduces to kernel gradient flow:
$$\partial_t f(x, t) = -\sum_{i=1}^{n} K_{\Theta_0}(x, x_i)\big(f(x_i, t) - y_i\big).$$
Linear F-Principle (LFP) dynamics

Two-layer NN: $h(x;\Theta) = \sum_{i=1}^{m} w_i\, \mathrm{ReLU}(r_i(x + l_i))$.

$$\partial_t \hat h(\xi, t) = -\left[\frac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r^2\rangle + \langle w^2\rangle}{\xi^4}\right]\big(\widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t)\big)$$

Notation: $\langle \cdot \rangle$ is the mean over all neurons at initialization; $f$ is the target function; $(\cdot)_p = (\cdot)\, p$ with $p(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i)$ (this sampling introduces aliasing); $\hat{(\cdot)}$ is the Fourier transform; $\xi$ is the frequency.

Assumptions: (i) NTK regime, (ii) sufficiently wide distribution of $l_i$.
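The mechanism is visible in a per-mode sketch of these dynamics: dropping the sampling measure $p$ (i.e., pretending $h_p \approx h$, $f_p \approx f$, so there is no aliasing and the modes decouple), each frequency decays independently. The values of $\langle r^2 w^2\rangle$ and $\langle r^2\rangle + \langle w^2\rangle$ below are illustrative.

```python
import numpy as np

# Decoupled LFP modes: d/dt h_hat(xi) = -lam(xi) * (h_hat(xi) - f_hat(xi)),
# lam(xi) = 4*pi^2*A/xi^2 + B/xi^4 with A = <r^2 w^2>, B = <r^2> + <w^2>.
A, B = 1.0, 1.0

def lam(xi):
    return 4 * np.pi**2 * A / xi**2 + B / xi**4

xis = np.array([1.0, 2.0, 4.0, 8.0])          # probed frequencies
f_hat = np.ones_like(xis)                     # equal-amplitude target modes
t = 0.05
h_hat = f_hat * (1 - np.exp(-lam(xis) * t))   # closed form, h_hat(xi,0) = 0

rel_err = np.abs(h_hat - f_hat)               # remaining error per frequency
print(rel_err)   # increases with xi: low frequencies converge first
```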
Preference induced by the LFP dynamics

The long-time solution of
$$\partial_t \hat h(\xi, t) = -\left[\frac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r^2\rangle + \langle w^2\rangle}{\xi^4}\right]\big(\widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t)\big)$$
solves
$$\min_{h \in F_\gamma} \int \left[\frac{4\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r^2\rangle + \langle w^2\rangle}{\xi^4}\right]^{-1} |\hat h(\xi)|^2\, \mathrm{d}\xi \quad \text{s.t. } h(x_i) = y_i \text{ for } i = 1, \dots, n,$$
a low-frequency preference.

Case 1: $\xi^{-2}$ dominant
• $\min \int \xi^2 |\hat h(\xi)|^2\, \mathrm{d}\xi \sim \min \int |h'(x)|^2\, \mathrm{d}x$: linear spline

Case 2: $\xi^{-4}$ dominant
• $\min \int \xi^4 |\hat h(\xi)|^2\, \mathrm{d}\xi \sim \min \int |h''(x)|^2\, \mathrm{d}x$: cubic spline
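Both limiting minimizers have classical realizations: the piecewise-linear interpolant minimizes $\int |h'|^2$, and the natural cubic spline minimizes $\int |h''|^2$. A quick sketch comparing the two (illustrative data; assumes SciPy is available):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([-1.0, -0.3, 0.2, 0.7, 1.0])
y = np.array([0.0, 0.8, -0.5, 0.4, 0.0])

t = np.linspace(-1, 1, 201)
linear = np.interp(t, x, y)                       # minimizes int |h'|^2
cubic = CubicSpline(x, y, bc_type='natural')(t)   # minimizes int |h''|^2

# Both interpolate the data exactly; they differ only between the knots.
print(np.allclose(np.interp(x, x, y), y),
      np.allclose(CubicSpline(x, y, bc_type='natural')(x), y))
```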
Regularity can be changed through the initialization

Case 1: $\langle r^2\rangle + \langle w^2\rangle \gg 4\pi^2 \langle r^2 w^2\rangle$ ($\xi^{-4}$ dominant): effectively $\min \int \xi^4 |\hat h(\xi)|^2\, \mathrm{d}\xi$ (cubic spline).

Case 2: $4\pi^2 \langle r^2 w^2\rangle \gg \langle r^2\rangle + \langle w^2\rangle$ ($\xi^{-2}$ dominant): effectively $\min \int \xi^2 |\hat h(\xi)|^2\, \mathrm{d}\xi$ (linear spline).
High-dimensional case

$$\partial_t \hat h(\xi, t) = -\left[\frac{\langle |r|^2\rangle + \langle w^2\rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2\rangle}{|\xi|^{d+1}}\right]\big(\widehat{h_p}(\xi, t) - \widehat{f_p}(\xi, t)\big)$$

where $f$ is the target function, $(\cdot)_p = (\cdot)\, p$ with $p(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i)$, $\hat{(\cdot)}$ is the Fourier transform, and $\xi$ is the frequency.

Theorem (informal). The solution of the LFP dynamics at $t \to \infty$ with initial value $h_{\mathrm{ini}}$ coincides with the solution of
$$\min_{h - h_{\mathrm{ini}} \in F_\gamma} \int \left[\frac{\langle |r|^2\rangle + \langle w^2\rangle}{|\xi|^{d+3}} + \frac{4\pi^2 \langle |r|^2 w^2\rangle}{|\xi|^{d+1}}\right]^{-1} \big|\hat h(\xi) - \hat h_{\mathrm{ini}}(\xi)\big|^2\, \mathrm{d}\xi \quad \text{s.t. } h(X) = Y.$$
FP-norm and FP-space

We define the FP-norm for any function $h \in L^2(\Omega)$:
$$\|h\|_\gamma := \|h\|_{H_\Gamma} = \Big(\sum_{\mathbf{k} \in \mathbb{Z}^d_*} \gamma(\mathbf{k})^{-2}\, |\hat h(\mathbf{k})|^2\Big)^{1/2},$$
and the FP-space $F_\gamma(\Omega) = \{h \in L^2(\Omega) : \|h\|_\gamma < \infty\}$.

A priori generalization error bound. Theorem (informal). Suppose the real-valued target function satisfies $f \in F_\gamma(\Omega)$ and $h_n$ is the solution of the regularized model
$$\min_{h \in F_\gamma} \|h\|_\gamma \quad \text{s.t. } h(X) = Y.$$
Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random training samples, the population risk satisfies
$$L(h_n) \le \big(\|f\|_\infty + 2\|f\|_\gamma \|\gamma\|_{l^2}\big)\frac{2}{\sqrt{n}} + 4\sqrt{\frac{2\log(4/\delta)}{n}}.$$
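The FP-norm is a weighted sum over Fourier coefficients and is straightforward to compute for a sampled function; the weight $\gamma(k) = |k|^{-2}$ below is an illustrative choice, not the one derived from the LFP kernel.

```python
import numpy as np

# FP-norm ||h||_gamma = (sum_k gamma(k)^{-2} |h_hat(k)|^2)^(1/2)
# for a function sampled on the 1-d torus.
n = 64
x = np.arange(n) / n
h = np.sin(2 * np.pi * x) + 0.1 * np.sin(2 * np.pi * 5 * x)

h_hat = np.fft.rfft(h) / n                  # one-sided Fourier coefficients
k = np.arange(1, len(h_hat)).astype(float)  # k = 0 is excluded (Z^d_*)
gamma = k ** -2.0
fp_norm = np.sqrt(np.sum(gamma**-2 * np.abs(h_hat[1:])**2))

print(fp_norm)   # the k = 5 mode is weighted by 5^4 and dominates the norm
```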
Leo Breiman 1995
1. Why don’t heavily parameterized neural networks overfit the data?
2. What is the effective number of parameters?
3. Why doesn’t backpropagation head for a poor local minima?
4. When should one stop the backpropagation and use the current parameters?
Leo Breiman Reflections After Refereeing Papers for NIPS
A picture for the generalization mystery of DNNs: among the global minima, the one reached is selected by unbiased initialization + F-Principle + ⋯ (to be discovered).
Conclusion
DNNs prefer low frequencies!
References:
• Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018
• Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019
• Zhang, Xu, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019
• Zhang, Xu, Luo, Ma, A type of generalization error induced by initialization in deep neural networks, 2019
• Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019