Generalization in Deep Learning
Peter Bartlett Alexander Rakhlin
UC Berkeley
MIT
May 28, 2019
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
What is Generalization?
log(1 + 2 + 3) = log(1) + log(2) + log(3)
log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)
log(2 + 2) = log(2) + log(2)
log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Statistical Learning Theory
Aim: Predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X.
Use data set of n pairs
S = (x1, y1), . . . , (xn, yn)
to choose a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.
Formulations of Prediction Problems
To define the notion of a ‘good prediction,’ we can define a loss function
ℓ : Y × Y → R.

ℓ(ŷ, y) is the cost of predicting ŷ when the outcome is y.

Aim: ℓ(f(x), y) small.
Formulations of Prediction Problems
Example. In pattern classification problems, the aim is to classify a pattern x into one of a finite number of classes (that is, the label space Y is finite). If all mistakes are equally bad, we could define

ℓ(ŷ, y) = I{ŷ ≠ y} = { 1 if ŷ ≠ y, 0 otherwise }.

Example. In a regression problem, with Y = R, we might choose the quadratic loss function, ℓ(ŷ, y) = (ŷ − y)².
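As a quick illustration, a minimal Python sketch of these two example losses (added here for concreteness; not part of the original slides):

```python
def zero_one_loss(y_hat, y):
    """Indicator loss: 1 if the predicted label differs from the outcome, 0 otherwise."""
    return 1.0 if y_hat != y else 0.0

def squared_loss(y_hat, y):
    """Quadratic loss for regression with Y = R."""
    return (y_hat - y) ** 2
```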
Probabilistic Assumptions
Assume:
- There is a probability distribution P on X × Y,
- The pairs (X1, Y1), . . . , (Xn, Yn), (X, Y) are chosen independently according to P.
The aim is to choose f with small risk:
L(f ) = E`(f (X ),Y ).
For instance, in the pattern classification example, this is the misclassification probability:

L(f) = L01(f) = E I{f(X) ≠ Y} = Pr(f(X) ≠ Y).
Why not estimate the underlying distribution P (or PY |X ) first?
This is in general a harder problem than prediction.
Key difficulty: our goals are in terms of unknown quantities related to unknown P. Have to use empirical data instead. Purview of statistics.
For instance, we can calculate the empirical loss of f : X → Y

L̂(f) = (1/n) ∑_{i=1}^{n} ℓ(f(Xi), Yi).
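A minimal sketch of computing this empirical loss (illustrative names `f`, `loss`, `X`, `Y`, not objects defined in the slides):

```python
import numpy as np

def empirical_risk(f, loss, X, Y):
    """Average loss of predictor f over the sample (X_1, Y_1), ..., (X_n, Y_n)."""
    return float(np.mean([loss(f(x), y) for x, y in zip(X, Y)]))
```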
The function x ↦ fn(x) = fn(x; X1, Y1, . . . , Xn, Yn) is random, since it depends on the random data S = (X1, Y1, . . . , Xn, Yn). Thus, the risk

L(fn) = E[ ℓ(fn(X), Y) | S ] = E[ ℓ(fn(X; X1, Y1, . . . , Xn, Yn), Y) | S ]

is a random variable. We might aim for E L(fn) small, or L(fn) small with high probability (over the training data).
Quiz: what is random here?
1. L(f) for a given fixed f
2. fn
3. L(fn)
4. L̂(fn)
5. L̂(f) for a given fixed f
Theoretical analysis of performance is typically easier if fn has closed form (in terms of the training data).

E.g. ordinary least squares fn(x) = xᵀ(XᵀX)⁻¹XᵀY.
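A hedged numpy sketch of this closed form (variable names are illustrative: X is an n × d design matrix assumed to have invertible XᵀX, Y the response vector):

```python
import numpy as np

def ols_predictor(X, Y):
    """Return x -> x^T (X^T X)^{-1} X^T Y."""
    beta = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
    return lambda x: x @ beta

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)
f_n = ols_predictor(X, Y)
print(f_n(X[0]), Y[0])   # prediction vs. observed outcome on a training point
```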
Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as

- solutions to an optimization objective (e.g. logistic regression)
- iterative procedures without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc)
The Gold Standard
Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function

f∗ = argmin_f L(f)

where the minimization is over all (measurable) prediction rules f : X → Y.

The value of the lowest expected loss is called the Bayes error:

L(f∗) = inf_f L(f)

Of course, we cannot calculate any of these quantities since P is unknown.
Bayes Optimal Function
Bayes optimal function f∗ takes on the following forms in these two particular cases:

- Binary classification (Y = {0, 1}) with the indicator loss:

  f∗(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]

  [figure: η(x) thresholded at 1/2]

- Regression (Y = R) with squared loss:

  f∗(x) = η(x), where η(x) = E[Y | X = x]
The big question: is there a way to construct a learning algorithm with a guarantee that

L(fn) − L(f∗)

is small for large enough sample size n?
Consistency
An algorithm that ensures

lim_{n→∞} L(fn) = L(f∗) almost surely

is called universally consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.
The good news: consistency is possible to achieve.
- easy if X is a finite or countable set
- not too hard if X is infinite, and the underlying relationship between x and y is “continuous”
The bad news...
In general, we cannot prove anything quantitative about L(fn) − L(f∗), unless we make further assumptions (incorporate prior knowledge).
“No Free Lunch” Theorems: unless we posit assumptions,
- For any algorithm fn, any n and any ε > 0, there exists a distribution P such that L(f∗) = 0 and

  E L(fn) ≥ 1/2 − ε

- For any algorithm fn, and any sequence an that converges to 0, there exists a probability distribution P such that L(f∗) = 0 and for all n

  E L(fn) ≥ an
is this really “bad news”?
Not really. We always have some domain knowledge.
Two ways of incorporating prior knowledge:
- Direct way: assumptions on distribution P (e.g. margin, smoothness of regression function, etc)
- Indirect way: redefine the goal to perform as well as a reference set F of predictors:

  L(fn) − inf_{f∈F} L(f)
F encapsulates our inductive bias.
We often make both of these assumptions.
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Perceptron
(x1, y1), . . . , (xT, yT) ∈ X × {±1} (T may or may not be same as n)

Maintain a hypothesis wt ∈ Rᵈ (initialize w1 = 0).

On round t,

- Consider (xt, yt)
- Form prediction ŷt = sign(〈wt, xt〉)
- If ŷt ≠ yt, update

  wt+1 = wt + yt xt

  else wt+1 = wt
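A minimal Python sketch of these updates (it assumes labels in {−1, +1} and the rows of `xs` as the xt; names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_pass(xs, ys):
    """One pass of the Perceptron over (x_1, y_1), ..., (x_T, y_T); returns final w and mistake count."""
    w = np.zeros(xs.shape[1])          # w_1 = 0
    mistakes = 0
    for x, y in zip(xs, ys):
        if y * (w @ x) <= 0:           # prediction sign(<w_t, x_t>) disagrees with y_t
            w = w + y * x              # w_{t+1} = w_t + y_t x_t
            mistakes += 1
        # otherwise w_{t+1} = w_t
    return w, mistakes
```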
For simplicity, suppose all data are in a unit ball, ‖xt‖ ≤ 1.
Margin with respect to (x1, y1), . . . , (xT, yT):

γ = max_{‖w‖=1} min_{i∈[T]} ( yi 〈w, xi〉 )₊ ,

where (a)₊ = max{0, a}.

Theorem (Novikoff ’62). Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.
Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

‖wt+1‖² = ‖wt + yt xt‖² ≤ ‖wt‖² + 2 yt〈wt, xt〉 + 1 ≤ ‖wt‖² + 1.

Hence, ‖wT‖² ≤ m.
For optimal hyperplane w∗,

γ ≤ 〈w∗, yt xt〉 = 〈w∗, wt+1 − wt〉.
Hence (adding and canceling: summing the previous inequality over the m mistake rounds, the updates telescope from w1 = 0 to the final hyperplane, and ‖w∗‖ = 1),

mγ ≤ 〈w∗, wT〉 ≤ ‖wT‖ ≤ √m,

so m ≤ 1/γ².
Recap
For any T and (x1, y1), . . . , (xT , yT ),
∑_{t=1}^{T} I{yt〈wt, xt〉 ≤ 0} ≤ D²/γ²

where γ = γ(x1:T, y1:T) is margin and D = D(x1:T, y1:T) = maxt ‖xt‖.
Let w∗ denote the max margin hyperplane, ‖w∗‖ = 1.
Consequence for i.i.d. data (I)
Do one pass on i.i.d. sequence (X1,Y1), . . . , (Xn,Yn) (i.e. T = n).
Claim: expected indicator loss of randomly picked function x ↦ 〈wτ, x〉 (τ ∼ unif{1, . . . , n}) is at most

(1/n) × E[ D²/γ² ].
Proof: Take expectation on both sides:

E[ (1/n) ∑_{t=1}^{n} I{Yt〈wt, Xt〉 ≤ 0} ] ≤ E[ D²/(nγ²) ]

Left-hand side can be written as (recall notation S = (Xi, Yi)_{i=1}^{n})

Eτ ES I{Yτ〈wτ, Xτ〉 ≤ 0}

Since wτ is a function of X1:τ−1, Y1:τ−1, above is

ES Eτ EX,Y I{Y〈wτ, X〉 ≤ 0} = ES Eτ L01(wτ)
Claim follows.
NB: To ensure E[D2/γ2] is not infinite, we assume margin γ in P.
Consequence for i.i.d. data (II)
Now, rather than doing one pass, cycle through data
(X1,Y1), . . . , (Xn,Yn)
repeatedly until no more mistakes (i.e. T ≤ n × (D²/γ²)).
Then the final hyperplane of Perceptron separates the data perfectly, i.e. finds minimum L̂01(wT) = 0 for

L̂01(w) = (1/n) ∑_{i=1}^{n} I{Yi〈w, Xi〉 ≤ 0}.
Empirical Risk Minimization with finite-time convergence. Can we say anything about future performance of wT?
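A hedged sketch of this cycling variant (it assumes the sample is linearly separable with positive margin, otherwise it never terminates; names are illustrative):

```python
import numpy as np

def perceptron_erm(xs, ys, max_passes=10_000):
    """Cycle over the data, applying the Perceptron update on mistakes, until a clean pass."""
    w = np.zeros(xs.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * (w @ x) <= 0:       # training error on (x, y)
                w = w + y * x
                mistakes += 1
        if mistakes == 0:              # empirical 0-1 loss of w is now zero
            return w
    raise RuntimeError("did not separate the data within max_passes")
```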
Consequence for i.i.d. data (II)
Claim:
E L01(wT) ≤ (1/(n + 1)) × E[ D²/γ² ]
Proof: Shortcuts: z = (x, y) and ℓ(w, z) = I{y〈w, x〉 ≤ 0}. Then

ES EZ ℓ(wT, Z) = E_{S, Zn+1} [ (1/(n + 1)) ∑_{t=1}^{n+1} ℓ(w^{(−t)}, Zt) ]

where w^{(−t)} is Perceptron’s final hyperplane after cycling through data Z1, . . . , Zt−1, Zt+1, . . . , Zn+1.
That is, leave-one-out is unbiased estimate of expected loss.
Now consider cycling Perceptron on Z1, . . . , Zn+1 until no more errors. Let i1, . . . , im be indices on which Perceptron errs in any of the cycles. We know m ≤ D²/γ². However, if index t ∉ {i1, . . . , im}, then whether or not Zt was included does not matter, and Zt is correctly classified by w^{(−t)}. Claim follows.
A few remarks
- Bayes error L01(f∗) = 0 since we assume margin in the distribution P (and, hence, in the data).
- Our assumption on P is not about its parametric or nonparametric form, but rather on what happens at the boundary.
- Curiously, we beat the CLT rate of 1/√n.
Overparametrization and overfitting

The last lecture will discuss overparametrization and overfitting.

- dimension d never appears in the mistake bound of Perceptron. Hence, we can even take d = ∞.
- Perceptron is a 1-layer neural network (no hidden layer) with d neurons.
- For d > n, any n points in general position can be separated (zero empirical error).

Conclusions:

- More parameters than data does not necessarily contradict good prediction performance (“generalization”).
- Perfectly fitting the data does not necessarily contradict good generalization (particularly with no noise).
- Complexity is subtle (e.g. number of parameters vs margin)
- ... however, margin is a strong assumption (more than just no noise)
Outline
Introduction
  Setup
  Perceptron
  Bias-Variance Tradeoff
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
Relaxing the assumptions
We proved

E L01(wT) − L01(f∗) ≤ O(1/n)

under the assumption that P is linearly separable with margin γ.

The assumption on the probability distribution implies that the Bayes classifier is a linear separator (“realizable” setup).
We may relax the margin assumption, yet still hope to do well with linear separators. That is, we would aim to minimize

E L01(wT) − L01(w∗)

hoping that

L01(w∗) − L01(f∗)

is small, where w∗ is the best hyperplane classifier with respect to L01.
More generally, we will work with some class of functions F that we hope captures well the relationship between X and Y. The choice of F gives rise to a bias-variance decomposition.

Let fF ∈ argmin_{f∈F} L(f) be the function in F with smallest expected loss.
Bias-Variance Tradeoff
L(fn) − L(f∗) = [ L(fn) − inf_{f∈F} L(f) ] + [ inf_{f∈F} L(f) − L(f∗) ],

where the first bracket is the Estimation Error and the second is the Approximation Error.

[figure: the class F with fn, fF, and f∗ marked]
Clearly, the two terms are at odds with each other:
- Making F larger means smaller approximation error but (as we will see) larger estimation error
- Taking a larger sample n means smaller estimation error and has no effect on the approximation error.
- Thus, it makes sense to trade off size of F and n (Structural Risk Minimization, or Method of Sieves, or Model Selection).
Bias-Variance Tradeoff
In simple problems (e.g. linearly separable data) we might get away with having no bias-variance decomposition (e.g. as was done in the Perceptron case).

However, for more complex problems, our assumptions on P often imply that the class containing f∗ is too large and the estimation error cannot be controlled. In such cases, we need to produce a “biased” solution in hopes of reducing variance.

Finally, the bias-variance tradeoff need not be in the form we just presented. We will consider a different decomposition in the last lecture.
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
Consider the performance of empirical risk minimization:
Choose fn ∈ F to minimize L̂(f), where L̂ is the empirical risk,

L̂(f) = Pn ℓ(f(X), Y) = (1/n) ∑_{i=1}^{n} ℓ(f(Xi), Yi).
For pattern classification, this is the proportion of training examples misclassified.
How does the excess risk, L(fn) − L(fF), behave? We can write

L(fn) − L(fF) = [ L(fn) − L̂(fn) ] + [ L̂(fn) − L̂(fF) ] + [ L̂(fF) − L(fF) ]

Therefore,

E L(fn) − L(fF) ≤ E[ L(fn) − L̂(fn) ]

because the second term is nonpositive (fn minimizes L̂ over F) and the third is zero in expectation.
The term L(fn) − L̂(fn) is not necessarily zero in expectation (check!). An easy upper bound is

L(fn) − L̂(fn) ≤ sup_{f∈F} | L(f) − L̂(f) | ,
and this motivates the study of uniform laws of large numbers.
Roadmap: study the sup to
- understand when learning is possible,
- understand implications for sample complexity,
- design new algorithms (regularizer arises from upper bound on sup)
Loss Class
In what follows, we will work with a function class G on Z, and for the learning application, G = ℓ ∘ F:

G = { (x, y) ↦ ℓ(f(x), y) : f ∈ F }
and Z = X × Y.
Empirical Process Viewpoint

[Figure, built up over several slides: for each function g (ranging over all functions, with the class G highlighted), the expectation Eg is compared with the empirical average (1/n) ∑_{i=1}^{n} g(zi); the functions g0, gn, and gG are marked.]
Uniform Laws of Large Numbers
For a class G of functions g : Z → [0, 1], suppose that Z1, . . . , Zn, Z are i.i.d. on Z, and consider

U = sup_{g∈G} | Eg(Z) − (1/n) ∑_{i=1}^{n} g(Zi) | = sup_{g∈G} |Pg − Png| =: ‖P − Pn‖G ,

where P − Pn is the empirical process.
If U converges to 0, this is called a uniform law of large numbers.
Glivenko-Cantelli Classes
Definition. G is a Glivenko-Cantelli class for P if ‖Pn − P‖G → 0 in probability.

- P is a distribution on Z,
- Z1, . . . , Zn are drawn i.i.d. from P,
- Pn is the empirical distribution (mass 1/n at each of Z1, . . . , Zn),
- G is a set of measurable real-valued functions on Z with finite expectation under P,
- Pn − P is an empirical process, that is, a stochastic process indexed by a class of functions G, and
- ‖Pn − P‖G := sup_{g∈G} |Png − Pg|.
Glivenko-Cantelli Classes

Why ‘Glivenko-Cantelli’? An example of a uniform law of large numbers.

Theorem (Glivenko-Cantelli). ‖Fn − F‖∞ → 0 almost surely.

Here, F is a cumulative distribution function, Fn is the empirical cumulative distribution function,

Fn(t) = (1/n) ∑_{i=1}^{n} I{Zi ≤ t},

where Z1, . . . , Zn are i.i.d. with distribution F, and ‖F − G‖∞ = sup_t |F(t) − G(t)|.

Theorem (Glivenko-Cantelli). ‖Pn − P‖G → 0 almost surely, for G = { z ↦ I{z ≤ θ} : θ ∈ R }.
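A small Monte Carlo sketch of the first statement (an illustrative choice not made in the slides: Zi ∼ Uniform(0, 1), so F(t) = t on [0, 1]); the uniform deviation shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [100, 1_000, 10_000]:
    z = np.sort(rng.uniform(size=n))
    upper = np.arange(1, n + 1) / n   # F_n evaluated just after each jump
    lower = np.arange(0, n) / n       # F_n evaluated just before each jump
    sup_dev = max(np.max(np.abs(upper - z)), np.max(np.abs(lower - z)))
    print(n, sup_dev)                 # ||F_n - F||_inf, decreasing roughly like 1/sqrt(n)
```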
Glivenko-Cantelli Classes
Not all G are Glivenko-Cantelli classes. For instance,

G = { I{z ∈ S} : S ⊂ R, |S| < ∞ } .

Then for a continuous distribution P, Pg = 0 for any g ∈ G, but sup_{g∈G} Png = 1 for all n. So although Png → Pg almost surely for all g ∈ G, this convergence is not uniform over G. G is too large.
In general, how do we decide whether G is “too large”?
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
Rademacher Averages

Given a set H ⊆ [−1, 1]ⁿ, measure its “size” by the average length of projection onto a random direction.

Rademacher process indexed by vector h ∈ H:

Rn(h) = (1/n) 〈ε, h〉

where ε = (ε1, . . . , εn) is a vector of i.i.d. Rademacher random variables (symmetric ±1).

Rademacher averages: expected maximum of the process

Eε ‖Rn‖H = E sup_{h∈H} | (1/n) 〈ε, h〉 |
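A Monte Carlo sketch of this quantity for a finite H given as rows of a matrix (illustrative code, not from the slides; the two example sets below anticipate the next slide):

```python
import numpy as np

def rademacher_average(H, num_draws=2000, seed=0):
    """Estimate E sup_{h in H} |<eps, h>| / n over the rows of H by sampling sign vectors."""
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(num_draws, n))    # i.i.d. symmetric signs
    return float(np.mean(np.max(np.abs(eps @ H.T), axis=1)) / n)

n = 8
print(rademacher_average(np.ones((1, n))))                        # singleton {h0}: O(n^{-1/2})
cube = np.array(np.meshgrid(*([[-1, 1]] * n))).reshape(n, -1).T   # all of {-1, 1}^n
print(rademacher_average(cube))                                   # full sign cube: equals 1
```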
Examples
If H = {h0},

E ‖Rn‖H = O(n^{−1/2}).

If H = {−1, 1}ⁿ,

E ‖Rn‖H = 1.

If v + c·{−1, 1}ⁿ ⊆ H, then

E ‖Rn‖H ≥ c − o(1)

How about H = {−1, 1} where 1 is a vector of 1’s?
Examples
If H is finite,

E ‖Rn‖H ≤ c √( log |H| / n )

for some constant c.

This bound can be loose, as it does not take into account “overlaps”/correlations between vectors.
Examples
Let Bₚⁿ be a unit ball in Rⁿ:

Bₚⁿ = { x ∈ Rⁿ : ‖x‖ₚ ≤ 1 },   ‖x‖ₚ = ( ∑_{i=1}^{n} |xi|^p )^{1/p}.

For H = B₂ⁿ,

(1/n) E max_{‖g‖₂ ≤ 1} |〈ε, g〉| = (1/n) E ‖ε‖₂ = 1/√n.

On the other hand, Rademacher averages for B₁ⁿ are 1/n.
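To see the B₁ⁿ value, note that the supremum of the linear functional g ↦ 〈ε, g〉 over a unit ball is the corresponding dual norm of ε: over B₂ⁿ it is ‖ε‖₂ = √n, while over B₁ⁿ it is ‖ε‖∞ = 1; dividing by n gives 1/√n and 1/n respectively.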
Rademacher Averages
What do these Rademacher averages have to do with our problem of bounding uniform deviations?
Suppose we take
H = G(Z₁ⁿ) = { (g(Z1), . . . , g(Zn)) : g ∈ G } ⊂ Rⁿ
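As a concrete (illustrative) instance, the projection of the threshold class G = { z ↦ I{z ≤ θ} } from the Glivenko-Cantelli slide onto a sample Z1, . . . , Zn contains at most n + 1 distinct 0/1 vectors, so its Rademacher average can be estimated directly (a hedged sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Z = np.sort(rng.uniform(size=n))
# One threshold per distinct projection: below all points, between consecutive points, above all.
thetas = np.concatenate(([Z[0] - 1.0], (Z[:-1] + Z[1:]) / 2, [Z[-1] + 1.0]))
H = (Z[None, :] <= thetas[:, None]).astype(float)        # rows are (g(Z_1), ..., g(Z_n))

draws = 2000
eps = rng.choice([-1.0, 1.0], size=(draws, n))
est = np.mean(np.max(np.abs(eps @ H.T), axis=1)) / n     # Monte Carlo Rademacher average
print(est)   # small: bounded by c * sqrt(log(n + 1) / n) via the finite-class bound above
```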
g1<latexit sha1_base64="upU6x31J8AFJsV7QHUFUq5QBYAA=">AAAJ2HicjVZLb+M2ENbu9hG7r30cexEaBNgCQmClTWIfCgS72bYBtkm6STbbtY2AoiiZMCkKJBVHFQT0UKDooZce2p/T39Fbf0qHkh1btIJWkOnBzDfDmW+GhIKUUaV7vb/v3X/wzrvvvb/R6X7w4Ucff/Lw0ePXSmQSkwssmJBvAqQIowm50FQz8iaVBPGAkctg+tzYL6+JVFQk5zpPyZijOKERxUiD6iy+8q8ebva2e72e7/uuEfz9vR4Ig0F/x++7vjHBs3nwZOvnfxzHOb16tPHXKBQ44yTRmCGlhn4v1eMCSU0xI2V3lCmSIjxFMRmCmCBO1Lioci3dLdCEbiQk/BLtVtpVjwJxpXIeAJIjPVG2zSjbbMNMR/1xQZM00yTB9UZRxlwtXFO4G1JJsGY5CAhLCrm6eIIkwhroaewyAbiUJGpWgrmmqYcYGxc3ecNUaDr9samB/jBLo0EgSQx9qg3GidFAIpkDdVLMlKcmKCWq7G5Z/CoiaeRVJVdSKkVElOkpYoZFb4KSUGTaqxsRQP+JhDDucn/lPo0ECz9fi/3fUVcCrvJx7o8LAzNkNyzQ6QRLopuTUNzwlGdM06aWZIzIa25FQCwW0KIJ924lipvtSITkiBE+LiACbwZN42bviiDg/zt7TbnpQaMglGDCSrehNMRJFVnQmURpRGML+11yBiMtmiMxjCTQyomeiPCrkEQI2BkXPKzUYdk2+N78cHgMaXIDklEAOU1wDDlMKL5p7iaBp9C7NhdFOF4yvEPu5N69k3wUKMHgnHkC7haG8nEB6ehUABkQjIAyN+kWo+p4FwHLSFkuLCFVKfhYRnB0n5+8PHnlHr74+uj46Pzo5PisOwJeIOEaeYjk9BtJSFIWMg7Korft73pwVe2Zxd81nK/CX9J4on8gjInZrUPfQKtl0I6H+Pktet8A68UGz3NZYqtM6qU18DNT5gI82K1TqNZBK95y8Ctove61OryCmbkts+ZlvtrwZwzauMACBF4bckZSisoC83xqMBDxC8+Hv/1eacWSFE+rrRdYd7s/gGXwJSw7fQt+OaF6URUUY14LcZrJlJGVhkFrPQNKyAwLzuFOKkazKswQzvDIDF7tWSuLTb+00NWEWeBK14KVUIoFNaoWpJAoidfizrUt+LiaXAu+MtJteZs+WR7z5q2jmRmClvSXw7Huo6o2Ww7z3rfkY3rdssNyBlqrzlszqg/PukMIjLQ5LQ/cuk9az4zlsZikCr/lvjg+rC+Ysy588iy+a9y7hdc72z7I3/ubB32nfjacT53PnKeO7+w7B863zqlz4WAndn5z/nD+7Lzt/NT5pfNrDb1/b+7zxGk8nd//BQ28jZo=</latexit><latexit sha1_base64="sXoBpG6JmbUzgx55nt8lBNfWW3Y=">AAAJ2HicjVZfb+NEEPcd/5rw74575MWiqnRIVmUX2iYPSNVdD6h0tOXaXg+SqFqv184qu15rd900WJZ4QEI88MIDfBze+A7wAfgEfABm7aSJN67Acjajmd/MzvxmduUwY1Rp3//z3v3XXn/jzbc2Ot2333n3vfcfPPzgpRK5xOQCCybkqxApwmhKLjTVjLzKJEE8ZOQynDw19strIhUV6bmeZWTEUZLSmGKkQXWWXAVXDzb9bd/3gyBwjRDs7/kg9Pu9naDnBsYEz+bBo60f/vnj779Orx5u/D6MBM45STVmSKlB4Gd6VCCpKWak7A5zRTKEJyghAxBTxIkaFVWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbINdxb1TQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4yNiptZw1RoOvmuqYH+MEujQSBpAn2qDcaJ0VAiOQPqpJgqT41RRlTZ3bL4VUTS2KtKrqRMipgo01PEDIveGKWRyLVXNyKE/hMJYdzl/sp9HAsWfbwW+7+jrgRc5eM8GBUGZshuWKDTKZZENyehuOEZz5mmTS3JGZHX3IqAWCKgRWPu3UoUN9uRCskRI3xUQATeDJolzd4VYcj/d/aactODRkEoxYSVbkNpiJMqtqBTibKYJhb2q/QMRlo0R2IQS6CVEz0W0WcRiRGwMyp4VKmjsm3wvfnh8BjS5AYkowBymuAEchhTfNPcTQJPkXdtLopotGR4h9zJvXsn+ShUgsE58wTcLQzNRgWkozMBZEAwAsqZSbcYVse7CFlOynJhiajKwMcygqP79OT5yQv38NnnR8dH50cnx2fdIfACCdfIQyQnX0hC0rKQSVgW/naw68FVtWeWYNdwvgp/TpOx/oYwJqa3Dj0DrZZ+Ox7iz27R+wZYLzZ4nssSW2VSL62Bn5gyF+D+bp1CtfZb8ZZDUEHrda/V4QXMzG2ZNS/z1YY/YdDGBRYg8NqQM5JRVBaYzyYGAxE/8QL42/dLK5akeFJtvcC6270+LP1PYdnpWfDLMdWLqqAY81qI01xmjKw0DFrrGVBKplhwDndSMZxWYQZwhodm8GrPWllsBqWFribMAle6FqyEUiyoUbUghURpshZ3rm3BJ9XkWvCVkW7L2/TJ8pg3bx3NzBC0pL8cjnUfVbXZcpj3viUf0+uWHZYz0Fr1rDWj+vCsO0TASJvT8sCt+2T1zFgei0mq8Fvus+PD+oI568Inz+K7xr1beLmzHYD8dbB50HPqZ8P50PnIeewEzr5z4HzpnDoXDnYS52fnV+e3zred7zs/dn6qoffvzX0eOY2n88u/KiCQNA==</latexit><latexit 
sha1_base64="sXoBpG6JmbUzgx55nt8lBNfWW3Y=">AAAJ2HicjVZfb+NEEPcd/5rw74575MWiqnRIVmUX2iYPSNVdD6h0tOXaXg+SqFqv184qu15rd900WJZ4QEI88MIDfBze+A7wAfgEfABm7aSJN67Acjajmd/MzvxmduUwY1Rp3//z3v3XXn/jzbc2Ot2333n3vfcfPPzgpRK5xOQCCybkqxApwmhKLjTVjLzKJEE8ZOQynDw19strIhUV6bmeZWTEUZLSmGKkQXWWXAVXDzb9bd/3gyBwjRDs7/kg9Pu9naDnBsYEz+bBo60f/vnj779Orx5u/D6MBM45STVmSKlB4Gd6VCCpKWak7A5zRTKEJyghAxBTxIkaFVWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbINdxb1TQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4yNiptZw1RoOvmuqYH+MEujQSBpAn2qDcaJ0VAiOQPqpJgqT41RRlTZ3bL4VUTS2KtKrqRMipgo01PEDIveGKWRyLVXNyKE/hMJYdzl/sp9HAsWfbwW+7+jrgRc5eM8GBUGZshuWKDTKZZENyehuOEZz5mmTS3JGZHX3IqAWCKgRWPu3UoUN9uRCskRI3xUQATeDJolzd4VYcj/d/aactODRkEoxYSVbkNpiJMqtqBTibKYJhb2q/QMRlo0R2IQS6CVEz0W0WcRiRGwMyp4VKmjsm3wvfnh8BjS5AYkowBymuAEchhTfNPcTQJPkXdtLopotGR4h9zJvXsn+ShUgsE58wTcLQzNRgWkozMBZEAwAsqZSbcYVse7CFlOynJhiajKwMcygqP79OT5yQv38NnnR8dH50cnx2fdIfACCdfIQyQnX0hC0rKQSVgW/naw68FVtWeWYNdwvgp/TpOx/oYwJqa3Dj0DrZZ+Ox7iz27R+wZYLzZ4nssSW2VSL62Bn5gyF+D+bp1CtfZb8ZZDUEHrda/V4QXMzG2ZNS/z1YY/YdDGBRYg8NqQM5JRVBaYzyYGAxE/8QL42/dLK5akeFJtvcC6270+LP1PYdnpWfDLMdWLqqAY81qI01xmjKw0DFrrGVBKplhwDndSMZxWYQZwhodm8GrPWllsBqWFribMAle6FqyEUiyoUbUghURpshZ3rm3BJ9XkWvCVkW7L2/TJ8pg3bx3NzBC0pL8cjnUfVbXZcpj3viUf0+uWHZYz0Fr1rDWj+vCsO0TASJvT8sCt+2T1zFgei0mq8Fvus+PD+oI568Inz+K7xr1beLmzHYD8dbB50HPqZ8P50PnIeewEzr5z4HzpnDoXDnYS52fnV+e3zred7zs/dn6qoffvzX0eOY2n88u/KiCQNA==</latexit><latexit sha1_base64="IKa0cr9W32dKAhXNEcFAYkW1opU=">AAAJ2HicjVbNbuM2ENZu/2L3b7c99iI0CLAFhMBKm8Q+FFjsZtsG2CbpJtlsaxsBRVESYVIUSCqOKgjoreihlx7ax+lz9G06lOzYohW0gkwPZr4ZznwzJBRkjCo9GPzz4OFbb7/z7ntbvf77H3z40cePHn/yWolcYnKJBRPyTYAUYTQll5pqRt5kkiAeMHIVzJ4b+9UNkYqK9EIXGZlyFKc0ohhpUJ3H1/71o+3B7mAw8H3fNYJ/eDAAYTQa7vlD1zcmeLadxXN2/Xjr70kocM5JqjFDSo39QaanJZKaYkaq/iRXJEN4hmIyBjFFnKhpWedauTugCd1ISPil2q216x4l4koVPAAkRzpRts0ou2zjXEfDaUnTLNckxc1GUc5cLVxTuBtSSbBmBQgISwq5ujhBEmEN9LR2SQAuJYnalWCuaeYhxqblbdEylZrOfm5roD/M0mgQSBpDnxqDcWI0kEgWQJ0Uc+WpBGVEVf0di19FJI28uuRayqSIiDI9Rcyw6CUoDUWuvaYRAfSfSAjjrvZX7pNIsPCLjdj/HXUt4DofF/60NDBDdssCnU6xJLo9CeUtz3jONG1rSc6IvOFWBMRiAS1KuHcnUdxuRyokR4zwaQkReDtoFrd7VwYB/9/Za8pND1oFoRQTVrktpSFOqsiCziXKIhpb2O/Tcxhp0R6JcSSBVk50IsKvQxIhYGda8rBWh1XX4HuLw+ExpMktSEYB5LTBMeSQUHzb3k0CT6F3Yy6KcLpieI/cy717L/koUILBOfME3C0MFdMS0tGZADIgGAFlYdItJ/XxLgOWk6paWkKqMvCxjODoPj99efrKPXrxzfHJ8cXx6cl5fwK8QMIN8gjJ2beSkLQqZRxU5WDX3/fgqjowi79vOF+Hv6Rxon8kjIn5ncPQQOtl1I2H+MUd+tAAm8UGL3JZYetMmqUz8DNT5hI82m9SqNdRJ95y8Gtosx50OryCmbkrs+FlsdrwZwzauMQCBF4bck4yiqoS82JmMBDxS8+Hv8NBZcWSFM/qrZdYd3c4gmX0FSx7Qwt+lVC9rAqKMa+FOMtlxshaw6C1ngGlZI4F53AnlZN5HWYMZ3hiBq/xbJTltl9Z6HrCLHCt68BKKMWCGlUHUkiUxhtxF9oOfFxPrgVfG+muvE2fLI9F8zbRzAxBR/qr4dj0UXWbLYdF7zvyMb3u2GE1A51VF50ZNYdn0yEERrqcVgdu0ydrZsbyWE5Sjd9xX5wcNRfMeR8+eZbfNe79wuu9XR/kH/ztp8PFx8+W85nzufPE8Z1D56nznXPmXDrYiZ3fnT+dv3o/9X7p/dr7rYE+fLDw+dRpPb0//gXMf4u+</latexit>
g2<latexit sha1_base64="647aWPtTo779AnQGQnYNZ516kjI=">AAAJ2HicjVZLb+M2ENbu9hG7r30cexEaBNgCQmC5TWIfCgS72bYBtkm6TjbbtY2AoiiZMCkKJBXHFQT0UKDooZce2p/T39Fbf0qHkh1btIJWkOnBzDfDmW+GhIKUUaU7nb/v3X/wzrvvvb/Van/w4Ucff/Lw0ePXSmQSkwssmJBvAqQIowm50FQz8iaVBPGAkctg+tzYL6+JVFQk53qekjFHcUIjipEG1SC+6l493O7sdjod3/ddI/gH+x0Q+v1e1++5vjHBs334ZOfnfxzHObt6tPXXKBQ44yTRmCGlhn4n1eMcSU0xI0V7lCmSIjxFMRmCmCBO1Dgvcy3cHdCEbiQk/BLtltp1jxxxpeY8ACRHeqJsm1E22YaZjnrjnCZppkmCq42ijLlauKZwN6SSYM3mICAsKeTq4gmSCGugp7bLBOBSkqheCeaaph5ibJzfzGumXNPpj3UN9IdZGg0CSWLoU2UwTowGEsk5UCfFTHlqglKiivaOxa8ikkZeWXIppVJERJmeImZY9CYoCUWmvaoRAfSfSAjjrvZX7tNIsPDzjdj/HXUt4Dof5/44NzBDds0CnU6wJLo+CfkNT3nGNK1rScaIvOZWBMRiAS2acO9WorjejkRIjhjh4xwi8HrQNK73Lg8C/r+z15SbHtQKQgkmrHBrSkOcVJEFnUmURjS2sN8lAxhpUR+JYSSBVk70RIRfhSRCwM4452GpDoumwfcWh8NjSJMbkIwCyKmDY8hhQvFNfTcJPIXetbkowvGK4S65k3v3TvJRoASDc+YJuFsYmo9zSEenAsiAYASUc5NuPiqPdx6wjBTF0hJSlYKPZQRH9/npy9NX7tGLr49Pjs+PT08G7RHwAglXyCMkp99IQpIil3FQ5J1df8+Dq2rfLP6e4Xwd/pLGE/0DYUzMbh16Blou/WY8xJ/fog8MsFps8CKXFbbMpFoaAz8zZS7B/b0qhXLtN+ItB7+EVut+o8MrmJnbMiteFqsNf8agjUssQOC1IQOSUlTkmM+nBgMRv/B8+DvoFFYsSfG03HqJdXd7fVj6X8LS7VnwywnVy6qgGPNaiLNMpoysNQxa6xlQQmZYcA53Uj6alWGGcIZHZvAqz0qZb/uFhS4nzAKXugashFIsqFE1IIVESbwRd6FtwMfl5FrwtZFuytv0yfJYNG8TzcwQNKS/Go5NH1W22XJY9L4hH9Prhh1WM9BY9bwxo+rwbDqEwEiT0+rAbfqk1cxYHstJKvE77ouTo+qCGbThk2f5XePeLbzu7vogf+9vH/ac6tlyPnU+c546vnPgHDrfOmfOhYOd2PnN+cP5s/W29VPrl9avFfT+vYXPE6f2tH7/FxcujZs=</latexit><latexit sha1_base64="mQ1KtLtT5frIg5JmOrnmriowHDo=">AAAJ2HicjVZfb+NEEPcd/5rw74575MWiqnRIVhUH2iYPSNVdD6h0tOXSXg+SqFqv184qu15rd900WJZ4QEI88MIDfBze+A7wAfgEfABm7aSJN67Acjajmd/MzvxmduUgZVTpTufPe/dfe/2NN9/aarXffufd995/8PCDl0pkEpMLLJiQrwKkCKMJudBUM/IqlQTxgJHLYPrU2C+viVRUJOd6npIxR3FCI4qRBtUgvupePdju7HY6Hd/3XSP4B/sdEPr9Xtfvub4xwbN9+Gjnh3/++Puvs6uHW7+PQoEzThKNGVJq6HdSPc6R1BQzUrRHmSIpwlMUkyGICeJEjfMy18LdAU3oRkLCL9FuqV33yBFXas4DQHKkJ8q2GWWTbZjpqDfOaZJmmiS42ijKmKuFawp3QyoJ1mwOAsKSQq4uniCJsAZ6artMAC4lieqVYK5p6iHGxvnNvGbKNZ1+V9dAf5il0SCQJIY+VQbjxGggkZwDdVLMlKcmKCWqaO9Y/CoiaeSVJZdSKkVElOkpYoZFb4KSUGTaqxoRQP+JhDDuan/lPo4ECz/eiP3fUdcCrvNx7o9zAzNk1yzQ6QRLouuTkN/wlGdM07qWZIzIa25FQCwW0KIJ924liuvtSITkiBE+ziECrwdN43rv8iDg/zt7TbnpQa0glGDCCremNMRJFVnQmURpRGML+1UygJEW9ZEYRhJo5URPRPhZSCIE7IxzHpbqsGgafG9xODyGNLkBySiAnDo4hhwmFN/Ud5PAU+hdm4siHK8Y7pI7uXfvJB8FSjA4Z56Au4Wh+TiHdHQqgAwIRkA5N+nmo/J45wHLSFEsLSFVKfhYRnB0n54+P33hHj37/Pjk+Pz49GTQHgEvkHCFPEJy+oUkJClyGQdF3tn19zy4qvbN4u8Zztfhz2k80d8QxsTs1qFnoOXSb8ZD/Pkt+sAAq8UGL3JZYctMqqUx8BNT5hLc36tSKNd+I95y8Etote43OryAmbkts+JlsdrwJwzauMQCBF4bMiApRUWO+XxqMBDxE8+Hv4NOYcWSFE/LrZdYd7fXh6X/KSzdngW/nFC9rAqKMa+FOMtkyshaw6C1ngElZIYF53An5aNZGWYIZ3hkBq/yrJT5tl9Y6HLCLHCpa8BKKMWCGlUDUkiUxBtxF9oGfFxOrgVfG+mmvE2fLI9F8zbRzAxBQ/qr4dj0UWWbLYdF7xvyMb1u2GE1A41Vzxszqg7PpkMIjDQ5rQ7cpk9azYzlsZykEr/jPjs5qi6YQRs+eZbfNe7dwsvurg/y1/72Yc+pni3nQ+cj57HjOwfOofOlc+ZcONiJnZ+dX53fWt+2vm/92Pqpgt6/t/B55NSe1i//AjOSkDU=</latexit><latexit 
sha1_base64="mQ1KtLtT5frIg5JmOrnmriowHDo=">AAAJ2HicjVZfb+NEEPcd/5rw74575MWiqnRIVhUH2iYPSNVdD6h0tOXSXg+SqFqv184qu15rd900WJZ4QEI88MIDfBze+A7wAfgEfABm7aSJN67Acjajmd/MzvxmduUgZVTpTufPe/dfe/2NN9/aarXffufd995/8PCDl0pkEpMLLJiQrwKkCKMJudBUM/IqlQTxgJHLYPrU2C+viVRUJOd6npIxR3FCI4qRBtUgvupePdju7HY6Hd/3XSP4B/sdEPr9Xtfvub4xwbN9+Gjnh3/++Puvs6uHW7+PQoEzThKNGVJq6HdSPc6R1BQzUrRHmSIpwlMUkyGICeJEjfMy18LdAU3oRkLCL9FuqV33yBFXas4DQHKkJ8q2GWWTbZjpqDfOaZJmmiS42ijKmKuFawp3QyoJ1mwOAsKSQq4uniCJsAZ6artMAC4lieqVYK5p6iHGxvnNvGbKNZ1+V9dAf5il0SCQJIY+VQbjxGggkZwDdVLMlKcmKCWqaO9Y/CoiaeSVJZdSKkVElOkpYoZFb4KSUGTaqxoRQP+JhDDuan/lPo4ECz/eiP3fUdcCrvNx7o9zAzNk1yzQ6QRLouuTkN/wlGdM07qWZIzIa25FQCwW0KIJ924liuvtSITkiBE+ziECrwdN43rv8iDg/zt7TbnpQa0glGDCCremNMRJFVnQmURpRGML+1UygJEW9ZEYRhJo5URPRPhZSCIE7IxzHpbqsGgafG9xODyGNLkBySiAnDo4hhwmFN/Ud5PAU+hdm4siHK8Y7pI7uXfvJB8FSjA4Z56Au4Wh+TiHdHQqgAwIRkA5N+nmo/J45wHLSFEsLSFVKfhYRnB0n54+P33hHj37/Pjk+Pz49GTQHgEvkHCFPEJy+oUkJClyGQdF3tn19zy4qvbN4u8Zztfhz2k80d8QxsTs1qFnoOXSb8ZD/Pkt+sAAq8UGL3JZYctMqqUx8BNT5hLc36tSKNd+I95y8Etote43OryAmbkts+JlsdrwJwzauMQCBF4bMiApRUWO+XxqMBDxE8+Hv4NOYcWSFE/LrZdYd7fXh6X/KSzdngW/nFC9rAqKMa+FOMtkyshaw6C1ngElZIYF53An5aNZGWYIZ3hkBq/yrJT5tl9Y6HLCLHCpa8BKKMWCGlUDUkiUxBtxF9oGfFxOrgVfG+mmvE2fLI9F8zbRzAxBQ/qr4dj0UWWbLYdF7xvyMb1u2GE1A41Vzxszqg7PpkMIjDQ5rQ7cpk9azYzlsZykEr/jPjs5qi6YQRs+eZbfNe7dwsvurg/y1/72Yc+pni3nQ+cj57HjOwfOofOlc+ZcONiJnZ+dX53fWt+2vm/92Pqpgt6/t/B55NSe1i//AjOSkDU=</latexit><latexit sha1_base64="z+p23dkdGhVa20Q1jNHE5rneJ2s=">AAAJ2HicjVbNbuM2ENZu/2L3b7c99iI0CLAFhMBym8Q+FFjsZtsG2CbpOtlsaxsBRVESYVIUSCqOKwjoreihlx7ax+lz9G06lOzYohW0gkwPZr4ZznwzJBRkjCrd6/3z4OFbb7/z7ns7ne77H3z40cePHn/yWolcYnKJBRPyTYAUYTQll5pqRt5kkiAeMHIVzJ4b+9UNkYqK9EIvMjLlKE5pRDHSoBrF1/3rR7u9/V6v5/u+awT/6LAHwnA46PsD1zcmeHad5XN+/Xjn70kocM5JqjFDSo39XqanBZKaYkbK7iRXJEN4hmIyBjFFnKhpUeVaunugCd1ISPil2q20mx4F4koteABIjnSibJtRttnGuY4G04KmWa5JiuuNopy5WrimcDekkmDNFiAgLCnk6uIESYQ10NPYJQG4lCRqVoK5ppmHGJsWt4uGqdB09nNTA/1hlkaDQNIY+lQbjBOjgURyAdRJMVeeSlBGVNnds/hVRNLIq0qupEyKiCjTU8QMi16C0lDk2qsbEUD/iYQw7np/5T6JBAu/2Ir931E3Am7yceFPCwMzZDcs0OkUS6Kbk1Dc8oznTNOmluSMyBtuRUAsFtCihHt3EsXNdqRCcsQInxYQgTeDZnGzd0UQ8P+dvabc9KBREEoxYaXbUBripIos6FyiLKKxhf0+HcFIi+ZIjCMJtHKiExF+HZIIATvTgoeVOizbBt9bHg6PIU1uQTIKIKcJjiGHhOLb5m4SeAq9G3NRhNM1w31yL/fuveSjQAkG58wTcLcwtJgWkI7OBJABwQgoFybdYlId7yJgOSnLlSWkKgMfywiO7vOzl2ev3OMX35ycnlycnJ2OuhPgBRKukcdIzr6VhKRlIeOgLHr7/oEHV9WhWfwDw/km/CWNE/0jYUzM7xwGBlotw3Y8xF/coY8MsF5s8DKXNbbKpF5aAz8zZa7Aw4M6hWodtuItB7+C1uthq8MrmJm7MmtelqsNf8agjSssQOC1ISOSUVQWmC9mBgMRv/R8+DvqlVYsSfGs2nqFdfcHQ1iGX8HSH1jwq4TqVVVQjHktxHkuM0Y2Ggat9QwoJXMsOIc7qZjMqzBjOMMTM3i1Z60sdv3SQlcTZoErXQtWQikW1KhakEKiNN6Ku9S24ONqci34xki35W36ZHksm7eNZmYIWtJfD8e2j6rabDkse9+Sj+l1yw7rGWitetGaUX14th1CYKTNaX3gtn2yemYsj9UkVfg998XpcX3BjLrwybP6rnHvF173932Qf/B3nw6WHz87zmfO584Tx3eOnKfOd865c+lgJ3Z+d/50/ur81Pml82vntxr68MHS51On8XT++BfV8Yu/</latexit>
z1<latexit sha1_base64="V7g08aoOjvxavyziU7UXAQe9o4U=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX/tWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8CuwONdQ==</latexit><latexit sha1_base64="xAosoZfITzQxTFF+VKA4fyGVA3w=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ2yv/6uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8C0OCOcA==</latexit><latexit 
sha1_base64="xAosoZfITzQxTFF+VKA4fyGVA3w=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ2yv/6uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8C0OCOcA==</latexit><latexit sha1_base64="yCmzR+UYSiB70+ceKvNWi19m/ko=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d+VfPdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FgCqL0Q==</latexit>
z2<latexit sha1_base64="TZvAKWPUDTKESOwa2HzSiORibKQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY3pLYDwOCNt0WoEuyOmm62kZAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFKSMKt3p/HPv/oP33v/gw61W+6OPH37y6aPHT14rkUlMLrBgQr4JkCKMJuRCU83Im1QSxANGLoPpc2O/vCZSUZGc63lKxhzFCY0oRhpUg3dX3atH253dTqfj+75rBP9gvwNCv9/r+j3XNyZ4tg8f/p25juOcXT3e+msUCpxxkmjMkFJDv5PqcY6kppiRoj3KFEkRnqKYDEFMECdqnJe5Fu4OaEI3EhJ+iXZL7bpHjrhScx4AkiM9UbbNKJtsw0xHvXFOkzTTJMHVRlHGXC1cU7gbUkmwZnMQEJYUcnXxBEmENdBT22UCcClJVK8Ec01TDzE2zm/mNVOu6fRdXQP9YZZGg0CSGPpUGYwTo4FEcg7USTFTnpqglKiivWPxq4ikkVeWXEqpFBFRpqeIGRa9CUpCkWmvakQA/ScSwrir/ZX7NBIs/HIj9n9HXQu4zse5P84NzJBds0CnEyyJrk9CfsNTnjFN61qSMSKvuRUBsVhAiybcu5UorrcjEZIjRvg4hwi8HjSN673Lg4D/7+w15aYHtYJQggkr3JrSECdVZEFnEqURjS3sD8kARlrUR2IYSaCVEz0R4TchiRCwM855WKrDomnwvcXh8BjS5AYkowBy6uAYcphQfFPfTQJPoXdtLopwvGK4S+7k3r2TfBQoweCceQLuFobm4xzS0akAMiAYAeXcpJuPyuOdBywjRbG0hFSl4GMZwdF9fvry9JV79OLb45Pj8+PTk0F7BLxAwhXyCMnpd5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/pPFE/0QYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAz0yZS3B/r0qhXPuNeMvBL6HVut/o8Apm5rbMipfFasOfMWjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I774uSoumAGbfjkWX7XuHcLr7u7Psg/+tuHPad6tpzPnS+cp47vHDiHzvfOmXPhYCd2fnP+cP5svW393Pql9WsFvX9v4fOZU3tav/8LxHWNdg==</latexit><latexit sha1_base64="81/ItuAdF3EDwYZiH30cbjCQeeo=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEALLWxL7YUDQpt0CdElWJ01X2wgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoSClFGlO51/7t1/59333v9gq9X+8KMHH3/y8NGnr5TIJCYXWDAhXwdIEUYTcqGpZuR1KgniASOXwfSpsV9eE6moSM71PCVjjuKERhQjDarB26vu1cPtzm6n0/F93zWCf7DfAaHf73X9nusbEzzbhw/+znaet/88u3q09dcoFDjjJNGYIaWGfifV4xxJTTEjRXuUKZIiPEUxGYKYIE7UOC9zLdwd0IRuJCT8Eu2W2nWPHHGl5jwAJEd6omybUTbZhpmOeuOcJmmmSYKrjaKMuVq4pnA3pJJgzeYgICwp5OriCZIIa6CntssE4FKSqF4J5pqmHmJsnN/Ma6Zc0+nbugb6wyyNBoEkMfSpMhgnRgOJ5Byok2KmPDVBKVFFe8fiVxFJI68suZRSKSKiTE8RMyx6E5SEItNe1YgA+k8khHFX+yv3cSRY+OVG7P+OuhZwnY9zf5wbmCG7ZoFOJ1gSXZ+E/IanPGOa1rUkY0RecysCYrGAFk24dytRXG9HIiRHjPBxDhF4PWga13uXBwH/39lryk0PagWhBBNWuDWlIU6qyILOJEojGlvY75MBjLSoj8QwkkArJ3oiwm9CEiFgZ5zzsFSHRdPge4vD4TGkyQ1IRgHk1MEx5DCh+Ka+mwSeQu/aXBTheMVwl9zJvXsn+ShQgsE58wTcLQzNxzmko1MBZEAwAsq5STcflcc7D1hGimJpCalKwccygqP79PTF6Uv36Nnz45Pj8+PTk0F7BLxAwhXyCMnpt5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/oPFE/0gYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAT0yZS3B/r0qhXPuNeMvBL6HVut/o8BJm5rbMipfFasOfMGjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I777OSoumAGbfjkWX7XuHcLr7q7Psg/+NuHPad6tpzPnS+cx47vHDiHznfOmXPhYCd2fnV+d/5ovWn91Pq59UsFvX9v4fOZU3tav/0L2lKOcQ==</latexit><latexit 
sha1_base64="81/ItuAdF3EDwYZiH30cbjCQeeo=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEALLWxL7YUDQpt0CdElWJ01X2wgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoSClFGlO51/7t1/59333v9gq9X+8KMHH3/y8NGnr5TIJCYXWDAhXwdIEUYTcqGpZuR1KgniASOXwfSpsV9eE6moSM71PCVjjuKERhQjDarB26vu1cPtzm6n0/F93zWCf7DfAaHf73X9nusbEzzbhw/+znaet/88u3q09dcoFDjjJNGYIaWGfifV4xxJTTEjRXuUKZIiPEUxGYKYIE7UOC9zLdwd0IRuJCT8Eu2W2nWPHHGl5jwAJEd6omybUTbZhpmOeuOcJmmmSYKrjaKMuVq4pnA3pJJgzeYgICwp5OriCZIIa6CntssE4FKSqF4J5pqmHmJsnN/Ma6Zc0+nbugb6wyyNBoEkMfSpMhgnRgOJ5Byok2KmPDVBKVFFe8fiVxFJI68suZRSKSKiTE8RMyx6E5SEItNe1YgA+k8khHFX+yv3cSRY+OVG7P+OuhZwnY9zf5wbmCG7ZoFOJ1gSXZ+E/IanPGOa1rUkY0RecysCYrGAFk24dytRXG9HIiRHjPBxDhF4PWga13uXBwH/39lryk0PagWhBBNWuDWlIU6qyILOJEojGlvY75MBjLSoj8QwkkArJ3oiwm9CEiFgZ5zzsFSHRdPge4vD4TGkyQ1IRgHk1MEx5DCh+Ka+mwSeQu/aXBTheMVwl9zJvXsn+ShQgsE58wTcLQzNxzmko1MBZEAwAsq5STcflcc7D1hGimJpCalKwccygqP79PTF6Uv36Nnz45Pj8+PTk0F7BLxAwhXyCMnpt5KQpMhlHBR5Z9ff8+Cq2jeLv2c4X4e/oPFE/0gYE7Nbh56Blku/GQ/x57foAwOsFhu8yGWFLTOplsbAT0yZS3B/r0qhXPuNeMvBL6HVut/o8BJm5rbMipfFasOfMGjjEgsQeG3IgKQUFTnm86nBQMSvPB/+DjqFFUtSPC23XmLd3V4flv7XsHR7FvxyQvWyKijGvBbiLJMpI2sNg9Z6BpSQGRacw52Uj2ZlmCGc4ZEZvMqzUubbfmGhywmzwKWuASuhFAtqVA1IIVESb8RdaBvwcTm5FnxtpJvyNn2yPBbN20QzMwQN6a+GY9NHlW22HBa9b8jH9Lphh9UMNFY9b8yoOjybDiEw0uS0OnCbPmk1M5bHcpJK/I777OSoumAGbfjkWX7XuHcLr7q7Psg/+NuHPad6tpzPnS+cx47vHDiHznfOmXPhYCd2fnV+d/5ovWn91Pq59UsFvX9v4fOZU3tav/0L2lKOcQ==</latexit><latexit sha1_base64="g8QBduw5zRf6C5jqumitT8HWfDk=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEALLWxL7YUDRptsCdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQXX27qp/9XC7t9vr9Xzfd43gH+z3QBgOB31/4PrGBM+2s3hOrx5t/T2OBM45STVmSKmR38t0UCCpKWak7I5zRTKEpyghIxBTxIkKiirX0t0BTeTGQsIv1W6lXfcoEFdqzkNAcqQnyrYZZZttlOt4EBQ0zXJNUlxvFOfM1cI1hbsRlQRrNgcBYUkhVxdPkERYAz2NXSYAl5LEzUow1zTzEGNBcTNvmApNp++aGugPszQaBJIm0KfaYJwYDSWSc6BOipny1ARlRJXdHYtfRSSNvarkSsqkiIkyPUXMsOhNUBqJXHt1I0LoP5EQxl3tr9wnsWDRVxux/zvqWsB1Ps79oDAwQ3bDAp1OsSS6OQnFDc94zjRtaknOiLzmVgTEEgEtmnDvVqK42Y5USI4Y4UEBEXgzaJY0e1eEIf/f2WvKTQ8aBaEUE1a6DaUhTqrYgs4kymKaWNgf0zMYadEciVEsgVZO9ERE30YkRsBOUPCoUkdl2+B7i8PhMaTJDUhGAeQ0wQnkMKH4prmbBJ4i79pcFFGwYrhP7uTevZN8FCrB4Jx5Au4WhuZBAenoTAAZEIyAcm7SLcbV8S5ClpOyXFoiqjLwsYzg6D4/eXnyyj188d3R8dH50cnxWXcMvEDCNfIQyen3kpC0LGQSlkVv19/z4KraN4u/Zzhfh7+kyUT/TBgTs1uHgYFWy7AdD/Hnt+gDA6wXG7zIZYWtMqmX1sDPTJlL8HCvTqFah614y8GvoPW63+rwCmbmtsyal8Vqw58xaOMSCxB4bcgZySgqC8znU4OBiF97Pvwd9EorlqR4Wm29xLq7gyEsw29g6Q8s+OWE6mVVUIx5LcRpLjNG1hoGrfUMKCUzLDiHO6kYz6owIzjDYzN4tWetLLb90kJXE2aBK10LVkIpFtSoWpBCojTZiLvQtuCTanIt+NpIt+Vt+mR5LJq3iWZmCFrSXw3Hpo+q2mw5LHrfko/pdcsOqxlorXremlF9eDYdImCkzWl14DZ9snpmLI/lJFX4HffF8WF9wZx14ZNn+V3j3i287u/6IP/kbz8dLD5+tpwvnC+dJ47vHDhPnR+cU+fCwU7i/O786fzVedv5pfNr57caev/ewudzp/F0/vgXiZyL0g==</latexit>
zn<latexit sha1_base64="4ZFc4Osjg/3AMiZMY0aOID5dIWQ=">AAAJ2HicjVZfb9s2EFfb/Ym9dWu7x70ICwJ0gBBY2ZLYDwOCNt0WoEuyJmm62kJAUZRMmBQFkorjCgL2NuxhL3vYPs6AfYt9mx0lO7ZoBZsg04e73x3vfnckFGaMKt3r/XPv/oP33v/gw41O96OPH37y6aPHT14rkUtMLrBgQr4JkSKMpuRCU83Im0wSxENGLsPJc2O/vCZSUZGe61lGAo6SlMYUIw2qs3dX6dWjzd52r9fzfd81gr+/1wNhMOjv+H3XNyZ4Ng8e/p27juOcXj3e+GsUCZxzkmrMkFJDv5fpoEBSU8xI2R3limQIT1BChiCmiBMVFFWupbsFmsiNhYRfqt1Ku+pRIK7UjIeA5EiPlW0zyjbbMNdxPyhomuWapLjeKM6Zq4VrCncjKgnWbAYCwpJCri4eI4mwBnoau4wBLiWJm5VgrmnmIcaC4mbWMBWaTt41NdAfZmk0CCRNoE+1wTgxGkokZ0CdFFPlqTHKiCq7Wxa/ikgae1XJlZRJERNleoqYYdEbozQSufbqRoTQfyIhjLvcX7lPY8GiL9di/3fUlYCrfJz7QWFghuyGBTqdYkl0cxKKG57xnGna1JKcEXnNrQiIJQJaNOberURxsx2pkBwxwoMCIvBm0Cxp9q4IQ/6/s9eUmx40CkIpJqx0G0pDnFSxBZ1KlMU0sbA/pGcw0qI5EsNYAq2c6LGIvolIjICdoOBRpY7KtsH35ofDY0iTG5CMAshpghPIYUzxTXM3CTxF3rW5KKJgyfAOuZN7907yUagEg3PmCbhbGJoFBaSjMwFkQDACyplJtxhVx7sIWU7KcmGJqMrAxzKCo/v85OXJK/fwxbdHx0fnRyfHZ90R8AIJ18hDJCffSULSspBJWBa9bX/Xg6tqzyz+ruF8Ff6SJmP9E2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/M2UuwIPdOoVqHbTiLQe/gtbrXqvDK5iZ2zJrXuarDX/GoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7kvjg/rC+asC588i+8a927h9c62D/KP/uZB36mfDedz5wvnqeM7+86B871z6lw42Emc35w/nD87bzs/d37p/FpD79+b+3zmNJ7O7/8C+0uNsg==</latexit><latexit sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit 
sha1_base64="ApJtwQVj/qeEN9CA1NtMsNSc5HI=">AAAJ2HicjVZfb9s2EFe7f7G3bu32uBdhQYAOEAIrWxL7YUDQpt0CdEnWJE1XWwgoipIJk6JAUnFcQcDehj3sZQ/bN9jXGLBvsW+zo2THFq1gE2T6cPe7493vjoTCjFGle71/7t1/59333v9go9P98KMHH3/y8NGnr5TIJSYXWDAhX4dIEUZTcqGpZuR1JgniISOX4eSpsV9eE6moSM/1LCMBR0lKY4qRBtXZ26v06uFmb7vX6/m+7xrB39/rgTAY9Hf8vusbEzybBw/+zreed/88vXq08dcoEjjnJNWYIaWGfi/TQYGkppiRsjvKFckQnqCEDEFMEScqKKpcS3cLNJEbCwm/VLuVdtWjQFypGQ8ByZEeK9tmlG22Ya7jflDQNMs1SXG9UZwzVwvXFO5GVBKs2QwEhCWFXF08RhJhDfQ0dhkDXEoSNyvBXNPMQ4wFxc2sYSo0nbxtaqA/zNJoEEiaQJ9qg3FiNJRIzoA6KabKU2OUEVV2tyx+FZE09qqSKymTIibK9BQxw6I3Rmkkcu3VjQih/0RCGHe5v3Ifx4JFX67F/u+oKwFX+Tj3g8LADNkNC3Q6xZLo5iQUNzzjOdO0qSU5I/KaWxEQSwS0aMy9W4niZjtSITlihAcFRODNoFnS7F0Rhvx/Z68pNz1oFIRSTFjpNpSGOKliCzqVKItpYmG/T89gpEVzJIaxBFo50WMRfRORGAE7QcGjSh2VbYPvzQ+Hx5AmNyAZBZDTBCeQw5jim+ZuEniKvGtzUUTBkuEdcif37p3ko1AJBufME3C3MDQLCkhHZwLIgGAElDOTbjGqjncRspyU5cISUZWBj2UER/fpyYuTl+7hs+dHx0fnRyfHZ90R8AIJ18hDJCffSkLSspBJWBa9bX/Xg6tqzyz+ruF8Ff6CJmP9I2FMTG8d+gZaLYN2PMSf3aL3DbBebPA8lyW2yqReWgM/MWUuwIPdOoVqHbTiLQe/gtbrXqvDS5iZ2zJrXuarDX/CoI0LLEDgtSFnJKOoLDCfTQwGIn7l+fC33yutWJLiSbX1Autu9wewDL6GZadvwS/HVC+qgmLMayFOc5kxstIwaK1nQCmZYsE53EnFaFqFGcIZHpnBqz1rZbHplxa6mjALXOlasBJKsaBG1YIUEqXJWty5tgWfVJNrwVdGui1v0yfLY968dTQzQ9CS/nI41n1U1WbLYd77lnxMr1t2WM5Aa9Wz1ozqw7PuEAEjbU7LA7fuk9UzY3ksJqnCb7nPjg/rC+asC588i+8a927h1c62D/IP/uZB36mfDedz5wvnseM7+86B851z6lw42EmcX53fnT86bzo/dX7u/FJD79+b+3zmNJ7Ob/8CETeOrQ==</latexit><latexit sha1_base64="kRKiSD+HgKged5aOJIkSLIO3sLs=">AAAJ2HicjVZfb9s2EFfb/Ym9f233uBdhQYAOEAIrWxL7YUDRptsCdElWJ01XWwgoipIJk6JAUnFcQcDehj3sZQ/bx9nn2LfZUbJji1awCTJ9uPvd8e53R0JhxqjSvd4/9+4/eO/9Dz7c6nQ/+viTTz97+OjxayVyickFFkzINyFShNGUXGiqGXmTSYJ4yMhlOH1u7JfXRCoq0nM9z0jAUZLSmGKkQTV8d5VePdzu7fZ6Pd/3XSP4hwc9EAaD/p7fd31jgmfbWTxnV4+2/h5HAuecpBozpNTI72U6KJDUFDNSdse5IhnCU5SQEYgp4kQFRZVr6e6AJnJjIeGXarfSrnsUiCs15yEgOdITZduMss02ynXcDwqaZrkmKa43inPmauGawt2ISoI1m4OAsKSQq4snSCKsgZ7GLhOAS0niZiWYa5p5iLGguJk3TIWm03dNDfSHWRoNAkkT6FNtME6MhhLJOVAnxUx5aoIyosrujsWvIpLGXlVyJWVSxESZniJmWPQmKI1Err26ESH0n0gI4672V+6TWLDoq43Y/x11LeA6H+d+UBiYIbthgU6nWBLdnITihmc8Z5o2tSRnRF5zKwJiiYAWTbh3K1HcbEcqJEeM8KCACLwZNEuavSvCkP/v7DXlpgeNglCKCSvdhtIQJ1VsQWcSZTFNLOyP6RBGWjRHYhRLoJUTPRHRtxGJEbATFDyq1FHZNvje4nB4DGlyA5JRADlNcAI5TCi+ae4mgafIuzYXRRSsGN4jd3Lv3kk+CpVgcM48AXcLQ/OggHR0JoAMCEZAOTfpFuPqeBchy0lZLi0RVRn4WEZwdJ+fvjx95R69+O745Pj8+PRk2B0DL5BwjTxCcvq9JCQtC5mEZdHb9fc9uKoOzOLvG87X4S9pMtE/E8bE7Nahb6DVMmjHQ/z5LfrQAOvFBi9yWWGrTOqlNfAzU+YSPNivU6jWQSvecvAraL0etDq8gpm5LbPmZbHa8GcM2rjEAgReGzIkGUVlgfl8ajAQ8WvPh7/DXmnFkhRPq62XWHe3P4Bl8A0se30LfjmhelkVFGNeC3GWy4yRtYZBaz0DSskMC87hTirGsyrMCM7w2Axe7Vkri22/tNDVhFngSteClVCKBTWqFqSQKE024i60LfikmlwLvjbSbXmbPlkei+ZtopkZgpb0V8Ox6aOqNlsOi9635GN63bLDagZaq563ZlQfnk2HCBhpc1oduE2frJ4Zy2M5SRV+x31xclRfMMMufPIsv2vcu4XXe7s+yD/520/7i4+fLecL50vnieM7h85T5wfnzLlwsJM4vzt/On913nZ+6fza+a2G3r+38PncaTydP/4FwHKMDg==</latexit>
53 / 174
![Page 63: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/63.jpg)
Rademacher Averages
Then, conditionally on Z_1^n,

  E_ε ‖R̂n‖_G = E_ε sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i g(Z_i) |

is a data-dependent quantity (hence the “hat”) that measures the complexity of G projected onto the data.

Also define the Rademacher process Rn (without the “hat”),

  Rn(g) = (1/n) ∑_{i=1}^n ε_i g(Z_i),

and the Rademacher complexity of G as

  E‖Rn‖_G = E_{Z_1^n} E_ε ‖R̂n‖_{G(Z_1^n)}.
54 / 174
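As a concrete illustration (not from the slides), the following minimal numpy sketch estimates the empirical Rademacher average E_ε‖R̂n‖_G by Monte Carlo for a small class G of threshold functions restricted to the data; the class, the sample size, and the Monte Carlo budget are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(G_on_data, n_mc=2000, rng=rng):
    """Monte Carlo estimate of E_eps sup_{g in G} |1/n sum_i eps_i g(z_i)|.

    G_on_data: array of shape (|G|, n), the class projected onto the data,
               i.e. row j is (g_j(z_1), ..., g_j(z_n)).
    """
    num_g, n = G_on_data.shape
    sups = np.empty(n_mc)
    for t in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        sups[t] = np.abs(G_on_data @ eps / n).max()  # sup over the (finite) class
    return sups.mean()

# Example: thresholds g_a(z) = 1{z <= a} projected onto n data points.
n = 100
z = np.sort(rng.uniform(size=n))
thresholds = np.concatenate(([-np.inf], z))          # gives all n+1 distinct restrictions
G_on_data = (z[None, :] <= thresholds[:, None]).astype(float)

print("estimated E_eps ||R_n||_G :", empirical_rademacher(G_on_data))
print("finite-class bound        :", np.sqrt(2 * np.log(2 * len(G_on_data)) / n))
```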
![Page 64: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/64.jpg)
Uniform Laws and Rademacher Complexity
We’ll look at a proof of a uniform law of large numbers that involves two steps:

1. Symmetrization, which bounds E‖P − Pn‖_G in terms of the Rademacher complexity of G, E‖Rn‖_G.

2. Concentration: both ‖P − Pn‖_G and ‖Rn‖_G are concentrated about their expectations.
55 / 174
![Page 65: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/65.jpg)
Uniform Laws and Rademacher Complexity
Theorem. For any G,

  E‖P − Pn‖_G ≤ 2 E‖Rn‖_G.

If G ⊂ [0, 1]^Z, then

  (1/2) E‖Rn‖_G − √(log 2 / (2n)) ≤ E‖P − Pn‖_G ≤ 2 E‖Rn‖_G,

and, with probability at least 1 − 2 exp(−2ε²n),

  E‖P − Pn‖_G − ε ≤ ‖P − Pn‖_G ≤ E‖P − Pn‖_G + ε.

Thus, E‖Rn‖_G → 0 iff ‖P − Pn‖_G → 0 almost surely.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.
56 / 174
![Page 66: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/66.jpg)
Uniform Laws and Rademacher Complexity
The first step is to symmetrize, replacing Pg by P′n g = (1/n) ∑_{i=1}^n g(Z′_i), where Z′_1, . . . , Z′_n is an independent copy of the sample.

In particular, we have

  E‖P − Pn‖_G = E sup_{g∈G} | E[ (1/n) ∑_{i=1}^n (g(Z′_i) − g(Z_i)) | Z_1^n ] |

              ≤ E E[ sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z′_i) − g(Z_i)) |  | Z_1^n ]

              = E sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z′_i) − g(Z_i)) | = E‖P′n − Pn‖_G.

(The inequality is Jensen’s: the supremum of an absolute value is a convex function of the conditional expectation inside.)
57 / 174
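To see the symmetrization bound in action, here is a hedged simulation (again not part of the slides) for the class of threshold indicators with Z uniform on [0, 1], where Pg_a = a is known in closed form; restricting to a grid of thresholds and the chosen sample sizes are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_trials = 200, 500
grid = np.linspace(0.0, 1.0, 201)       # thresholds a defining g_a(z) = 1{z <= a}

sup_dev, sup_rad = [], []
for _ in range(n_trials):
    z = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)
    G_on_data = (z[None, :] <= grid[:, None]).astype(float)    # shape (|grid|, n)
    Pn = G_on_data.mean(axis=1)          # empirical means P_n g_a
    P = grid                             # true means: P g_a = a for Z ~ U[0,1]
    sup_dev.append(np.abs(P - Pn).max())               # ||P - P_n||_G over the grid
    sup_rad.append(np.abs(G_on_data @ eps / n).max())  # ||R_n||_G over the grid

print("E||P - P_n||_G  ~", np.mean(sup_dev))
print("2 E||R_n||_G    ~", 2 * np.mean(sup_rad))       # symmetrization upper bound
```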
![Page 69: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/69.jpg)
Uniform Laws and Rademacher Complexity
Another symmetrization: for any ε_i ∈ {±1},

  E‖P′n − Pn‖_G = E sup_{g∈G} | (1/n) ∑_{i=1}^n (g(Z′_i) − g(Z_i)) |

                = E sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i (g(Z′_i) − g(Z_i)) |.

This follows from the fact that Z_i and Z′_i are i.i.d., so the distribution of the supremum is unchanged when we swap them, and in particular the expectation of the supremum is unchanged. Since this is true for any fixed ε_i, we can take the expectation over any random choice of the ε_i. We’ll pick them independently and uniformly.
58 / 174
![Page 70: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/70.jpg)
Uniform Laws and Rademacher Complexity
  E sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i (g(Z′_i) − g(Z_i)) |

    ≤ E sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i g(Z′_i) | + E sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i g(Z_i) |

    = 2 E sup_{g∈G} | (1/n) ∑_{i=1}^n ε_i g(Z_i) |    (the Rademacher complexity)

    = 2 E‖Rn‖_G,

where Rn is the Rademacher process Rn(g) = (1/n) ∑_{i=1}^n ε_i g(Z_i).
59 / 174
![Page 71: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/71.jpg)
Uniform Laws and Rademacher Complexity
The second inequality (desymmetrization) follows from:

  E‖Rn‖_G ≤ E ‖ (1/n) ∑_{i=1}^n ε_i (g(Z_i) − E g(Z_i)) ‖_G + E ‖ (1/n) ∑_{i=1}^n ε_i E g(Z_i) ‖_G

          ≤ E ‖ (1/n) ∑_{i=1}^n ε_i (g(Z_i) − g(Z′_i)) ‖_G + ‖P‖_G · E | (1/n) ∑_{i=1}^n ε_i |

          = E ‖ (1/n) ∑_{i=1}^n (g(Z_i) − E g(Z_i) + E g(Z′_i) − g(Z′_i)) ‖_G + ‖P‖_G · E | (1/n) ∑_{i=1}^n ε_i |

          ≤ 2 E‖Pn − P‖_G + √(2 log 2 / n).
60 / 174
![Page 72: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/72.jpg)
Uniform Laws and Rademacher Complexity
Theorem. For any G,

  E‖P − Pn‖_G ≤ 2 E‖Rn‖_G.

If G ⊂ [0, 1]^Z, then

  (1/2) E‖Rn‖_G − √(log 2 / (2n)) ≤ E‖P − Pn‖_G ≤ 2 E‖Rn‖_G,

and, with probability at least 1 − 2 exp(−2ε²n),

  E‖P − Pn‖_G − ε ≤ ‖P − Pn‖_G ≤ E‖P − Pn‖_G + ε.

Thus, E‖Rn‖_G → 0 iff ‖P − Pn‖_G → 0 almost surely.

That is, the supremum of the empirical process P − Pn is concentrated about its expectation, and its expectation is about the same as the expected sup of the Rademacher process Rn.
61 / 174
![Page 73: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/73.jpg)
Back to the prediction problem
Recall: for the prediction problem, we have G = ℓ ∘ F for some loss function ℓ and a set of functions F.

We found that E‖P − Pn‖_{ℓ∘F} can be related to E‖Rn‖_{ℓ∘F}. Question: can we get E‖Rn‖_F instead?

For classification, this is easy. Suppose F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = I{y ≠ y′} = (1 − yy′)/2. Then

  E‖Rn‖_{ℓ∘F} = E sup_{f∈F} | (1/n) ∑_{i=1}^n ε_i ℓ(y_i, f(x_i)) |

              = (1/2) E sup_{f∈F} | (1/n) ∑_{i=1}^n ε_i + (1/n) ∑_{i=1}^n (−ε_i y_i) f(x_i) |

              ≤ O(1/√n) + (1/2) E‖Rn‖_F.

Other loss functions?
62 / 174
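The algebra behind this classification-loss reduction is easy to sanity-check; the short sketch below (an illustration, not from the slides) verifies the identity I{y ≠ y′} = (1 − yy′)/2 on {±1} labels and checks numerically that the residual term E|(1/n) ∑ ε_i| is of order 1/√n.

```python
import numpy as np

rng = np.random.default_rng(2)

# The identity I{y != y'} = (1 - y*y')/2 for labels in {-1, +1}.
for y in (-1, 1):
    for yp in (-1, 1):
        assert float(y != yp) == (1 - y * yp) / 2

# The residual term in the bound above, E|1/n sum_i eps_i|, is O(1/sqrt(n)).
n, n_mc = 400, 5000
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
print("E|1/n sum eps_i| ~", np.abs(eps.mean(axis=1)).mean(), "vs 1/sqrt(n) =", n ** -0.5)
```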
![Page 74: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/74.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
63 / 174
![Page 75: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/75.jpg)
Rademacher Complexity: Structural Results
Theorem.

Subsets: F ⊆ G implies ‖Rn‖_F ≤ ‖Rn‖_G.

Scaling: ‖Rn‖_{cF} = |c| ‖Rn‖_F.

Plus constant: for |g(X)| ≤ 1, |E‖Rn‖_{F+g} − E‖Rn‖_F| ≤ √(2 log 2 / n).

Convex hull: ‖Rn‖_{co F} = ‖Rn‖_F, where co F is the convex hull of F.

Contraction: if ℓ : R × R → [−1, 1] has ŷ ↦ ℓ(ŷ, y) 1-Lipschitz for all y, then for ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y)}, E‖Rn‖_{ℓ∘F} ≤ 2 E‖Rn‖_F + c/√n.
64 / 174
![Page 76: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/76.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
65 / 174
![Page 77: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/77.jpg)
Deep Networks
Deep compositions of nonlinear functions

  h = h_m ∘ h_{m−1} ∘ · · · ∘ h_1,

e.g., h_i : x ↦ σ(W_i x) or h_i : x ↦ r(W_i x), where

  σ(v)_i = 1/(1 + exp(−v_i)),    r(v)_i = max{0, v_i}.
66 / 174
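As a minimal sketch of the objects just defined (with illustrative layer sizes, not from the slides), the following numpy code composes h = h_m ∘ · · · ∘ h_1 with either the sigmoid or the ReLU nonlinearity applied componentwise.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    return np.maximum(0.0, v)

def network(x, weights, phi):
    """Compute h(x) = h_m(h_{m-1}(... h_1(x))), with h_i(u) = phi(W_i u)."""
    h = x
    for W in weights:
        h = phi(W @ h)
    return h

# A small random 3-layer example on R^4 (illustrative dimensions only).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),
           rng.standard_normal((5, 5)),
           rng.standard_normal((1, 5))]
x = rng.standard_normal(4)
print("sigmoid net:", network(x, weights, sigmoid))
print("ReLU net   :", network(x, weights, relu))
```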
![Page 82: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/82.jpg)
Rademacher Complexity of Sigmoid Networks
Theorem. Consider the following class F_{B,d} of two-layer neural networks on R^d:

  F_{B,d} = { x ↦ ∑_{i=1}^k w_i σ(v_i^T x) : ‖w‖_1 ≤ B, ‖v_i‖_1 ≤ B, k ≥ 1 },

where B > 0 and the nonlinear function σ : R → R satisfies the Lipschitz condition |σ(a) − σ(b)| ≤ |a − b| and σ(0) = 0. Suppose that the distribution is such that ‖X‖_∞ ≤ 1 a.s. Then

  E‖Rn‖_{F_{B,d}} ≤ 2B² √(2 log(2d) / n),

where d is the dimension of the input space, X = R^d.
67 / 174
![Page 83: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/83.jpg)
Rademacher Complexity of Sigmoid Networks: Proof
Define G := { (x_1, . . . , x_d) ↦ x_j : 1 ≤ j ≤ d },

  V_B := { x ↦ v′x : ‖v‖_1 = ∑_{i=1}^d |v_i| ≤ B } = B co(G ∪ −G),

with

  co(F) := { ∑_{i=1}^k α_i f_i : k ≥ 1, α_i ≥ 0, ‖α‖_1 = 1, f_i ∈ F }.

Then

  F_{B,d} = { x ↦ ∑_{i=1}^k w_i σ(v_i(x)) : k ≥ 1, ∑_{i=1}^k w_i ≤ B, v_i ∈ V_B }

          = B co(σ ∘ V_B ∪ −σ ∘ V_B).
68 / 174
![Page 85: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/85.jpg)
Rademacher Complexity of Sigmoid Networks: Proof
  E‖Rn‖_{F_{B,d}} = E‖Rn‖_{B co(σ∘V_B ∪ −σ∘V_B)}
                  = B E‖Rn‖_{co(σ∘V_B ∪ −σ∘V_B)}
                  = B E‖Rn‖_{σ∘V_B ∪ −σ∘V_B}
                  ≤ B E‖Rn‖_{σ∘V_B} + B E‖Rn‖_{−σ∘V_B}
                  = 2B E‖Rn‖_{σ∘V_B}
                  ≤ 2B E‖Rn‖_{V_B}
                  = 2B E‖Rn‖_{B co(G ∪ −G)}
                  = 2B² E‖Rn‖_{G ∪ −G}
                  ≤ 2B² √(2 log(2d) / n).
69 / 174
![Page 93: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/93.jpg)
Rademacher Complexity of Sigmoid Networks: Proof
The final step involved the following result.
Lemma. For f ∈ F satisfying |f(x)| ≤ 1,

  E‖Rn‖_F ≤ E √( 2 log(2 |F(X_1^n)|) / n ).
70 / 174
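The lemma is easy to check by simulation for the coordinate-projection class G ∪ −G used in the preceding proof; this is a hedged numerical sketch (sample size, dimension, and Monte Carlo budget are arbitrary), assuming ‖X‖_∞ ≤ 1 as in the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, n_mc = 200, 50, 2000
X = rng.uniform(-1.0, 1.0, size=(n, d))        # ||X_i||_inf <= 1 a.s.

# The class G ∪ -G of signed coordinate projections has at most 2d distinct
# restrictions to the data, and its Rademacher sup equals max_j |1/n sum_i eps_i X_ij|.
sups = np.empty(n_mc)
for t in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)
    sups[t] = np.abs(eps @ X / n).max()

print("E_eps ||R_n||_{G ∪ -G}        ~", sups.mean())
print("lemma bound with |F(X^n)|<=2d :", np.sqrt(2 * np.log(2 * 2 * d) / n))
```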
![Page 94: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/94.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
71 / 174
![Page 95: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/95.jpg)
Rademacher Complexity of ReLU Networks
Using the positive homogeneity property of the ReLU nonlinearity (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)) gives a simple argument to bound the Rademacher complexity.

Theorem. For a training set (X_1, Y_1), . . . , (X_n, Y_n) ∈ X × {±1} with ‖X_i‖ ≤ 1 a.s.,

  E‖Rn‖_{F_{F,B}} ≤ (2B)^L / √n,

where f ∈ F_{F,B} is an L-layer network of the form

  F_{F,B} := { x ↦ W_L σ(W_{L−1} · · · σ(W_1 x) · · · ) },

σ is 1-Lipschitz, positive homogeneous (that is, for all α ≥ 0 and x ∈ R, σ(αx) = ασ(x)) and applied componentwise, and ‖W_i‖_F ≤ B. (W_L is a row vector.)
72 / 174
![Page 96: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/96.jpg)
ReLU Networks: Proof
(Write E_ε as the conditional expectation given the data.)

Lemma.

  E_ε sup_{f∈F, ‖W‖_F≤B} (1/n) ‖ ∑_{i=1}^n ε_i σ(W f(X_i)) ‖_2 ≤ 2B E_ε sup_{f∈F} (1/n) ‖ ∑_{i=1}^n ε_i f(X_i) ‖_2.
Iterating this and using Jensen’s inequality proves the theorem.
73 / 174
![Page 97: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/97.jpg)
ReLU Networks: Proof
For W⊤ = (w_1 · · · w_k), we use positive homogeneity:

  ‖ ∑_{i=1}^n ε_i σ(W f(x_i)) ‖² = ∑_{j=1}^k ( ∑_{i=1}^n ε_i σ(w_j⊤ f(x_i)) )²

                                 = ∑_{j=1}^k ‖w_j‖² ( ∑_{i=1}^n ε_i σ( (w_j⊤/‖w_j‖) f(x_i) ) )².
74 / 174
![Page 98: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/98.jpg)
ReLU Networks: Proof
  sup_{‖W‖_F≤B} ∑_{j=1}^k ‖w_j‖² ( ∑_{i=1}^n ε_i σ( (w_j⊤/‖w_j‖) f(x_i) ) )²

    = sup_{‖w_j‖=1, ‖α‖_1≤B²} ∑_{j=1}^k α_j ( ∑_{i=1}^n ε_i σ( w_j⊤ f(x_i) ) )²

    = B² sup_{‖w‖=1} ( ∑_{i=1}^n ε_i σ( w⊤ f(x_i) ) )²,

then apply the (Contraction) and Cauchy–Schwarz inequalities.
75 / 174
![Page 99: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/99.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
  Motivation
  Rademacher Averages
  Rademacher Complexity: Structural Results
  Sigmoid Networks
  ReLU Networks
  Covering Numbers
Classification
Beyond ERM
Interpolation
76 / 174
![Page 100: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/100.jpg)
Rademacher Complexity and Covering Numbers
Definition. An ε-cover of a subset T of a metric space (S, d) is a set T̂ ⊂ T such that for each t ∈ T there is a t̂ ∈ T̂ with d(t, t̂) ≤ ε. The ε-covering number of T is

  N(ε, T, d) = min{ |T̂| : T̂ is an ε-cover of T }.

Intuition: a k-dimensional set T has N(ε, T, d) = Θ(1/ε^k). Example: ([0, 1]^k, ℓ_∞).
77 / 174
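A hedged sketch of the intuition (not from the slides): for ([0, 1]^k, ℓ_∞), a regular grid with cells of width 2ε gives an ε-cover of size roughly (1/(2ε))^k; the code below builds such a cover and checks it on random points.

```python
import numpy as np
from itertools import product

def grid_cover(k, eps):
    """Centers of a grid eps-cover of [0,1]^k under the l_inf metric."""
    m = int(np.ceil(1.0 / (2 * eps)))            # subintervals per coordinate
    pts_1d = (np.arange(m) + 0.5) / m            # cell midpoints: within eps of their cell
    return np.array(list(product(pts_1d, repeat=k)))

def is_cover(centers, eps, samples):
    """Check every sample is within eps (in l_inf) of some center."""
    d = np.abs(samples[:, None, :] - centers[None, :, :]).max(axis=2)
    return bool((d.min(axis=1) <= eps + 1e-12).all())

k, eps = 2, 0.1
centers = grid_cover(k, eps)
rng = np.random.default_rng(4)
samples = rng.uniform(size=(5000, k))
print("cover size:", len(centers), "  (1/(2 eps))^k =", (1 / (2 * eps)) ** k)
print("valid eps-cover on random points:", is_cover(centers, eps, samples))
```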
![Page 101: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/101.jpg)
Rademacher Complexity and Covering Numbers
Theorem. For any F and any X_1, . . . , X_n,

  E_ε ‖Rn‖_{F(X_1^n)} ≤ c ∫_0^∞ √( ln N(ε, F(X_1^n), ℓ_2) / n ) dε,

and

  E_ε ‖Rn‖_{F(X_1^n)} ≥ (1 / (c log n)) sup_{α>0} α √( ln N(α, F(X_1^n), ℓ_2) / n ).
78 / 174
![Page 102: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/102.jpg)
Covering numbers and smoothly parameterized functions
Lemma. Let F be a parameterized class of functions,

  F = { f(θ, ·) : θ ∈ Θ }.

Let ‖·‖_Θ be a norm on Θ and let ‖·‖_F be a norm on F. Suppose that the mapping θ ↦ f(θ, ·) is L-Lipschitz, that is,

  ‖f(θ, ·) − f(θ′, ·)‖_F ≤ L ‖θ − θ′‖_Θ.

Then N(ε, F, ‖·‖_F) ≤ N(ε/L, Θ, ‖·‖_Θ).
79 / 174
![Page 103: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/103.jpg)
Covering numbers and smoothly parameterized functions
A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space.

Example: If F is smoothly parameterized by (a compact set of) k parameters, then N(ε, F) = O(1/ε^k).
80 / 174
![Page 104: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/104.jpg)
Recap
[Figure: the class G within the set of all functions, marking the expected risk Eg, the empirical average (1/n) ∑_{i=1}^n g(z_i), and the functions g0, g_G, and ĝn.]
81 / 174
![Page 112: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/112.jpg)
Recap
E‖P − Pn‖_F ≈ E‖Rn‖_F (up to constants)

Advantage of dealing with the Rademacher process:

- often easier to analyze
- can work conditionally on the data

Example: F = { x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1 }, with ‖x‖ ≤ 1. Then

  E_ε ‖R̂n‖_F = E_ε sup_{‖w‖≤1} | (1/n) ∑_{i=1}^n ε_i ⟨w, x_i⟩ | = E_ε sup_{‖w‖≤1} | ⟨ w, (1/n) ∑_{i=1}^n ε_i x_i ⟩ |,

which is equal to

  E_ε ‖ (1/n) ∑_{i=1}^n ε_i x_i ‖ ≤ √( ∑_{i=1}^n ‖x_i‖² ) / n = √( tr(XX⊤) ) / n.
82 / 174
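The last inequality is straightforward to check by simulation; the following minimal sketch (illustrative sizes, not from the slides) draws Rademacher signs and compares E_ε‖(1/n) ∑ ε_i x_i‖ with √(tr(XX⊤))/n.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, n_mc = 100, 20, 5000
X = rng.standard_normal((n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # enforce ||x_i|| <= 1

eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
rad = np.linalg.norm(eps @ X / n, axis=1)          # ||1/n sum_i eps_i x_i|| per draw
print("E_eps ||1/n sum eps_i x_i|| ~", rad.mean())
print("sqrt(tr(X X^T))/n           =", np.sqrt(np.trace(X @ X.T)) / n)
```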
![Page 113: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/113.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers
Beyond ERM
Interpolation
83 / 174
![Page 114: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/114.jpg)
Classification and Rademacher Complexity
Recall: for F ⊆ {±1}^X, Y ∈ {±1}, and ℓ(y, y′) = I{y ≠ y′} = (1 − yy′)/2,

  E‖Rn‖_{ℓ∘F} ≤ (1/2) E‖Rn‖_F + O(1/√n).

How do we get a handle on the Rademacher complexity of a set F of classifiers? Recall that we are looking at the “size” of F projected onto the data:

  F(x_1^n) = { (f(x_1), . . . , f(x_n)) : f ∈ F } ⊆ {−1, 1}^n.
84 / 174
![Page 115: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/115.jpg)
Controlling Rademacher Complexity: Growth Function
Example: For the class of step functions, F = { x ↦ I{x ≤ α} : α ∈ R }, the set of restrictions,

  F(x_1^n) = { (f(x_1), . . . , f(x_n)) : f ∈ F },

is always small: |F(x_1^n)| ≤ n + 1. So E‖Rn‖_F ≤ √( 2 log(2(n + 1)) / n ).

Example: If F is parameterized by k bits, F = { x ↦ f(x, θ) : θ ∈ {0, 1}^k } for some f : X × {0, 1}^k → [0, 1], then |F(x_1^n)| ≤ 2^k and

  E‖Rn‖_F ≤ √( 2(k + 1) log 2 / n ).
85 / 174
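A small sketch (illustrative, not from the slides) that counts the distinct restrictions of the step-function class to a random sample, confirming |F(x_1^n)| ≤ n + 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(size=n)

# Restrictions of F = {x -> 1{x <= a} : a in R} to the sample: sweeping a from
# -inf to +inf only changes the restriction when a crosses a data point, so it
# suffices to try a in {-inf} ∪ {x_1, ..., x_n}.
alphas = np.concatenate(([-np.inf], np.sort(x)))
restrictions = {tuple((x <= a).astype(int)) for a in alphas}
print("|F(x_1^n)| =", len(restrictions), " <= n + 1 =", n + 1)
```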
![Page 117: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/117.jpg)
Growth Function
Definition. For a class F ⊆ {0, 1}^X, the growth function is

  Π_F(n) = max{ |F(x_1^n)| : x_1, . . . , x_n ∈ X }.

- E‖Rn‖_F ≤ √( 2 log(2 Π_F(n)) / n );  E‖P − Pn‖_F = O( √( log(Π_F(n)) / n ) ).
- Π_F(n) ≤ |F|, and lim_{n→∞} Π_F(n) = |F|.
- Π_F(n) ≤ 2^n. (But then this gives no useful bound on E‖Rn‖_F.)
86 / 174
![Page 121: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/121.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers
Beyond ERM
Interpolation
87 / 174
![Page 122: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/122.jpg)
Vapnik-Chervonenkis Dimension
Definition. A class F ⊆ {0, 1}^X shatters {x_1, . . . , x_d} ⊆ X means that |F(x_1^d)| = 2^d. The Vapnik-Chervonenkis dimension of F is

  d_VC(F) = max{ d : some x_1, . . . , x_d ∈ X is shattered by F }
          = max{ d : Π_F(d) = 2^d }.
88 / 174
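A brute-force check of shattering is sometimes helpful; the sketch below (an illustration with hypothetical helper names, not from the slides) verifies that the threshold class shatters any single point but no pair of points, so its VC-dimension is 1.

```python
import numpy as np

def shattered(points, classifiers):
    """Check whether the points are shattered: all 2^d label patterns are realized."""
    realized = {tuple(int(c(p)) for p in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

def threshold_classifiers(points):
    """Representative thresholds for F = {x -> 1{x <= a}} on the given points."""
    cuts = np.concatenate(([points.min() - 1], np.sort(points), [points.max() + 1]))
    return [lambda x, a=a: float(x <= a) for a in cuts]

one_point = np.array([0.3])
two_points = np.array([0.3, 0.7])
print("shatters 1 point :", shattered(one_point, threshold_classifiers(one_point)))    # True
print("shatters 2 points:", shattered(two_points, threshold_classifiers(two_points)))  # False, so d_VC = 1
```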
![Page 123: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/123.jpg)
Vapnik-Chervonenkis Dimension: “Sauer’s Lemma”
Theorem (Vapnik-Chervonenkis). d_VC(F) ≤ d implies

  Π_F(n) ≤ ∑_{i=0}^d (n choose i).

If n ≥ d, the latter sum is no more than (en/d)^d.

So the VC-dimension d is a single integer summary of the growth function:

  Π_F(n) = 2^n if n ≤ d,   and   Π_F(n) ≤ (en/d)^d if n > d.
89 / 174
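The bound is easy to evaluate numerically; this short sketch (illustrative numbers, not from the slides) compares ∑_{i≤d} (n choose i) with (en/d)^d, giving a sense of the polynomial-versus-exponential gap.

```python
import math

def sauer_bound(n, d):
    """Sum_{i=0}^{d} C(n, i), the Sauer-Shelah bound on the growth function."""
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 1000, 10
print("sum_{i<=d} C(n,i) =", sauer_bound(n, d))
print("(e n / d)^d       =", (math.e * n / d) ** d)
print("log2 of the bound =", math.log2(sauer_bound(n, d)), "  (compare with 2^n: log2 = n =", n, ")")
```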
![Page 125: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/125.jpg)
Vapnik-Chervonenkis Dimension: “Sauer’s Lemma”
Thus, for d_VC(F) ≤ d and n ≥ d,

  sup_P E‖P − Pn‖_F = O( √( d log(n/d) / n ) ).

And there is a matching lower bound.

Uniform convergence, uniformly over probability distributions, is equivalent to finiteness of the VC-dimension.
90 / 174
![Page 126: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/126.jpg)
VC-dimension Bounds for Parameterized Families
Consider a parameterized class of binary-valued functions,

  F = { x ↦ f(x, θ) : θ ∈ R^p },

where f : X × R^p → {±1}. Suppose that for each x, θ ↦ f(x, θ) can be computed using no more than t operations of the following kinds:

1. arithmetic (+, −, ×, /),
2. comparisons (>, =, <),
3. output ±1.

Theorem (Goldberg and Jerrum). d_VC(F) ≤ 4p(t + 2).
91 / 174
![Page 127: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/127.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers
Beyond ERM
Interpolation
92 / 174
![Page 128: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/128.jpg)
VC-Dimension of Neural Networks
Theorem. Consider the class F of {−1, 1}-valued functions computed by a network with L layers, p parameters, and k computation units with the following nonlinearities:

1. Piecewise constant (linear threshold units): VCdim(F) = Θ(p). (Baum and Haussler, 1989)
2. Piecewise linear (ReLUs): VCdim(F) = Θ(pL). (B., Harvey, Liaw, Mehrabian, 2017)
3. Piecewise polynomial: VCdim(F) = O(pL²). (B., Maiorov, Meir, 1998)
4. Sigmoid: VCdim(F) = O(p²k²). (Karpinski and Macintyre, 1994)
93 / 174
![Page 133: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/133.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
  Growth Function
  VC-Dimension
  VC-Dimension of Neural Networks
  Large Margin Classifiers
Beyond ERM
Interpolation
94 / 174
![Page 134: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/134.jpg)
Some Experimental Observations
(Schapire, Freund, B, Lee, 1997)
95 / 174
![Page 135: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/135.jpg)
Some Experimental Observations
(Schapire, Freund, B, Lee, 1997)
96 / 174
![Page 136: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/136.jpg)
Some Experimental Observations
(Lawrence, Giles, Tsoi, 1997)
97 / 174
![Page 137: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/137.jpg)
Some Experimental Observations
(Neyshabur, Tomioka, Srebro, 2015)
98 / 174
![Page 138: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/138.jpg)
Large-Margin Classifiers
- Consider a real-valued function f : X → R used for classification.
- The prediction on x ∈ X is sign(f(x)) ∈ {−1, 1}.
- If yf(x) > 0, then f classifies x correctly.
- We call yf(x) the margin of f on x. (c.f. the perceptron)
- We can view a larger margin as a more confident correct classification.
- Minimizing a continuous loss, such as

  $$\sum_{i=1}^n (f(X_i) - Y_i)^2, \qquad \text{or} \qquad \sum_{i=1}^n \exp(-Y_i f(X_i)),$$

  encourages large margins.
- For large-margin classifiers, we should expect the fine-grained details of f to be less important.
99 / 174
![Page 139: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/139.jpg)
Recall
Contraction property:
If ℓ is such that ŷ ↦ ℓ(ŷ, y) is 1-Lipschitz for all y and |ℓ| ≤ 1, then

$$\mathbb{E}\|R_n\|_{\ell \circ \mathcal{F}} \le 2\,\mathbb{E}\|R_n\|_{\mathcal{F}} + c/\sqrt{n}.$$

Unfortunately, classification loss is not Lipschitz. By writing classification loss as (1 − yŷ)/2, we did show that

$$\mathbb{E}\|R_n\|_{\ell \circ \operatorname{sign}(\mathcal{F})} \le 2\,\mathbb{E}\|R_n\|_{\operatorname{sign}(\mathcal{F})} + c/\sqrt{n},$$

but we want $\mathbb{E}\|R_n\|_{\mathcal{F}}$.
100 / 174
![Page 140: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/140.jpg)
Rademacher Complexity for Lipschitz Loss
Consider the 1/γ-Lipschitz surrogate loss
$$\varphi(\alpha) = \begin{cases} 1 & \text{if } \alpha \le 0,\\ 1 - \alpha/\gamma & \text{if } 0 < \alpha < \gamma,\\ 0 & \text{if } \alpha \ge \gamma. \end{cases}$$
(Figure: the ramp loss φ(yf(x)) upper-bounds the 0–1 loss I{yf(x) ≤ 0}, plotted as a function of the margin yf(x).)
101 / 174
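To make the surrogate concrete, here is a minimal NumPy sketch (not from the slides) of the ramp loss above; it checks on simulated margins that the empirical ramp loss sits between the 0–1 loss and the γ-margin loss. The margins and γ are arbitrary choices for illustration.

```python
import numpy as np

def ramp_loss(margins, gamma):
    """1/gamma-Lipschitz ramp: 1 for m <= 0, 1 - m/gamma for 0 < m < gamma, 0 for m >= gamma."""
    return np.clip(1.0 - margins / gamma, 0.0, 1.0)

rng = np.random.default_rng(0)
m = rng.normal(size=1000)               # simulated margins y * f(x)
gamma = 0.5
zero_one = (m <= 0).mean()              # P_n 1{Yf(X) <= 0}
ramp = ramp_loss(m, gamma).mean()       # P_n phi(Yf(X))
margin_loss = (m <= gamma).mean()       # P_n 1{Yf(X) <= gamma}
assert zero_one <= ramp <= margin_loss  # the sandwich from this slide, on the empirical measure
print(zero_one, ramp, margin_loss)
```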
![Page 141: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/141.jpg)
Rademacher Complexity for Lipschitz Loss
Large margin loss is an upper bound and classification loss is a lower bound:

$$\mathbf{1}\{Yf(X) \le 0\} \;\le\; \varphi(Yf(X)) \;\le\; \mathbf{1}\{Yf(X) \le \gamma\}.$$

So if we can relate the Lipschitz risk Pφ(Yf(X)) to the Lipschitz empirical risk P_n φ(Yf(X)), we have a large margin bound:

$$P\,\mathbf{1}\{Yf(X) \le 0\} \;\le\; P\varphi(Yf(X)) \quad \text{c.f.} \quad P_n\varphi(Yf(X)) \;\le\; P_n\,\mathbf{1}\{Yf(X) \le \gamma\}.$$
102 / 174
![Page 142: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/142.jpg)
Rademacher Complexity for Lipschitz Loss
With high probability, for all f ∈ F ,
$$L_{01}(f) = P\,\mathbf{1}\{Yf(X) \le 0\} \;\le\; P\varphi(Yf(X)) \;\le\; P_n\varphi(Yf(X)) + \frac{c}{\gamma}\,\mathbb{E}\|R_n\|_{\mathcal{F}} + O(1/\sqrt{n}) \;\le\; P_n\,\mathbf{1}\{Yf(X) \le \gamma\} + \frac{c}{\gamma}\,\mathbb{E}\|R_n\|_{\mathcal{F}} + O(1/\sqrt{n}).$$

Notice that we've turned a classification problem into a regression problem.

The VC-dimension (which captures arbitrarily fine-grained properties of the function class) is no longer important.
103 / 174
![Page 143: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/143.jpg)
Example: Linear Classifiers
Let F = {x ↦ ⟨w, x⟩ : ‖w‖ ≤ 1}. Then, since E‖R_n‖_F ≲ n^{−1/2},

$$L_{01}(f) \;\lesssim\; \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\} + \frac{1}{\gamma}\cdot\frac{1}{\sqrt{n}}.$$
Compare to Perceptron.
104 / 174
![Page 144: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/144.jpg)
Risk versus surrogate risk
In general, these margin bounds are upper bounds only.
But for any surrogate loss φ, we can define a transform ψ that gives a tight relationship between the excess 0–1 risk (over the Bayes risk) and the excess φ-risk:

Theorem.
1. For any P and f, ψ(L_{01}(f) − L*_{01}) ≤ L_φ(f) − L*_φ.
2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with

   $$L_{01}(f) - L^*_{01} = \theta, \qquad \psi(\theta) \le L_\varphi(f) - L^*_\varphi \le \psi(\theta) + \varepsilon.$$
105 / 174
![Page 145: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/145.jpg)
Generalization: Margins and Size of Parameters
- A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.
- For regression, the complexity of a neural network is controlled by the size of the parameters, and can be independent of the number of parameters.
- We have a tradeoff between the fit to the training data (margins) and the complexity (size of parameters):

  $$\Pr(\operatorname{sign}(f(X)) \ne Y) \;\le\; \frac{1}{n}\sum_{i=1}^n \varphi(Y_i f(X_i)) + p_n(f)$$

- Even if the training set is classified correctly, it might be worthwhile to increase the complexity, to improve this loss function.
106 / 174
![Page 149: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/149.jpg)
In the next few sections, we show that
$$\mathbb{E}\bigl[L(\hat f_n) - \hat L(\hat f_n)\bigr]$$

can be small if the learning algorithm f̂_n has certain properties.

Keep in mind that this is an interesting quantity for two reasons:

- it upper bounds the excess expected loss of ERM, as we showed before.
- it can be written as an a-posteriori bound

  $$L(\hat f_n) \le \hat L(\hat f_n) + \ldots$$

  regardless of whether f̂_n minimizes empirical loss.
107 / 174
![Page 150: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/150.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods
Interpolation
108 / 174
![Page 151: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/151.jpg)
Compression Set
Let us use the shortened notation for data: S = {Z_1, . . . , Z_n}, and let us make the dependence of the algorithm f̂_n on the training set explicit: f̂_n = f̂_n[S]. As before, denote G = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}, and write ĝ_n(·) = ℓ(f̂_n(·), ·). Let us write ĝ_n[S](·) to emphasize the dependence.

Suppose there exists a "compression function" C_k which selects from any dataset S of size n a subset of k examples C_k(S) ⊆ S such that

$$\hat f_n[S] = \hat f_k[C_k(S)].$$

That is, the learning algorithm produces the same function when given S or its subset C_k(S).
One can keep in mind the example of support vectors in SVMs.
109 / 174
![Page 152: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/152.jpg)
Then,
$$L(\hat f_n) - \hat L(\hat f_n) = \mathbb{E}\,\hat g_n(Z) - \frac{1}{n}\sum_{i=1}^n \hat g_n(Z_i)
= \mathbb{E}\,\hat g_k[C_k(S)](Z) - \frac{1}{n}\sum_{i=1}^n \hat g_k[C_k(S)](Z_i)
\le \max_{I \subseteq \{1,\dots,n\},\ |I| \le k} \left\{ \mathbb{E}\,\hat g_k[S_I](Z) - \frac{1}{n}\sum_{i=1}^n \hat g_k[S_I](Z_i) \right\}$$

where S_I is the subset indexed by I.
110 / 174
![Page 153: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/153.jpg)
Since ĝ_k[S_I] only depends on k out of n points, the empirical average is "mostly out of sample". Adding and subtracting loss functions for an additional set of i.i.d. random variables W = {Z′_1, . . . , Z′_k} results in an upper bound

$$\max_{I \subseteq \{1,\dots,n\},\ |I| \le k} \left\{ \mathbb{E}\,\hat g_k[S_I](Z) - \frac{1}{n}\sum_{Z' \in S'} \hat g_k[S_I](Z') \right\} + \frac{(b-a)k}{n}$$

where [a, b] is the range of functions in G and S′ is obtained from S by replacing S_I with the corresponding subset W_I.
111 / 174
![Page 154: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/154.jpg)
For each fixed I, the random variable

$$\mathbb{E}\,\hat g_k[S_I](Z) - \frac{1}{n}\sum_{Z' \in S'} \hat g_k[S_I](Z')$$

is zero mean with standard deviation O((b − a)/√n). Hence, the expected maximum over I with respect to S, W is at most

$$c\,\sqrt{\frac{(b-a)\,k\log(en/k)}{n}}$$

since $\log\binom{n}{k} \le k \log(en/k)$.
112 / 174
![Page 155: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/155.jpg)
Conclusion: compression-style argument limits the bias
$$\mathbb{E}\bigl[L(\hat f_n) - \hat L(\hat f_n)\bigr] \le O\!\left(\sqrt{\frac{k\log n}{n}}\right),$$

which is non-vacuous if k = o(n/ log n).

Recall that this term was the upper bound (up to log factors) on the expected excess loss of ERM if the class has VC dimension k. However, a possible equivalence between compression and VC dimension is still being investigated.
113 / 174
![Page 156: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/156.jpg)
Example: Classification with Thresholds in 1D
- X = [0, 1], Y = {0, 1}
- F = {f_θ : f_θ(x) = I{x ≥ θ}, θ ∈ [0, 1]}
- ℓ(f_θ(x), y) = I{f_θ(x) ≠ y}

(Figure: data points on [0, 1] with the ERM threshold f̂_n placed between the neighboring points x_l and x_r.)

For any set of data (x_1, y_1), . . . , (x_n, y_n), the ERM solution f̂_n has the property that the first occurrence x_l on the left of the threshold has label y_l = 0, while the first occurrence x_r on the right has label y_r = 1.

Enough to take k = 2 and define f̂_n[S] = f̂_2[{(x_l, 0), (x_r, 1)}].
114 / 174
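A small sketch of this example (my own illustrative implementation, with hypothetical helper names `erm_threshold` and `compress`): ERM over thresholds on [0, 1], and a size-2 compression set consisting of the two points straddling the learned threshold. The data here are taken realizable for simplicity.

```python
import numpy as np

def erm_threshold(x, y):
    """ERM for F = {x -> 1{x >= theta}}: pick a threshold with the fewest mistakes."""
    candidates = np.concatenate(([0.0], np.sort(x) + 1e-12, [1.0 + 1e-12]))
    errs = [np.mean((x >= t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

def compress(x, y, theta):
    """Compression set of size <= 2: nearest 0-labelled point left of theta, nearest 1-labelled right."""
    left = x[(x < theta) & (y == 0)]
    right = x[(x >= theta) & (y == 1)]
    pairs = []
    if left.size:
        pairs.append((left.max(), 0))
    if right.size:
        pairs.append((right.min(), 1))
    return pairs

rng = np.random.default_rng(1)
x = rng.uniform(size=50)
y = (x >= 0.3).astype(int)          # realizable labels for illustration
theta = erm_threshold(x, y)
print(theta, compress(x, y, theta))
# Re-running ERM on just the compressed pairs recovers essentially the same classifier.
```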
![Page 157: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/157.jpg)
Further examples/observations:
- Compression of size d for hyperplanes (realizable case)
- Compression of size 1/γ² for the margin case
- Bernstein bound gives a 1/n rate rather than a 1/√n rate on realizable data (zero empirical error).
115 / 174
![Page 158: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/158.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods
Interpolation
116 / 174
![Page 159: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/159.jpg)
Algorithmic Stability
Recall that compression was a way to upper bound E[L(f̂_n) − L̂(f̂_n)].
Algorithmic stability is another path to the same goal.
Compare:
- Compression: f̂_n depends only on a subset of k datapoints.
- Stability: f̂_n does not depend on any of the datapoints too strongly.

As before, let's write the shorthand g = ℓ ∘ f and ĝ_n = ℓ ∘ f̂_n.
117 / 174
![Page 160: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/160.jpg)
Algorithmic Stability

We now write

$$\mathbb{E}_S\, L(\hat f_n) = \mathbb{E}_{Z_1,\dots,Z_n,Z}\; \hat g_n[Z_1, \dots, Z_n](Z)$$

Again, the meaning of ĝ_n[Z_1, . . . , Z_n](Z): train on Z_1, . . . , Z_n and test on Z.

On the other hand,

$$\mathbb{E}_S\, \hat L(\hat f_n) = \mathbb{E}_{Z_1,\dots,Z_n} \frac{1}{n}\sum_{i=1}^n \hat g_n[Z_1, \dots, Z_n](Z_i)
= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{Z_1,\dots,Z_n}\, \hat g_n[Z_1, \dots, Z_n](Z_i)
= \mathbb{E}_{Z_1,\dots,Z_n}\, \hat g_n[Z_1, \dots, Z_n](Z_1)$$

where the last step holds for symmetric algorithms (w.r.t. permutation of training data). Of course, instead of Z_1 we can take any Z_i.
118 / 174
![Page 161: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/161.jpg)
Now comes the renaming trick. It takes a minute to get used to, if you haven't seen it.

Note that Z_1, . . . , Z_n, Z are i.i.d. Hence,

$$\mathbb{E}_S\, L(\hat f_n) = \mathbb{E}_{Z_1,\dots,Z_n,Z}\; \hat g_n[Z_1, \dots, Z_n](Z) = \mathbb{E}_{Z_1,\dots,Z_n,Z}\; \hat g_n[Z, Z_2, \dots, Z_n](Z_1)$$

Therefore,

$$\mathbb{E}\bigl[L(\hat f_n) - \hat L(\hat f_n)\bigr] = \mathbb{E}_{Z_1,\dots,Z_n,Z}\Bigl[\hat g_n[Z, Z_2, \dots, Z_n](Z_1) - \hat g_n[Z_1, \dots, Z_n](Z_1)\Bigr]$$
119 / 174
![Page 162: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/162.jpg)
Of course, we haven't really done much except re-writing the expectation. But the difference

$$\hat g_n[Z, Z_2, \dots, Z_n](Z_1) - \hat g_n[Z_1, \dots, Z_n](Z_1)$$

has a "stability" interpretation. If it holds that the output of the algorithm "does not change much" when one datapoint is replaced with another, then the gap E[L(f̂_n) − L̂(f̂_n)] is small.

Moreover, since everything we've written is an equality, this stability is equivalent to having a small gap E[L(f̂_n) − L̂(f̂_n)].
120 / 174
![Page 163: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/163.jpg)
NB: our aim of ensuring a small E[L(f̂_n) − L̂(f̂_n)] only makes sense if L̂(f̂_n) is small (e.g. on average). That is, the analysis only makes sense for those methods that explicitly or implicitly minimize empirical loss (or a regularized variant of it).

It's not enough to be stable. Consider a learning mechanism that ignores the data and outputs f̂_n = f_0, a constant function. Then E[L(f̂_n) − L̂(f̂_n)] = 0 and the algorithm is very stable. However, it does not do anything interesting.
121 / 174
![Page 164: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/164.jpg)
Uniform Stability
Rather than the average notion we just discussed, let's consider a much stronger notion:

We say that an algorithm is β-uniformly stable if

$$\forall i \in [n],\; z_1, \dots, z_n, z', z: \qquad \bigl|\hat g_n[S](z) - \hat g_n[S^{i,z'}](z)\bigr| \le \beta$$

where S^{i,z'} = {z_1, . . . , z_{i−1}, z', z_{i+1}, . . . , z_n}.
122 / 174
![Page 165: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/165.jpg)
Uniform Stability
Clearly, for any realization of Z_1, . . . , Z_n, Z,

$$\hat g_n[Z, Z_2, \dots, Z_n](Z_1) - \hat g_n[Z_1, \dots, Z_n](Z_1) \le \beta,$$

and so the expected loss of a β-uniformly-stable ERM is β-close to its empirical error (in expectation). See recent work by Feldman and Vondrak for sharp high-probability statements.

Of course, it is unclear at this point whether a β-uniformly-stable ERM (or near-ERM) exists.
123 / 174
![Page 166: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/166.jpg)
Kernel Ridge Regression
Consider
$$\hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2 + \lambda\,\|f\|_K^2$$

in the RKHS H corresponding to kernel K.

Assume K(x, x) ≤ κ² for any x.

Lemma. Kernel Ridge Regression is β-uniformly stable with β = O(1/(λn)).
124 / 174
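As a sanity check on the lemma, here is a hedged numerical sketch (not from the slides): fit kernel ridge regression, replace one training point, and compare the sup-norm change of the fitted function against the 1/(λn) scale. The RBF kernel, bandwidth, and data-generating model are illustrative assumptions.

```python
import numpy as np

def krr_fit_predict(X, y, X_test, lam, bandwidth=1.0):
    """Kernel ridge regression with an RBF kernel (illustrative choice, not from the slides)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(y)
    alpha = np.linalg.solve(k(X, X) + lam * n * np.eye(n), y)
    return k(X_test, X) @ alpha

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)
X_test = rng.normal(size=(500, d))

lam = 0.1
f = krr_fit_predict(X, y, X_test, lam)
Xp, yp = X.copy(), y.copy()
Xp[0], yp[0] = rng.normal(size=d), 0.0        # replace one training point
fp = krr_fit_predict(Xp, yp, X_test, lam)
print(np.max(np.abs(f - fp)), "vs the 1/(lam*n) scale:", 1 / (lam * n))
```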
![Page 167: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/167.jpg)
Proof (stability of Kernel Ridge Regression)
To prove this, first recall the definition of a σ-strongly convex function φ on a convex domain W:

$$\forall u, v \in \mathcal{W}: \qquad \varphi(u) \ge \varphi(v) + \langle \nabla\varphi(v), u - v\rangle + \frac{\sigma}{2}\|u - v\|^2.$$

Suppose φ, φ′ are both σ-strongly convex. Suppose w, w′ satisfy ∇φ(w) = ∇φ′(w′) = 0. Then

$$\varphi(w') \ge \varphi(w) + \frac{\sigma}{2}\|w - w'\|^2 \qquad \text{and} \qquad \varphi'(w) \ge \varphi'(w') + \frac{\sigma}{2}\|w - w'\|^2.$$

As a trivial consequence,

$$\sigma\,\|w - w'\|^2 \le \bigl[\varphi(w') - \varphi'(w')\bigr] + \bigl[\varphi'(w) - \varphi(w)\bigr].$$
125 / 174
![Page 168: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/168.jpg)
Proof (stability of Kernel Ridge Regression)

Now take

$$\varphi(f) = \frac{1}{n}\sum_{i \in S} (f(x_i) - y_i)^2 + \lambda\,\|f\|_K^2 \qquad \text{and} \qquad \varphi'(f) = \frac{1}{n}\sum_{i \in S'} (f(x_i) - y_i)^2 + \lambda\,\|f\|_K^2$$

where S and S′ differ in one element: (x_i, y_i) is replaced with (x′_i, y′_i).

Let f̂_n, f̂′_n be the minimizers of φ, φ′, respectively. Then

$$\varphi(\hat f'_n) - \varphi'(\hat f'_n) \le \frac{1}{n}\Bigl((\hat f'_n(x_i) - y_i)^2 - (\hat f'_n(x'_i) - y'_i)^2\Bigr)
\qquad \text{and} \qquad
\varphi'(\hat f_n) - \varphi(\hat f_n) \le \frac{1}{n}\Bigl((\hat f_n(x'_i) - y'_i)^2 - (\hat f_n(x_i) - y_i)^2\Bigr)$$

NB: we have been operating with f as vectors. To be precise, one needs to define the notion of strong convexity over H. Let us sweep it under the rug and say that φ, φ′ are 2λ-strongly convex with respect to ‖·‖_K.
126 / 174
![Page 169: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/169.jpg)
Proof (stability of Kernel Ridge Regression)
Then ‖f̂_n − f̂′_n‖²_K is at most

$$\frac{1}{2\lambda n}\Bigl((\hat f'_n(x_i) - y_i)^2 - (\hat f_n(x_i) - y_i)^2 + (\hat f_n(x'_i) - y'_i)^2 - (\hat f'_n(x'_i) - y'_i)^2\Bigr)$$

which is at most

$$\frac{1}{2\lambda n}\, C\, \bigl\|\hat f_n - \hat f'_n\bigr\|_\infty$$

where C = 4(1 + c) if |Y_i| ≤ 1 and |f̂_n(x_i)| ≤ c.

On the other hand, for any x,

$$f(x) = \langle f, K_x\rangle \le \|f\|_K\,\|K_x\| = \|f\|_K\sqrt{\langle K_x, K_x\rangle} = \|f\|_K\sqrt{K(x,x)} \le \kappa\,\|f\|_K,$$

and so ‖f‖_∞ ≤ κ‖f‖_K.
127 / 174
![Page 170: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/170.jpg)
Proof (stability of Kernel Ridge Regression)
Putting everything together,

$$\bigl\|\hat f_n - \hat f'_n\bigr\|_K^2 \;\le\; \frac{1}{2\lambda n}\,C\,\bigl\|\hat f_n - \hat f'_n\bigr\|_\infty \;\le\; \frac{\kappa C}{2\lambda n}\,\bigl\|\hat f_n - \hat f'_n\bigr\|_K$$

Hence,

$$\bigl\|\hat f_n - \hat f'_n\bigr\|_K \;\le\; \frac{\kappa C}{2\lambda n}.$$

To finish the claim,

$$(\hat f_n(x_i) - y_i)^2 - (\hat f'_n(x_i) - y_i)^2 \;\le\; C\,\bigl\|\hat f_n - \hat f'_n\bigr\|_\infty \;\le\; \kappa C\,\bigl\|\hat f_n - \hat f'_n\bigr\|_K \;\le\; O\!\left(\frac{1}{\lambda n}\right)$$
128 / 174
![Page 171: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/171.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
  Compression Bounds
  Algorithmic Stability
  Local Methods
Interpolation
129 / 174
![Page 172: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/172.jpg)
We consider "local" procedures such as k-Nearest-Neighbors or local smoothing. They have a different bias-variance decomposition (we do not fix a class F).

Analysis will rely on local similarity (e.g. Lipschitz-ness) of the regression function f*.

Idea: to predict y at a given x, look up in the dataset those Y_i for which X_i is "close" to x.
130 / 174
![Page 173: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/173.jpg)
Bias-Variance

It's time to revisit the bias-variance picture. Recall that our goal was to ensure that

$$\mathbb{E}\,L(\hat f_n) - L(f^*)$$

decreases with data size n, where f* gives the smallest possible L.

For "simple problems" (that is, strong assumptions on P), one can ensure this without the bias-variance decomposition. Examples: Perceptron, linear regression in the d < n regime, etc.

However, for more interesting problems, we cannot get this difference to be small without introducing bias, because otherwise the variance (fluctuation of the stochastic part) is too large.

Our approach so far was to split this term into an estimation-approximation error with respect to some class F:

$$\bigl[\mathbb{E}\,L(\hat f_n) - L(f_{\mathcal F})\bigr] + \bigl[L(f_{\mathcal F}) - L(f^*)\bigr]$$
131 / 174
![Page 174: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/174.jpg)
Bias-Variance
Now we’ll study a different bias-variance decomposition, typically used innonparametric statistics. We will only work with square loss.
Rather than fixing F that controls the estimation error, we fix an algorithm (procedure/estimator) f̂_n that has some tunable parameter.

By definition, E[Y | X = x] = f*(x). Then we write

$$\mathbb{E}\,L(\hat f_n) - L(f^*) = \mathbb{E}(\hat f_n(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2
= \mathbb{E}(\hat f_n(X) - f^*(X) + f^*(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2
= \mathbb{E}(\hat f_n(X) - f^*(X))^2$$
because the cross term vanishes (check!)
132 / 174
![Page 175: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/175.jpg)
Bias-Variance
Before proceeding, let us discuss the last expression.
$$\mathbb{E}(\hat f_n(X) - f^*(X))^2 = \mathbb{E}_S \int_x (\hat f_n(x) - f^*(x))^2\, P(dx) = \int_x \mathbb{E}_S (\hat f_n(x) - f^*(x))^2\, P(dx)$$

We will often analyze E_S(f̂_n(x) − f*(x))² for fixed x and then integrate.

The integral is a measure of distance between two functions:

$$\|f - g\|_{L_2(P)}^2 = \int_x (f(x) - g(x))^2\, P(dx).$$
133 / 174
![Page 176: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/176.jpg)
Bias-Variance
Let us drop L_2(P) from the notation for brevity. The bias-variance decomposition can be written as

$$\mathbb{E}_S\bigl\|\hat f_n - f^*\bigr\|_{L_2(P)}^2
= \mathbb{E}_S\bigl\|\hat f_n - \mathbb{E}_{Y_{1:n}}[\hat f_n] + \mathbb{E}_{Y_{1:n}}[\hat f_n] - f^*\bigr\|^2
= \mathbb{E}_S\bigl\|\hat f_n - \mathbb{E}_{Y_{1:n}}[\hat f_n]\bigr\|^2 + \mathbb{E}_{X_{1:n}}\bigl\|\mathbb{E}_{Y_{1:n}}[\hat f_n] - f^*\bigr\|^2,$$

because the cross term is zero in expectation.

The first term is the variance, the second is the squared bias. One typically increases with the parameter, the other decreases.

The parameter is chosen either (a) theoretically or (b) by cross-validation (this is the usual case in practice).
134 / 174
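A quick Monte Carlo sketch of this decomposition at a fixed point x (a toy 1D polynomial-regression setup of my own choosing, not from the slides): resampling only the Y's, the mean squared error at x splits into variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return np.sin(2 * np.pi * x)

def fit_predict(X, Y, x0, degree):
    """Tunable estimator: least-squares polynomial fit of a given degree, evaluated at x0."""
    coef = np.polyfit(X, Y, degree)
    return np.polyval(coef, x0)

n, x0, degree, sigma = 50, 0.3, 3, 0.5
X = rng.uniform(size=n)                      # fixed design; only the Y's are redrawn
preds = []
for _ in range(2000):
    Y = f_star(X) + sigma * rng.normal(size=n)
    preds.append(fit_predict(X, Y, x0, degree))
preds = np.array(preds)

mse = np.mean((preds - f_star(x0)) ** 2)
variance = preds.var()
bias_sq = (preds.mean() - f_star(x0)) ** 2
print(mse, variance + bias_sq)               # the decomposition holds for the Monte Carlo sample as well
```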
![Page 177: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/177.jpg)
k-Nearest Neighbors
As before, we are given (X_1, Y_1), . . . , (X_n, Y_n) i.i.d. from P. To make a prediction of Y at a given x, we sort the points according to the distance ‖X_i − x‖. Let

$$(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$$

be the sorted list (remember this depends on x).

The k-NN estimate is defined as

$$\hat f_n(x) = \frac{1}{k}\sum_{i=1}^k Y_{(i)}.$$
135 / 174
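A minimal NumPy sketch of the k-NN estimate (the data and parameters are illustrative choices; the choice of k anticipates the bias-variance trade-off worked out on the next slides):

```python
import numpy as np

def knn_regress(X_train, Y_train, x, k):
    """k-NN regression estimate at a single point x: average the Y's of the k nearest X's."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Y_train[nearest].mean()

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.uniform(size=(n, d))
f_star = lambda x: np.sin(3 * x[..., 0]) + x[..., 1] ** 2
Y = f_star(X) + 0.2 * rng.normal(size=n)

k = int(n ** (2 / (2 + d)))     # k ~ n^{2/(2+d)}, the choice suggested by the later analysis
x0 = np.array([0.5, 0.5, 0.5])
print(knn_regress(X, Y, x0, k), f_star(x0))
```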
![Page 178: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/178.jpg)
Variance: Given x,

$$\hat f_n(x) - \mathbb{E}_{Y_{1:n}}[\hat f_n(x)] = \frac{1}{k}\sum_{i=1}^k \bigl(Y_{(i)} - f^*(X_{(i)})\bigr)$$

which is on the order of 1/√k. Then the variance is of the order 1/k.

Bias: a bit more complicated. For a given x,

$$\mathbb{E}_{Y_{1:n}}[\hat f_n(x)] - f^*(x) = \frac{1}{k}\sum_{i=1}^k \bigl(f^*(X_{(i)}) - f^*(x)\bigr).$$

Suppose f* is 1-Lipschitz. Then the square of the above is (using Jensen's inequality and the Lipschitz property)

$$\left(\frac{1}{k}\sum_{i=1}^k \bigl(f^*(X_{(i)}) - f^*(x)\bigr)\right)^2 \le \frac{1}{k}\sum_{i=1}^k \bigl\|X_{(i)} - x\bigr\|^2$$

So, the bias is governed by how close the closest k random points are to x.
136 / 174
![Page 179: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/179.jpg)
Claim: it is enough to know an upper bound on the distance to the closest point to x among n points.

Argument: for simplicity, assume that n/k is an integer. Divide the original (unsorted) dataset into k blocks of size n/k each. Let X̄_i be the closest point to x in the i-th block. Then the collection X̄_1, . . . , X̄_k is a k-subset which is no closer to x than the set of k nearest neighbors. That is,

$$\frac{1}{k}\sum_{i=1}^k \bigl\|X_{(i)} - x\bigr\|^2 \le \frac{1}{k}\sum_{i=1}^k \bigl\|\bar X_i - x\bigr\|^2$$

Taking expectation (with respect to the dataset), the bias term is at most

$$\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^k \bigl\|\bar X_i - x\bigr\|^2\right] = \mathbb{E}\,\bigl\|\bar X_1 - x\bigr\|^2,$$

which is the expected squared distance from x to the closest point in a random set of n/k points.
137 / 174
![Page 180: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/180.jpg)
If the support of X is bounded and d ≥ 3, then one can estimate

$$\mathbb{E}\,\bigl\|X - X_{(1)}\bigr\|^2 \lesssim n^{-2/d}.$$

That is, we expect the closest neighbor of a random point X to be no further than n^{−1/d} away from one of n randomly drawn points.

Thus, the bias (which is the expected squared distance from x to the closest point in a random set of n/k points) is at most

$$(n/k)^{-2/d}.$$
138 / 174
![Page 181: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/181.jpg)
Putting everything together, the bias-variance decomposition yields

$$\frac{1}{k} + \left(\frac{k}{n}\right)^{2/d}$$

The optimal choice is k ∼ n^{2/(2+d)}, and the overall rate of estimation at a given point x is

$$n^{-2/(2+d)}.$$

Since the result holds for any x, the integrated risk is also

$$\mathbb{E}\,\bigl\|\hat f_n - f^*\bigr\|^2 \lesssim n^{-2/(2+d)}.$$
139 / 174
![Page 182: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/182.jpg)
Summary
- We sketched the proof that k-Nearest-Neighbors has sample complexity guarantees for prediction or estimation problems with square loss if k is chosen appropriately.
- Analysis is very different from the "empirical process" approach for ERM.
- Truly nonparametric!
- No assumptions on the underlying density (in d ≥ 3) beyond compact support. Additional assumptions needed for d ≤ 3.
140 / 174
![Page 183: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/183.jpg)
Local Kernel Regression: Nadaraya-Watson
Fix a kernel K : R^d → R_{≥0}. Assume K is zero outside the unit Euclidean ball at the origin (not true for e^{−x²}, but close enough).

(figure from Gyorfi et al)

Let K_h(x) = K(x/h), so that K_h(x − x′) is zero if ‖x − x′‖ ≥ h.

h is the "bandwidth" – a tunable parameter.

Assume K(x) ≥ c·I{‖x‖ ≤ 1} for some c > 0. This is important for the "averaging effect" to kick in.
141 / 174
![Page 184: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/184.jpg)
Nadaraya-Watson estimator:

$$\hat f_n(x) = \sum_{i=1}^n Y_i\, W_i(x) \qquad \text{with} \qquad W_i(x) = \frac{K_h(x - X_i)}{\sum_{j=1}^n K_h(x - X_j)}$$

(Note: ∑_i W_i = 1).
142 / 174
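A minimal sketch of the Nadaraya-Watson estimator with a boxcar kernel K(u) = I{‖u‖ ≤ 1}, which satisfies the assumptions above (zero outside the unit ball, bounded below by c on it); the data and bandwidth are illustrative choices.

```python
import numpy as np

def nadaraya_watson(X_train, Y_train, x, h):
    """Nadaraya-Watson estimate at x with the boxcar kernel K(u) = 1{||u|| <= 1}."""
    w = (np.linalg.norm((X_train - x) / h, axis=1) <= 1.0).astype(float)
    if w.sum() == 0:
        return 0.0                      # no training points within distance h of x
    return (w * Y_train).sum() / w.sum()

rng = np.random.default_rng(0)
n, d = 5000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = np.cos(X[:, 0] * X[:, 1]) + 0.1 * rng.normal(size=n)

h = n ** (-1 / (2 + d))                 # bandwidth choice discussed on the following slides
print(nadaraya_watson(X, Y, np.zeros(d), h), np.cos(0.0))
```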
![Page 185: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/185.jpg)
Unlike the k-NN example, bias is easier to estimate.
Bias: for a given x,

$$\mathbb{E}_{Y_{1:n}}[\hat f_n(x)] = \mathbb{E}_{Y_{1:n}}\left[\sum_{i=1}^n Y_i\, W_i(x)\right] = \sum_{i=1}^n f^*(X_i)\, W_i(x)$$

and so

$$\mathbb{E}_{Y_{1:n}}[\hat f_n(x)] - f^*(x) = \sum_{i=1}^n \bigl(f^*(X_i) - f^*(x)\bigr) W_i(x)$$

Suppose f* is 1-Lipschitz. Since K_h is zero outside the h-radius ball,

$$\bigl|\mathbb{E}_{Y_{1:n}}[\hat f_n(x)] - f^*(x)\bigr|^2 \le h^2.$$
143 / 174
![Page 186: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/186.jpg)
Variance: we have

$$\hat f_n(x) - \mathbb{E}_{Y_{1:n}}[\hat f_n(x)] = \sum_{i=1}^n \bigl(Y_i - f^*(X_i)\bigr) W_i(x)$$

The expectation of the square of this difference is at most

$$\mathbb{E}\left[\sum_{i=1}^n (Y_i - f^*(X_i))^2\, W_i(x)^2\right]$$

since the cross terms are zero (fix the X's, take expectation with respect to the Y's).

We are left analyzing

$$n\,\mathbb{E}\left[\frac{K_h(x - X_1)^2}{\bigl(\sum_{i=1}^n K_h(x - X_i)\bigr)^2}\right]$$

Under some assumptions on the density of X, the denominator is at least (nh^d)² with high probability, whereas E K_h(x − X_1)² = O(h^d), assuming ∫K² < ∞. This gives an overall variance of O(1/(nh^d)). Many details skipped here (e.g. problems at the boundary, assumptions, etc.)

144 / 174
![Page 187: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/187.jpg)
Overall, bias and variance with h ∼ n^{−1/(2+d)} yield

$$h^2 + \frac{1}{nh^d} = n^{-2/(2+d)}$$
145 / 174
![Page 188: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/188.jpg)
Summary
- Analyzed smoothing methods with kernels. As with nearest neighbors, slow (nonparametric) rates in large d.
- Same bias-variance decomposition approach as k-NN.
146 / 174
![Page 189: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/189.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression
147 / 174
![Page 190: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/190.jpg)
Overfitting in Deep Networks
- Deep networks can be trained to zero training error (for regression loss)
- ... with near state-of-the-art performance (Ruslan Salakhutdinov, Simons Machine Learning Boot Camp, January 2017)
- ... even for noisy problems.
- No tradeoff between fit to training data and complexity!
- Benign overfitting.
(Zhang, Bengio, Hardt, Recht, Vinyals, 2017)
148 / 174
![Page 195: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/195.jpg)
Overfitting in Deep Networks
Not surprising:
- More parameters than sample size. (That's called 'nonparametric.')
- Good performance with zero classification loss on training data. (c.f. Margins analysis: trade-off between (regression) fit and complexity.)
- Good performance with zero regression loss on training data when the Bayes risk is zero. (c.f. Perceptron)

Surprising:

- Good performance with zero regression loss on noisy training data.
- All the label noise is making its way into the parameters, but it isn't hurting prediction accuracy!
149 / 174
![Page 201: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/201.jpg)
Overfitting
"... interpolating fits... [are] unlikely to predict future data well at all."

("Elements of Statistical Learning," Hastie, Tibshirani, Friedman, 2001)
150 / 174
![Page 202: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/202.jpg)
Overfitting
151 / 174
![Page 203: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/203.jpg)
Benign Overfitting
A new statistical phenomenon: good prediction with zero training error for regression loss

- Statistical wisdom says a prediction rule should not fit too well.
- But deep networks are trained to fit noisy data perfectly, and they predict well.
152 / 174
![Page 204: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/204.jpg)
Not an Isolated Phenomenon
(Figure: Kernel Regression on MNIST: log(error) as a function of the regularization parameter λ, for digit pairs [2,5], [2,9], [3,6], [3,8], [4,7].)
λ = 0: the interpolated solution, perfect fit on training data.
153 / 174
![Page 205: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/205.jpg)
Not an Isolated Phenomenon
154 / 174
![Page 206: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/206.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression
155 / 174
![Page 207: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/207.jpg)
Simplicial interpolation (Belkin-Hsu-Mitra)
Observe: 1-Nearest Neighbor is interpolating. However, one cannot guarantee that E L(f̂_n) − L(f*) is small.

Cover-Hart '67:

$$\lim_{n\to\infty} \mathbb{E}\,L(\hat f_n) \le 2\,L(f^*)$$
156 / 174
![Page 208: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/208.jpg)
Simplicial interpolation (Belkin-Hsu-Mitra)
Under regularity conditions, simplicial interpolation f̂_n satisfies

$$\limsup_{n\to\infty}\; \mathbb{E}\,\bigl\|\hat f_n - f^*\bigr\|^2 \le \frac{2}{d+2}\,\mathbb{E}(f^*(X) - Y)^2$$
Blessing of high dimensionality!
157 / 174
![Page 209: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/209.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression
158 / 174
![Page 210: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/210.jpg)
Nadaraya-Watson estimator (Belkin-R.-Tsybakov)
Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value τ at 0, e.g.

$$K(x) = \min\{1/\|x\|^\alpha,\ \tau\}$$

Note that a large τ means f̂_n(X_i) ≈ Y_i, since the weight W_i(X_i) is large.

If τ = ∞, we get interpolation f̂_n(X_i) = Y_i of all training data. Yet, the earlier proof still goes through. Hence, "memorizing the data" (governed by the parameter τ) is completely decoupled from the bias-variance trade-off (as given by the parameter h).
Minimax rates are possible (with a suitable choice of h).
159 / 174
![Page 211: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/211.jpg)
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression
160 / 174
![Page 212: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/212.jpg)
Min-norm interpolation in RKHS
Min-norm interpolation:

$$\hat f_n := \operatorname*{argmin}_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}, \quad \text{s.t. } f(x_i) = y_i,\ \forall i \le n.$$

Closed-form solution:

$$\hat f_n(x) = K(x, X)\, K(X, X)^{-1} Y$$

Note: GD/SGD started from 0 converges to this minimum-norm solution.

Difficulty in analyzing bias/variance of f̂_n: it is not clear how the random matrix K^{−1} "aggregates" the Y's between data points.
161 / 174
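A hedged sketch of the closed form f̂_n(x) = K(x, X) K(X, X)^{−1} Y, using a Laplace kernel as one possible choice (the kernel, bandwidth, and data model are my own illustrative assumptions):

```python
import numpy as np

def min_norm_interpolant(K_train, K_eval_train, Y):
    """Evaluate the minimum-RKHS-norm interpolant: f(x) = K(x, X) K(X, X)^{-1} Y."""
    return K_eval_train @ np.linalg.solve(K_train, Y)

def laplace_kernel(A, B, sigma=1.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

rng = np.random.default_rng(0)
n, d = 300, 10
X = rng.normal(size=(n, d))
Y = np.sign(X[:, 0]) + 0.5 * rng.normal(size=n)       # noisy labels
X_test = rng.normal(size=(1000, d))

f_train = min_norm_interpolant(laplace_kernel(X, X), laplace_kernel(X, X), Y)
f_test = min_norm_interpolant(laplace_kernel(X, X), laplace_kernel(X_test, X), Y)
print(np.max(np.abs(f_train - Y)))                    # ~0: perfect fit of the noisy training labels
print(np.mean((f_test - np.sign(X_test[:, 0])) ** 2)) # yet the test error can remain reasonable
```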
![Page 213: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/213.jpg)
Min-norm interpolation in R^d with d ≍ n

(Hastie, Montanari, Rosset, Tibshirani, 2019)

Linear regression with p, n → ∞, p/n → γ, where the empirical spectral distribution of Σ (the discrete measure on its set of eigenvalues) converges to a fixed measure.

Apply random matrix theory to explore the asymptotics of the excess prediction error as γ, the noise variance, and the eigenvalue distribution vary.

Also study the asymptotics of a model involving random nonlinear features.
162 / 174
![Page 214: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/214.jpg)
Implicit regularization in regime d ≍ n

(Liang and R., 2018)

Informal Theorem: Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.
Bias bounded in terms of an effective dimension

$$\operatorname{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\left[\gamma + \lambda_j\!\left(\frac{XX^\top}{d}\right)\right]^2}$$

where γ > 0 is due to implicit regularization arising from the curvature of the kernel and high dimensionality.

Compare with regularized least squares as a low-pass filter:

$$\operatorname{dim}(XX^\top) = \sum_j \frac{\lambda_j\!\left(\frac{XX^\top}{d}\right)}{\lambda + \lambda_j\!\left(\frac{XX^\top}{d}\right)}$$

163 / 174
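A small sketch that simply evaluates the two displayed effective dimensions from the eigenvalues of XXᵀ/d (the values of γ and λ and the random design are illustrative assumptions):

```python
import numpy as np

def effective_dims(X, gamma, lam):
    """Compute the two effective dimensions above from the spectrum of X X^T / d."""
    d = X.shape[1]
    eigs = np.linalg.eigvalsh(X @ X.T / d)
    dim_interp = np.sum(eigs / (gamma + eigs) ** 2)   # implicit regularization (kernel curvature + high dim)
    dim_ridge = np.sum(eigs / (lam + eigs))           # regularized least squares as a low-pass filter
    return dim_interp, dim_ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))        # n = 100 samples in d = 500 dimensions
print(effective_dims(X, gamma=0.1, lam=0.1))
```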
![Page 215: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/215.jpg)
Implicit regularization in regime d ≍ n

Test error (y-axis) as a function of how quickly the spectrum decays. Best performance if the spectrum decays, but not too quickly.

$$\lambda_j(\Sigma) = \left(1 - \left(\frac{j}{d}\right)^{\kappa}\right)^{1/\kappa}$$
164 / 174
![Page 216: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/216.jpg)
Lower Bounds
(R. and Zhai, 2018)
Min-norm solution for the Laplace kernel

$$K_\sigma(x, x') = \sigma^{-d} \exp\bigl(-\|x - x'\|/\sigma\bigr)$$

is not consistent if d is low:

Theorem. For odd dimension d, with probability 1 − O(n^{−1/2}), for any choice of σ,

$$\mathbb{E}\,\bigl\|\hat f_n - f^*\bigr\|^2 \ge \Omega_d(1).$$
Conclusion: need high dimensionality of data.
165 / 174
![Page 217: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/217.jpg)
Connection to neural networks

For wide enough randomly initialized neural networks, GD dynamics quickly converge to an (approximately) min-norm interpolating solution with respect to a certain kernel.

$$f(x) = \frac{1}{\sqrt m}\sum_{i=1}^m a_i\,\sigma(\langle w_i, x\rangle)$$

For square loss and ReLU σ,

$$\frac{\partial L}{\partial w_i} = \frac{1}{\sqrt m}\,\frac{1}{n}\sum_{j=1}^n \bigl(f(x_j) - y_j\bigr)\,a_i\,\sigma'(\langle w_i, x_j\rangle)\,x_j$$

GD in continuous time:

$$\frac{d}{dt}\, w_i(t) = -\frac{\partial L}{\partial w_i}$$

Thus

$$\frac{d}{dt}\, f(x) = \sum_{i=1}^m \left(\frac{\partial f(x)}{\partial w_i}\right)^{\!\top}\!\left(\frac{dw_i}{dt}\right)
= -\frac{1}{n}\sum_{j=1}^n \bigl(f(x_j) - y_j\bigr)\underbrace{\left[\frac{1}{m}\sum_{i=1}^m a_i^2\,\sigma'(\langle w_i, x\rangle)\,\sigma'(\langle w_i, x_j\rangle)\,\langle x, x_j\rangle\right]}_{K_m(x,\,x_j)\ \to\ K_\infty(x,\,x_j)}$$

166 / 174
![Page 218: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/218.jpg)
Connection to neural networksFor wide enough randomly initialized neural networks, GD dynamicsquickly converge to (approximately) min-norm interpolating solution withrespect to a certain kernel.
f (x) =1√m
m∑i=1
aiσ(〈wi , x〉)
For square loss and relu σ,
∂L
∂wi=
1√m
1
n
n∑j=1
(f (xj)− yj)aiσ′(〈wi , xj〉)xj
GD in continuous time:
d
dtwi (t) = − ∂L
∂wi
Thus
d
dtf (x) =
m∑i=1
(1√maiσ′(〈wi , x〉)x
)T− 1√
m
1
n
n∑j=1
(f (xj)− yj)aiσ′(〈wi , xj〉)xj
166 / 174
![Page 219: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/219.jpg)
Connection to neural networksFor wide enough randomly initialized neural networks, GD dynamicsquickly converge to (approximately) min-norm interpolating solution withrespect to a certain kernel.
f (x) =1√m
m∑i=1
aiσ(〈wi , x〉)
For square loss and relu σ,
∂L
∂wi=
1√m
1
n
n∑j=1
(f (xj)− yj)aiσ′(〈wi , xj〉)xj
GD in continuous time:
d
dtwi (t) = − ∂L
∂wi
Thus
d
dtf (x) = − 1
n
n∑j=1
(f (xj)− yj)
[1
m
m∑i=1
a2i σ′(〈wi , x〉)σ′(〈wi , xj〉) 〈x , xj〉
]
166 / 174
![Page 220: Peter Bartlett Alexander Rakhlin - GitHub Pages · kw t+1k 2 = kw t + y tx tk 2 kw tk 2 + 2y t hw t;x ti+ 1 kw tk 2 + 1: Hence, kw Tk 2 m: For optimal hyperplane w hw;y tx ti= hw;w](https://reader035.vdocuments.mx/reader035/viewer/2022071412/6109e1ce2dcef707147647f0/html5/thumbnails/220.jpg)
Connection to neural networksFor wide enough randomly initialized neural networks, GD dynamicsquickly converge to (approximately) min-norm interpolating solution withrespect to a certain kernel.
f (x) =1√m
m∑i=1
aiσ(〈wi , x〉)
For square loss and relu σ,
∂L
∂wi=
1√m
1
n
n∑j=1
(f (xj)− yj)aiσ′(〈wi , xj〉)xj
GD in continuous time:
d
dtwi (t) = − ∂L
∂wi
Thusd
dtf (x) = − 1
n
n∑j=1
(f (xj)− yj)Km(x , xj)︸ ︷︷ ︸→K∞(x,xj )
166 / 174
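The bracketed quantity is the empirical tangent kernel at initialization. Below is a numpy sketch of K_m under the common assumptions w_i ~ N(0, I) and a_i ∈ {−1, +1} (the slides do not specify the initialization), showing its entries stabilizing as m grows:

```python
import numpy as np

def empirical_tangent_kernel(X, m, rng):
    """K_m(x_j, x_k) = (1/m) sum_i a_i^2 sigma'(<w_i, x_j>) sigma'(<w_i, x_k>) <x_j, x_k>."""
    W = rng.standard_normal((m, X.shape[1]))        # hidden weights w_i
    a = rng.choice([-1.0, 1.0], size=m)             # output weights a_i
    act = (X @ W.T > 0).astype(float)               # sigma'(<w_i, x_j>) for ReLU
    return ((act * a ** 2) @ act.T) * (X @ X.T) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # put inputs on the unit sphere
for m in [10, 1_000, 100_000]:
    print(f"m={m:7d}  K_m[0,1] = {empirical_tangent_kernel(X, m, rng)[0, 1]: .4f}")
```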
Connection to neural networks
(Xie, Liang, Song, '16), (Jacot, Gabriel, Hongler, '18), (Du, Zhai, Poczos, Singh, '18), etc.

K_\infty(x, x_j) \;=\; \mathbb{E}\bigl[\langle x, x_j\rangle\, \mathbf{1}\{\langle w, x\rangle \geq 0,\ \langle w, x_j\rangle \geq 0\}\bigr]

See Jason's talk.
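K_∞ is easy to estimate numerically. Assuming w ~ N(0, I) (the usual convention in the cited papers, not stated on the slide), the probability that both inner products are nonnegative is (π − angle(x, x_j)) / (2π), so the Monte-Carlo estimate in this sketch can be checked against that closed form:

```python
import numpy as np

def k_infty_mc(x, xj, num_w=200_000, seed=0):
    """Monte-Carlo estimate of E[<x, xj> 1{<w,x> >= 0, <w,xj> >= 0}], w ~ N(0, I)."""
    W = np.random.default_rng(seed).standard_normal((num_w, x.shape[0]))
    both = ((W @ x >= 0) & (W @ xj >= 0)).mean()
    return np.dot(x, xj) * both

rng = np.random.default_rng(1)
x, xj = rng.standard_normal(8), rng.standard_normal(8)
theta = np.arccos(np.dot(x, xj) / (np.linalg.norm(x) * np.linalg.norm(xj)))
print("Monte-Carlo:", k_infty_mc(x, xj))
print("closed form:", np.dot(x, xj) * (np.pi - theta) / (2 * np.pi))
```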
Linear Regression
(B., Long, Lugosi, Tsigler, 2019)

▶ (x, y) Gaussian, mean zero.
▶ Define:
   \Sigma := \mathbb{E}\,xx^\top = \sum_i \lambda_i v_i v_i^\top \quad (\text{assume } \lambda_1 \ge \lambda_2 \ge \cdots),
   \theta^* := \arg\min_\theta\, \mathbb{E}\bigl(y - x^\top\theta\bigr)^2,
   \sigma^2 := \mathbb{E}\bigl(y - x^\top\theta^*\bigr)^2.
▶ Estimator \hat\theta = (X^\top X)^{\dagger} X^\top y, which solves
   \min_\theta \|\theta\|^2 \quad \text{s.t.} \quad \|X\theta - y\|^2 = \min_\beta \|X\beta - y\|^2 = 0.
▶ When can the label noise be hidden in \hat\theta without hurting predictive accuracy?
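A minimal simulation of this estimator (the isotropic Gaussian design and one-hot θ* below are chosen purely for illustration); `np.linalg.lstsq` returns exactly the minimum-norm solution when the system is underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 500, 0.5                           # overparameterized: d > n
theta_star = np.zeros(d); theta_star[0] = 1.0
X = rng.standard_normal((n, d))
y = X @ theta_star + sigma * rng.standard_normal(n)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # (X^T X)^+ X^T y, the min-norm interpolant
print("training residual:", np.linalg.norm(X @ theta_hat - y))   # ~ 0: the noise is fit exactly
print("||theta_hat||^2 vs ||theta_star||^2:", theta_hat @ theta_hat, theta_star @ theta_star)
```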
Linear Regression
Excess prediction error:

L(\hat\theta) - L^* \;:=\; \mathbb{E}_{(x,y)}\Bigl[\bigl(y - x^\top\hat\theta\bigr)^2 - \bigl(y - x^\top\theta^*\bigr)^2\Bigr] \;=\; \bigl(\hat\theta - \theta^*\bigr)^\top \Sigma\, \bigl(\hat\theta - \theta^*\bigr).
Σ determines how estimation error in different parameter directions affects prediction accuracy.
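Continuing the previous sketch, here is how the excess risk can be evaluated directly from this identity; the diagonal, polynomially decaying Σ is again an arbitrary illustrative choice.

```python
import numpy as np

def excess_risk(theta_hat, theta_star, Sigma):
    """(theta_hat - theta_star)^T Sigma (theta_hat - theta_star)."""
    delta = theta_hat - theta_star
    return float(delta @ Sigma @ delta)

rng = np.random.default_rng(0)
n, d, sigma = 50, 500, 0.5
eigs = 1.0 / np.arange(1, d + 1) ** 2                # illustrative decaying spectrum
theta_star = np.zeros(d); theta_star[0] = 1.0
X = rng.standard_normal((n, d)) * np.sqrt(eigs)      # rows x ~ N(0, diag(eigs))
y = X @ theta_star + sigma * rng.standard_normal(n)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # min-norm interpolant
print("excess risk:", excess_risk(theta_hat, theta_star, np.diag(eigs)))
```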
Linear Regression
Theorem. For universal constants b, c, and any θ*, σ², Σ, if λ_n > 0 then

\frac{\sigma^2}{c}\, \min\Bigl\{\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)},\ 1\Bigr\} \;\le\; \mathbb{E}\,L(\hat\theta) - L^* \;\le\; c\,\Bigl(\|\theta^*\|^2 \sqrt{\tfrac{\operatorname{tr}(\Sigma)}{n}} \;+\; \sigma^2\Bigl(\frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)}\Bigr)\Bigr),

where k^* = \min\{k \ge 0 : r_k(\Sigma) \ge b\,n\}.
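A sketch of the quantities entering the bound, using the effective ranks r_k, R_k defined on the next slide; the constant b is unspecified in the theorem, so b = 1 below is only a placeholder, and the spectrum is a made-up example.

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k = sum_{i>k} lam_i / lam_{k+1},  R_k = (sum_{i>k} lam_i)^2 / sum_{i>k} lam_i^2."""
    tail = eigs[k:]                                  # eigs sorted in decreasing order
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

def k_star(eigs, n, b=1.0):
    """Smallest k with r_k(Sigma) >= b * n (None if no such k exists)."""
    return next((k for k in range(len(eigs)) if effective_ranks(eigs, k)[0] >= b * n), None)

n = 100
eigs = 1.0 / np.arange(1, 10_001) ** 0.5             # slowly decaying toy spectrum
k = k_star(eigs, n)
r_k, R_k = effective_ranks(eigs, k)
print(f"k* = {k},  variance-type term k*/n + n/R_k* = {k / n + n / R_k:.4f}")
```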
Notions of Effective Rank

Definition. Recall that λ_1 ≥ λ_2 ≥ ··· are the eigenvalues of Σ. Define the effective ranks

r_k(\Sigma) \;=\; \frac{\sum_{i>k} \lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) \;=\; \frac{\bigl(\sum_{i>k} \lambda_i\bigr)^2}{\sum_{i>k} \lambda_i^2}.

Example. If rank(Σ) = p, we can write

r_0(\Sigma) = \operatorname{rank}(\Sigma)\, s(\Sigma), \qquad R_0(\Sigma) = \operatorname{rank}(\Sigma)\, S(\Sigma),

with

s(\Sigma) \;=\; \frac{\tfrac{1}{p}\sum_{i=1}^{p} \lambda_i}{\lambda_1}, \qquad S(\Sigma) \;=\; \frac{\bigl(\tfrac{1}{p}\sum_{i=1}^{p} \lambda_i\bigr)^2}{\tfrac{1}{p}\sum_{i=1}^{p} \lambda_i^2}.

Both s and S lie between 1/p (λ_2 ≈ 0) and 1 (all λ_i equal).
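The two extremes in the example are easy to check numerically; a small sketch with made-up spectra:

```python
import numpy as np

def s_and_S(eigs):
    """s = (mean of eigenvalues) / lambda_1,  S = (mean)^2 / (mean of squares).
    Assumes eigs is sorted in decreasing order."""
    return eigs.mean() / eigs[0], eigs.mean() ** 2 / (eigs ** 2).mean()

p = 100
flat = np.ones(p)                                    # all eigenvalues equal
spiked = np.r_[1.0, 1e-6 * np.ones(p - 1)]           # lambda_1 dominates, the rest ~ 0
print("flat:  ", s_and_S(flat))                      # -> (1.0, 1.0)
print("spiked:", s_and_S(spiked))                    # -> both close to 1/p = 0.01
```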
Benign Overfitting in Linear Regression
Intuition
▶ To avoid harming prediction accuracy, the noise energy must be distributed across many unimportant directions:
  ▶ There should be many non-zero eigenvalues.
  ▶ They should have a small sum compared to n.
  ▶ There should be many eigenvalues no larger than λ_{k*} (the number depends on how asymmetric they are).
Overfitting?
▶ Fitting data “too” well versus
▶ Bias too low, variance too high.
Key takeaway: we should not conflate these two.
Outline
Introduction
Uniform Laws of Large Numbers
Classification
Beyond ERM
Interpolation
  Overfitting in Deep Networks
  Nearest neighbor
  Local Kernel Smoothing
  Kernel and Linear Regression