Fat-Shattering Dimension and Majorizing Measures: two ways to quantify complexity

David Pollard
Statistics Department
Yale University
http://www.stat.yale.edu/~pollard

Talk given at the Workshop on Simplicity and Likelihood, Yale University, November 2009

http://cowles.econ.yale.edu/conferences/simplicity/

VC classes of sets

$\mathcal{Q}$ = all quadrants $(-\infty, a] \times (-\infty, b]$ in $\mathbb{R}^2$

- How many patterns are picked out from $\{x_1, x_2, x_3\}$?

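[Figure: three points $x_1, x_2, x_3$ in the plane, with quadrants $Q_{111}$, $Q_{110}$, $Q_{100}$, $Q_{010}$, $Q_{000}$.]

$$V = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}$$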

- For all $x_1, x_2, x_3$, cannot get all $2^3$ patterns: $V$ has $< 8$ rows. For some $x_1, x_2$, can get all $2^2$ patterns.

- $\mathrm{VCdim}(\mathcal{Q}) = 2$
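
The pattern count can be checked by brute force. The sketch below is only an illustration: the function name `quadrant_patterns` and the two point configurations are assumptions, not taken from the slides. It relies on the observation that only corners $(a, b)$ built from the points' own coordinates (plus one corner below all of them) can produce different patterns.

```python
from itertools import product


def quadrant_patterns(points):
    """Set of 0/1 patterns picked out from `points` by quadrants (-inf, a] x (-inf, b]."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    # Only finitely many corners (a, b) matter: each coordinate of a point,
    # plus one value strictly below all of them (giving the empty pattern).
    a_values = [xs[0] - 1] + xs
    b_values = [ys[0] - 1] + ys
    return {
        tuple(int(x <= a and y <= b) for x, y in points)
        for a, b in product(a_values, b_values)
    }


# Hypothetical configurations: two points that are shattered, three that are not.
two = quadrant_patterns([(0, 1), (1, 0)])
three = quadrant_patterns([(0, 2), (1, 1), (2, 0)])
print(len(two), len(three))  # prints: 4 7
```

For the three-point configuration the missing pattern is (1, 0, 1): any quadrant containing the first and third points also contains the second.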

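- [Packing numbers for $\mathcal{Q}$ as a subset of $L^1(P)$, for each probability measure $P$ on $\mathbb{R}^2$]

If $Q_1, \dots, Q_N \in \mathcal{Q}$ and

$$P(Q_i \,\Delta\, Q_{i'}) > \varepsilon \quad \text{for } 1 \le i < i' \le N$$

then

$$N \le \text{constant} \times (1/\varepsilon)^2$$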

- Hard to prove with exponent 2 (I think). Easy to prove with exponent 4.

- Haussler (1995) got constant $\times\, (1/\varepsilon)^{\mathrm{VCdim}}$ for general VC classes of sets.

- Fat shattering?

ε-shattering: Kearns and Schapire (1994, §6)

= fat-shattering: Bartlett, Long, and Williamson (1996); Anthony and Bartlett (2002, §11.3)

≈ (not) stable: Talagrand (1987a); shatter at levels α, β: Talagrand (1996a); ??? Talagrand (1984); Fremlin??

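- $\mathcal{F}$ = a set of real-valued functions on a set $X$

  $x = (x_1, x_2, \dots, x_n) \in X^n$

  levels: $\alpha = (\alpha_1, \dots, \alpha_n)$ and $\beta = (\beta_1, \dots, \beta_n)$ with $\beta_j \ge \alpha_j$

- $\mathcal{F}$ picks out pattern $K \subseteq \{1, 2, \dots, n\}$ at levels $\alpha$ and $\beta$ means: there exists $f_K \in \mathcal{F}$ such that

$$f_K(x_j) \begin{cases} \ge \beta_j & \text{if } j \in K \\ \le \alpha_j & \text{if } j \in K^c \end{cases}$$

- $S(x, \alpha, \beta, \mathcal{F})$ = number of distinct patterns picked out, $\le 2^n$

- $\mathcal{F}$ shatters $x$ at levels $\alpha$ and $\beta$ means: $S(x, \alpha, \beta, \mathcal{F}) = 2^n$

- $\mathrm{sdim}(\varepsilon, \mathcal{F})$ is the largest $n$ such that $\mathcal{F}$ shatters some $x$ in $X^n$ at some levels $\alpha$ and $\beta$ with $\beta_j \ge \alpha_j + 2\varepsilon$ for all $j$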
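
For a finite class, the count $S(x, \alpha, \beta, \mathcal{F})$ can be computed directly from the definition. The sketch below is an illustration under assumptions: the name `count_patterns`, the four-function class, and the chosen points and levels are all hypothetical.

```python
from itertools import product


def count_patterns(functions, x, alpha, beta):
    """S(x, alpha, beta, F) for a finite list F of callables."""
    n = len(x)
    count = 0
    for pattern in product([False, True], repeat=n):  # pattern[j] is True iff j is in K
        if any(
            all(
                f(xj) >= bj if in_K else f(xj) <= aj
                for xj, aj, bj, in_K in zip(x, alpha, beta, pattern)
            )
            for f in functions
        ):
            count += 1
    return count


# Hypothetical finite class of increasing functions with values in [0, 1].
F = [
    lambda t: 0.0,
    lambda t: 0.5,
    lambda t: 1.0,
    lambda t: 0.0 if t < 0.5 else 1.0,
]
x = (0.0, 1.0)
wide = count_patterns(F, x, alpha=(0.0, 0.5), beta=(0.5, 1.0))    # gaps of 2 * 0.25
narrow = count_patterns(F, x, alpha=(0.0, 0.4), beta=(0.6, 1.0))  # gaps of 2 * 0.30
print(wide, narrow)  # prints: 4 3
```

With gaps of width 0.5 the two points are shattered (4 = 2^2 patterns); with gaps of width 0.6 the pattern K = {1} is lost, because no increasing function can sit at or above 0.6 at the first point and at or below 0.4 at the second.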

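Example:

$\mathcal{F}$ = all increasing functions $f$ on $\mathbb{R}$ with $0 \le f \le 1$

- Why must we have

$$0 \le \alpha_1 \le \cdots \le \alpha_3 + 2\varepsilon \le \beta_3 \le \alpha_4 \dots \ ?$$

- Why is $\mathrm{sdim}(\varepsilon, \mathcal{F}) \le (2\varepsilon)^{-1}$?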

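[Figure: horizontal axis marked $x_1, x_2, x_3, x_4$; vertical axis from 0 to 1 with the levels $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \beta_3, \alpha_4$ marked.]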
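
One possible answer to both questions, sketched under the assumption that the shattered points are labeled so that $x_1 < x_2 < \dots < x_n$ (the slide leaves the argument to the audience):

```latex
% Pattern K = \{j\}: some increasing f in F has f(x_j) \ge \beta_j and f(x_{j+1}) \le \alpha_{j+1}, so
\beta_j \le f(x_j) \le f(x_{j+1}) \le \alpha_{j+1}
\quad\Longrightarrow\quad
0 \le \alpha_1 \le \alpha_1 + 2\varepsilon \le \beta_1 \le \alpha_2 \le \alpha_2 + 2\varepsilon \le \beta_2 \le \dots \le \beta_n \le 1 .
% The n disjoint gaps [\alpha_j, \beta_j], each of length at least 2\varepsilon, all fit inside [0, 1]:
2\varepsilon n \le \sum_{j=1}^{n} (\beta_j - \alpha_j) \le 1
\quad\Longrightarrow\quad
n \le (2\varepsilon)^{-1} .
```

The outer bounds $0 \le \alpha_1$ and $\beta_n \le 1$ come from $0 \le f \le 1$ applied to the functions that pick out the empty pattern and the full pattern.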

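- Suppose $0 \le f \le 1$ for all $f$ in $\mathcal{F}$ and $X_1, X_2, \dots \sim$ iid $P$

- $$\Delta_n := \sup_{\mathcal{F}} \Big| n^{-1} \sum_{i \le n} f(X_i) - \int f \, dP \Big| = \sup_{\mathcal{F}} |P_n f - P f|$$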

- Talagrand (1987a) and Talagrand (1996a): failure of $\Delta_n \to 0$ a.s. (uniform SLLN) iff (roughly) there exists a subset $A$ with $PA > 0$ for which (almost) every sample from (nonatomic) $P(\cdot \mid A)$ can be shattered at some fixed levels $\alpha$ and $\beta$ (with $\alpha_j \equiv \alpha_1$ and $\beta_j \equiv \beta_1$)
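
To make $\Delta_n$ concrete, here is a minimal sketch for one easy case chosen purely for illustration (it is not an example from the talk): $\mathcal{F}$ = indicators of half-lines $(-\infty, t]$ and $P$ = Uniform(0, 1), for which $\Delta_n = \sup_t |P_n(-\infty, t] - t|$ can be computed exactly from the order statistics. The function name `delta_n_halflines` is hypothetical.

```python
import random


def delta_n_halflines(sample):
    """sup over t of |P_n(-inf, t] - t| when P is Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained at, or just before, an order statistic.
    return max(
        max((i + 1) / n - x, x - i / n)
        for i, x in enumerate(xs)
    )


random.seed(0)
for n in (10, 100, 1000, 10000):
    sample = [random.random() for _ in range(n)]
    print(n, round(delta_n_halflines(sample), 4))
```

For this class a uniform SLLN holds (the classical Glivenko-Cantelli theorem), so the printed values should shrink as $n$ grows.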

Mendelson and Vershynin (2003)

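- If $0 \le f \le 1$ for all $f$ in $\mathcal{F}$, and $P$ is a probability measure, and $\alpha \ge 1$, and $f_1, \dots, f_N$ in $\mathcal{F}$ with

$$\int |f_i - f_{i'}|^{\alpha} \, dP > (c_\alpha \varepsilon)^{\alpha} \quad \text{for } 1 \le i < i' \le N$$

(a packing assumption), then

$$N \le (C_\alpha / \varepsilon)^{6 \, \mathrm{sdim}(\varepsilon, \mathcal{F})}$$

for known constants $c_\alpha$ and $C_\alpha$.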

- Good consequences for bounding $\Delta_n$ when $X_1, X_2, \dots \sim$ iid $P$ if

$$\int_0^1 \sqrt{\mathrm{sdim}(\varepsilon, \mathcal{F}) \, \log(1/\varepsilon)} \; d\varepsilon < \infty$$
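
As a worked check, consider the class of increasing functions with $0 \le f \le 1$, for which $\mathrm{sdim}(\varepsilon, \mathcal{F}) \le (2\varepsilon)^{-1}$. The computation below assumes the square root in the integral condition covers the whole product $\mathrm{sdim}(\varepsilon, \mathcal{F}) \log(1/\varepsilon)$; it plugs in the upper bound on sdim and substitutes $\varepsilon = e^{-u}$.

```latex
\int_0^1 \sqrt{\frac{\log(1/\varepsilon)}{2\varepsilon}} \, d\varepsilon
  = \frac{1}{\sqrt{2}} \int_0^\infty u^{1/2} e^{u/2} e^{-u} \, du
  = \frac{1}{\sqrt{2}} \int_0^\infty u^{1/2} e^{-u/2} \, du
  = \frac{1}{\sqrt{2}} \cdot 2^{3/2} \, \Gamma(3/2)
  = \sqrt{\pi} < \infty .
```

So, under that reading, the integral condition is satisfied for this class.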

M&V method:

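- For some sample $x_1, \dots, x_m$ from $P$, with $m$ = something involving $\log N$ and $\varepsilon$, discretize:

$$V[i, j] := \lfloor f_i(x_j) / \varepsilon \rfloor$$

to get an $N \times m$ matrix $V$ with entries in $S_p = \{0, 1, \dots, p\}$ for $p = \lfloor 1/\varepsilon \rfloor$.

- For a good realization $x_1, \dots, x_m$ get

$$♥ \qquad m^{-1} \sum_{j \le m} |V[i, j] - V[i', j]|^{\alpha} > C'_\alpha \quad \text{for } 1 \le i < i' \le N$$

for some magic constant $C'_\alpha$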
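
The discretization step and the left-hand side of ♥ are simple to compute. The sketch below uses made-up inputs (three functions, four sample points, $\varepsilon = 0.25$); the helper names `discretize` and `mean_separation` are hypothetical, and nothing here reproduces how M&V choose $m$ or the constants.

```python
import math


def discretize(functions, sample, eps):
    """N x m integer matrix V[i][j] = floor(f_i(x_j) / eps)."""
    return [[math.floor(f(x) / eps) for x in sample] for f in functions]


def mean_separation(V, i, i2, alpha=1.0):
    """m^{-1} * sum_j |V[i][j] - V[i2][j]|^alpha, the left side of the heart condition."""
    m = len(V[i])
    return sum(abs(a - b) ** alpha for a, b in zip(V[i], V[i2])) / m


# Hypothetical inputs: three functions with values in [0, 1], four sample points.
functions = [lambda t: 0.0, lambda t: t, lambda t: t * t]
sample = [0.0, 0.5, 0.75, 1.0]
eps = 0.25
V = discretize(functions, sample, eps)
print(V)                         # [[0, 0, 0, 0], [0, 2, 3, 4], [0, 1, 2, 4]]
print(mean_separation(V, 1, 2))  # 0.5
```

The value $\varepsilon = 0.25$ was picked so that every quotient $f_i(x_j)/\varepsilon$ is exact in floating point; for general inputs the floor can be sensitive to rounding.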

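- Key question: for how many distinct $(J, z)$ with $J \subseteq S_p$ and $z \in \mathbb{Z}^J$ can $V$ “surround” $(J, z)$ in the sense: for each $K \subseteq J$ there is an $i_K$ for which

$$V[i_K, j] \begin{cases} \ge z_j + 1 & \text{if } j \in K \\ \le z_j - 1 & \text{if } j \in J \setminus K \end{cases} \quad ?$$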

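$$V = \begin{pmatrix} 6 & 3 & 2 \\ 5 & 1 & 1 \\ 4 & 2 & 3 \\ 1 & 0 & 5 \\ 2 & 6 & 4 \end{pmatrix}$$

$J = (1, 2)$, $z = (3, 2)$

[Figure: illustration for this example, with axis ticks 0, 1, ..., 6.]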

- Answer: if ♥ holds, then the number of distinct $(J, z)$ surrounded by $V$ is ≥ √N − 1.
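
The surrounding condition can be checked by brute force on the slide's 5 × 3 example. The sketch below rests on two assumptions: $J$ is read as a set of column indices of $V$ (0-based in the code, so the slide's $J = (1, 2)$ becomes `(0, 1)`), and only nonempty $J$ are counted. The function names are hypothetical.

```python
from itertools import combinations, product


def surrounds(V, J, z):
    """True if V surrounds (J, z); J holds 0-based column indices, z the matching centres."""
    for pattern in product([False, True], repeat=len(J)):  # pattern[k]: is J[k] in K?
        found = any(
            all(
                row[j] >= zj + 1 if in_K else row[j] <= zj - 1
                for j, zj, in_K in zip(J, z, pattern)
            )
            for row in V
        )
        if not found:
            return False
    return True


def count_surrounded(V):
    """Number of surrounded (J, z) with J nonempty and z within the range of entries of V."""
    m = len(V[0])
    lo = min(min(row) for row in V)
    hi = max(max(row) for row in V)
    return sum(
        surrounds(V, J, z)
        for size in range(1, m + 1)
        for J in combinations(range(m), size)
        for z in product(range(lo, hi + 1), repeat=size)
    )


V = [[6, 3, 2], [5, 1, 1], [4, 2, 3], [1, 0, 5], [2, 6, 4]]
print(surrounds(V, (0, 1), (3, 2)))  # True: the slide's J = (1, 2), z = (3, 2)
print(count_surrounded(V))
```

Restricting each $z_j$ to the range of entries of $V$ loses nothing: a surrounded coordinate needs one entry at least $z_j + 1$ and another at most $z_j - 1$.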

Majorizing measures

- Long and glorious history: Fernique (1975), ..., Talagrand (1987b), Talagrand (1990), Ledoux and Talagrand (1991, Chapter 11), Talagrand (1996b), Talagrand (2001), Talagrand (2005), Kwapien and Rosinski (2004), Bednorz (2006), Bednorz (2007), ...

  (List woefully incomplete)

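- $\Psi : \mathbb{R}^+ \to \mathbb{R}^+$ convex, increasing, with $\Psi(0) = 0$

  e.g. $\Psi_{\mathrm{gaus}}(x) := \exp(x^2/2) - 1$, useful for Gaussian processes

- Stochastic process $\{Z_t : t \in T\}$ indexed by metric space $(T, d)$ with $\|Z_s - Z_t\|_\Psi \le d(s, t)$, that is,

$$P \, \Psi\!\left( \frac{|Z_t - Z_s|}{d(s, t)} \right) \le 1 \quad \text{for all } s \ne t$$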

- Probability measure $\mu$ on $T$ is a majorizing measure if

$$\sup_{t \in T} \int_0^{\mathrm{diam}(T)} \Psi^{-1}\!\left( \frac{1}{\mu B(t, r)} \right) dr < \infty$$

where $B(t, r)$ = closed ball of radius $r$ around $t$.

- (Talagrand 1987b) for a Gaussian process,

$$P \sup_{t \in T} Z_t < \infty$$

if and only if there exists a majorizing measure (using $\Psi_{\mathrm{gaus}}$)

- Actually, Talagrand gave explicit bounds.
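
For a finite index set the majorizing-measure functional can be evaluated exactly, because $r \mapsto \mu B(t, r)$ is a step function. The sketch below is an illustration under assumptions: the finite $T$, the names `psi_gaus_inv` and `mm_functional`, and the two-point example are hypothetical. Finiteness of the supremum is only interesting for infinite $T$; the code just shows how the quantity is assembled.

```python
import math


def psi_gaus_inv(y):
    """Inverse of Psi_gaus(x) = exp(x^2 / 2) - 1 on [0, inf)."""
    return math.sqrt(2.0 * math.log1p(y))


def mm_functional(points, mu, dist):
    """sup_t of the integral over [0, diam(T)] of Psi_gaus^{-1}(1 / mu B(t, r)) dr, for finite T."""
    diam = max(dist(s, t) for s in points for t in points)
    worst = 0.0
    for t in points:
        dists = [dist(t, s) for s in points]
        radii = sorted(set(dists) | {diam})  # starts at 0 = d(t, t)
        total = 0.0
        for r_lo, r_hi in zip(radii, radii[1:]):
            # Closed-ball mass is constant for r in [r_lo, r_hi).
            mass = sum(w for d, w in zip(dists, mu) if d <= r_lo)
            if mass == 0.0:
                return math.inf
            total += psi_gaus_inv(1.0 / mass) * (r_hi - r_lo)
        worst = max(worst, total)
    return worst


# Hypothetical example: two points at distance 1 with equal mass.
value = mm_functional([0.0, 1.0], [0.5, 0.5], lambda s, t: abs(s - t))
print(value)
```

In the two-point example every ball of radius $r < 1$ has mass 1/2, so the value is $\Psi_{\mathrm{gaus}}^{-1}(2) = \sqrt{2 \log 3}$.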

One way to build a MM: (under mild conditions on Ψ)

- $T_k$ is a maximal set of points separated by at least $\mathrm{diam}(T)/2^k$, for $k = 1, 2, \dots$

- $N_k := \# T_k$

- $\mu_k$ = uniform distribution on $T_k$, mass $1/N_k$ at each point

- $\mu = \sum_{k=1}^{\infty} 2^{-k} \mu_k$ is a MM if

$$\sum_{k=1}^{\infty} 2^{-k} \, \Psi^{-1}(N_k) < \infty$$

- Other kinds of MMs exist and are useful
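
The construction above can be sketched on a finite set, using a greedy pass to pick each maximal separated subset. Everything specific here is an assumption for illustration: the grid $T$ on $[0, 1]$, the greedy selection rule, the helper names, and the truncation at three levels (so the weights $2^{-k}$ do not yet sum to 1).

```python
import math


def psi_gaus_inv(y):
    """Inverse of Psi_gaus(x) = exp(x^2 / 2) - 1 on [0, inf)."""
    return math.sqrt(2.0 * math.log1p(y))


def maximal_separated(points, dist, delta):
    """Greedy maximal subset whose points are pairwise at distance >= delta."""
    chosen = []
    for t in points:
        if all(dist(t, s) >= delta for s in chosen):
            chosen.append(t)
    return chosen


points = [i / 8 for i in range(9)]  # hypothetical T: a grid on [0, 1]
dist = lambda s, t: abs(s - t)
diam = max(dist(s, t) for s in points for t in points)

sizes = []
partial_sum = 0.0
for k in (1, 2, 3):
    T_k = maximal_separated(points, dist, diam / 2 ** k)
    sizes.append(len(T_k))  # N_k
    partial_sum += 2.0 ** (-k) * psi_gaus_inv(len(T_k))
print(sizes)        # [3, 5, 9]
print(partial_sum)  # first three terms of the sum of 2^{-k} Psi^{-1}(N_k)
```

The greedy set is maximal because every rejected point lies within distance less than `delta` of a point already chosen.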

- Use MM to construct nested partitions for “chaining argument” (Talagrand 2005)

- Use MM as a local smoothing operator [Kwapien and Rosinski (2004), Bednorz (2006), Bednorz (2007)]; get bounds on supremum (and increments) of stochastic process

- Traditional chaining arguments seem to require “local complexity” the same everywhere in $T$

- MM gives control where “local complexity” differs from one part of $T$ to another

Wild analogies and speculation

- Arguments for minimax rates of convergence of estimators often use a Bayes argument with uniform prior on a maximal set of points separated by at least $\varepsilon_n$

- Some minimax arguments have a suggestive similarity to MM arguments

- MM like a (possibly nonuniform) Bayes prior?

References

Anthony, M. and P. Bartlett (2002). Neural Network Learning: Theoretical Foundations. Cambridge University Press.

Bartlett, P. L., P. M. Long, and R. C. Williamson (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences 52, 434–452.

Bednorz, W. (2006). A theorem on majorizing measures. Annals of Probability 34, 1771–1781.

Bednorz, W. (2007). Hölder continuity of random processes. Journal of Theoretical Probability 20, 917–934.

Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. Springer Lecture Notes in Mathematics 480, 1–97.

Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory 69, 217–232.

Kearns, M. J. and R. E. Schapire (1994). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences 48, 464–497.

Kwapien, S. and J. Rosinski (2004). Sample Hölder continuity of stochastic processes and majorizing measures. Progress in Probability 58, 155–163.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Mendelson, S. and R. Vershynin (2003). Entropy and the combinatorial dimension. Inventiones Mathematicae 152, 37–55.

Talagrand, M. (1984). Pettis Integral and Measure Theory, Volume 307 of Memoirs AMS. American Mathematical Society.

Talagrand, M. (1987a). The Glivenko-Cantelli problem. Annals of Probability 15(3), 837–870.

Talagrand, M. (1987b). Regularity of Gaussian processes. Acta Mathematica 159, 99–149.

Talagrand, M. (1990). Sample boundedness of stochastic processes under increment conditions. Annals of Probability 18(1), 1–49.

Talagrand, M. (1996a). The Glivenko-Cantelli problem, ten years later. Journal of Theoretical Probability 9(2), 371–384.

Talagrand, M. (1996b). Majorizing measures: The generic chaining (in special invited paper). Annals of Probability 24, 1049–1103.

Talagrand, M. (2001). Majorizing measures without measures. Annals of Probability 29(1), 411–417.

Talagrand, M. (2005). The Generic Chaining: Upper and lower bounds of stochastic processes. Springer-Verlag.
