Regularization prescriptions and convex duality: density estimation and Renyi entropies
Ivan Mizera
University of Alberta, Department of Mathematical and Statistical Sciences
Edmonton, Alberta, Canada
Linz, October 2008
joint work with Roger Koenker (University of Illinois at Urbana-Champaign)
Gratefully acknowledging the support of the
Natural Sciences and Engineering Research Council of Canada
Density estimation (say)
A useful heuristic: maximum likelihood
Given the data points X1, X2, ..., Xn, solve
$$\prod_{i=1}^{n} f(X_i) \to \max_f!$$
or equivalently
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f!$$
under the side conditions
$$f \ge 0, \qquad \int f = 1$$
Note that useful...
[Figure: a fitted density on data in (0, 25)]
Dirac catastrophe!
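Why the catastrophe occurs, in one line (a standard argument, added here for completeness): let $f_\sigma$ be a normal mixture with a kernel of width $\sigma$ at every $X_i$; then
$$f_\sigma(X_i) \;\ge\; \frac{1}{n\sigma\sqrt{2\pi}} \quad\text{for every } i, \qquad\text{so}\qquad \prod_{i=1}^{n} f_\sigma(X_i) \;\ge\; \big(n\sigma\sqrt{2\pi}\big)^{-n} \to \infty \quad\text{as } \sigma \to 0,$$
and the likelihood blows up along a sequence of densities tending to a sum of Dirac spikes at the data.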
Preventing the disaster in the general case
• Sieves (...)
• Regularization: either constrain,
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad J(f) \le \Lambda,\ f \ge 0,\ \int f = 1$$
or penalize (the Lagrangian form of the same idea),
$$-\sum_{i=1}^{n} \log f(X_i) + \lambda J(f) \to \min_f! \qquad f \ge 0,\ \int f = 1$$
J(·) - penalty (penalizing complexity, lack of smoothness, etc.)
for instance, $J(f) = \int |(\log f)''| = \mathrm{TV}((\log f)')$
or also $J(f) = \int |(\log f)'''| = \mathrm{TV}((\log f)'')$
Good (1971), Good and Gaskins (1971), Silverman (1982), Leonard (1978), Gu (2002), Wahba, Lin, and Leng (2002)
See also: Eggermont and LaRiccia (2001), Ramsay and Silverman (2006), Hartigan (2000), Hartigan and Hartigan (1985), Davies and Kovac (2004)
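To make the penalty concrete, here is a small numerical sketch (mine, not from the talk) that evaluates J(f) = TV((log f)') by finite differences; the grid and the test densities are illustrative assumptions.

```python
# Illustrative sketch: the total-variation penalty J(f) = TV((log f)')
# approximated by finite differences on a grid.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

def tv_penalty(f):
    # TV((log f)') = total variation of the score function f'/f
    d1 = np.diff(np.log(f)) / dx          # (log f)' at grid midpoints
    return np.abs(np.diff(d1)).sum()      # sum of its jumps

f_normal = norm.pdf(x)
f_mix = 0.5 * norm.pdf(x, -3) + 0.5 * norm.pdf(x, 3)
print(tv_penalty(f_normal))  # ~12: for the normal, (log f)' = -x on [-6, 6]
print(tv_penalty(f_mix))     # larger: the bimodal log-density bends more
```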
See also in particular
Roger Koenker and Ivan Mizera (2007), Density estimation by total variation regularization
Roger Koenker and Ivan Mizera (2006), The alter egos of the regularized maximum likelihood density estimators: deregularized maximum-entropy, Shannon, Renyi, Simpson, Gini, and stretched strings
Roger Koenker, Ivan Mizera, and Jungmo Yoon (200?), What do kernel density estimators optimize?
Roger Koenker and Ivan Mizera (2008), Primal and dual formulations relevant for the numerical estimation of a probability density via regularization
Roger Koenker and Ivan Mizera (200?), Quasi-concave density estimation
http://www.stat.ualberta.ca/~mizera/
http://www.econ.uiuc.edu/~roger/
Preventing the disaster for special cases
• Shape constraint: monotonicity
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad f \text{ decreasing},\ f \ge 0,\ \int f = 1$$
Grenander (1956), Jongbloed (1998), Groeneboom, Jongbloed, and Wellner (2001), ...
• Shape constraint: (strong) unimodality
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad -\log f \text{ convex},\ f \ge 0,\ \int f = 1$$
Eggermont and LaRiccia (2000), Walther (2000), Rufibach and Dümbgen (2006), Pal, Woodroofe, and Meyer (2006)
Note
Shape constraint: no regularization parameter to be set...
... but of course, we need to believe that the shape is plausible
Regularization via TV penalty...
... vs. log-concavity shape constraint:
The differential operator is the same, only the constraint is somewhat different:
$$\int |(\log f)''| \le \Lambda, \qquad \text{in the dual} \quad |(\log f)''| \le \Lambda$$
Log-concavity: $(\log f)'' \le 0$
Only the functional analysis may be a bit more difficult...
... so let us do the shape-constrained case first
The hidden charm of log-concave distributions
A density f is called log-concave if −log f is convex.
(Usual conventions: −log 0 = ∞, convex where finite, ...)
Schoenberg 1940s, Karlin 1950s (monotone likelihood ratio)
Karlin (1968) - monograph about their mathematics
Barlow and Proschan (1975) - reliability
Flinn and Heckman (1975) - social choice
Caplin and Nalebuff (1991a,b) - voting theory
Devroye (1984) - how to simulate from them
Mizera (1994) - M-estimators
Uniform, Normal, Exponential, Logistic, Weibull, Gamma (the latter two for shape parameters ≥ 1)... - all log-concave
If f is log-concave, then
- it is unimodal ("strongly")
- the convolution with any unimodal density is unimodal
- the convolution with any log-concave density is log-concave
- f = e^{−g}, with g convex...
No heavy tails! t-distributions (finance!): not log-concave (!!)
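A one-line check of the shape-parameter caveat (my addition, not on the slide): for the Gamma(k) density,
$$f(x) \propto x^{k-1} e^{-x}, \qquad (-\log f)''(x) = \frac{k-1}{x^2},$$
which is nonnegative on (0, ∞), i.e. f is log-concave, exactly when k ≥ 1; the Weibull case is analogous.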
A convex problem
Let g = −log f; let K be the cone of convex functions.
The original problem is transformed:
$$\sum_{i=1}^{n} g(X_i) \to \min_g! \qquad g \in K,\ \int e^{-g} = 1$$
then the normalization constraint is absorbed into the objective:
$$\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g! \qquad g \in K$$
and generalized: let ψ be convex and nonincreasing (like e^{−x}):
$$\sum_{i=1}^{n} g(X_i) + \int \psi(g) \to \min_g! \qquad g \in K$$
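Why the normalization constraint can be absorbed (a standard shift argument, cf. Silverman 1982; my reconstruction, written for the 1/n-normalized objective used on the next slide): writing $F(g) = \frac{1}{n}\sum_i g(X_i) + \int e^{-g}$ and shifting g by a constant c,
$$F(g + c) = \frac{1}{n}\sum_{i=1}^{n} g(X_i) + c + e^{-c}\int e^{-g},$$
which is minimized over c exactly when $e^{-c}\int e^{-g} = 1$, i.e. when $\int e^{-(g+c)} = 1$. So every minimizer of the penalized problem automatically satisfies the constraint, while on the constraint set the two objectives differ only by the constant 1.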
Primal and dual
Recall: K is the cone of convex functions; ψ is convex and nonincreasing
The strong Fenchel dual of
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g! \qquad g \in K \tag{P}$$
is
$$-\int \psi^*(-f)\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
Extremal relation: f = −ψ′(g).
For penalized estimation, in a discretized setting: Koenker and Mizera (2007b)
Remarks
$\psi^*(y) = \sup_{x \in \operatorname{dom} \psi} (yx - \psi(x))$ is the conjugate of ψ
If primal solutions g are sought in some space, then dual solutions G are sought in a dual space:
for instance, if g ∈ C(X), and X is compact, then G ∈ C(X)*, the space of (signed) Radon measures on X.
The equality $f = \frac{d(P_n - G)}{dx}$ is thus a feasibility constraint
(for other G, the dual objective is −∞)
K* is the dual cone to K - the collection of (signed) Radon measures such that $\int g\,dG \ge 0$ for any convex g.
Dual: good for computation...
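A quick numerical illustration of the conjugate (mine, not from the talk): compute ψ*(y) = sup_x (yx − ψ(x)) over a grid for ψ(x) = e^{−x} and compare with the closed form ψ*(y) = −y log(−y) + y for y < 0; the grid bounds are arbitrary assumptions.

```python
# Numerical Fenchel conjugate of psi(x) = exp(-x), checked against the
# closed form psi*(y) = -y*log(-y) + y (valid for y < 0).
import numpy as np

x = np.linspace(-10.0, 30.0, 200001)     # grid over dom(psi)

def conjugate(y):
    return np.max(y * x - np.exp(-x))    # sup_x (y*x - psi(x)) on the grid

for y in (-0.25, -1.0, -2.5):
    print(y, conjugate(y), -y * np.log(-y) + y)
# the two columns agree: the sup is attained at x = -log(-y)
```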
Dual: good not only for computation
Couldn't we have heavy-tailed distributions here too?
... possibly going beyond log-concavity?
Recall: the strong Fenchel dual of
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g! \qquad g \in K \tag{P}$$
is
$$-\int \psi^*(-f)\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
Extremal relation: f = −ψ′(g).
Instance: maximum likelihood, α = 1
For ψ(x) = e^{−x}, we have
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g! \qquad g \in K \tag{P}$$
$$-\int f \log f\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
... a maximum entropy formulation
Extremal relation: f = e^{−g}
g required convex → f log-concave
How about entropies alternative to Shannon entropy?
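The conjugate calculation behind this instance (a reconstruction; the slide states only the result): for ψ(x) = e^{−x} and f > 0,
$$\psi^*(-f) = \sup_x \big({-fx} - e^{-x}\big) = f \log f - f,$$
attained at x = −log f (the extremal relation), so
$$-\int \psi^*(-f)\,dx = -\int f \log f\,dx + \int f\,dx = -\int f \log f\,dx + 1$$
on the feasible set: the Shannon-entropy objective, up to a constant.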
Renyi system
Renyi (1961, 1965): entropies defined with the help of
$$\frac{1}{1-\alpha} \log\Big(\int f^{\alpha}(x)\,dx\Big),$$
with Shannon entropy being the limiting form as α → 1.
Various entropies correspond to various known divergences:
α = 1: Shannon entropy, Kullback-Leibler divergence
α = 2: Renyi-Simpson-Gini entropy, Pearson's χ²
α = 1/2: Hellinger distance
α = 0: reversed Kullback-Leibler
New heuristics: MLE → Shannon dual → Renyi duals → ? primals
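The α → 1 limit, spelled out (standard; added for completeness): since ∫f = 1 makes log∫f^α vanish at α = 1, L'Hôpital's rule gives
$$\lim_{\alpha \to 1} \frac{\log \int f^{\alpha}}{1-\alpha} = \lim_{\alpha \to 1} \frac{\int f^{\alpha} \log f \,\big/ \int f^{\alpha}}{-1} = -\int f \log f\,dx,$$
the Shannon entropy.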
ψ and ψ* for various α
[Figure: graphs of ψ (left) and ψ* (right) for α = 2, 1, 1/2, 0]
Some properties for all α
The density estimators with Renyi entropies, as defined above:
• are supported by the convex hull of the data
• have expected value equal to the sample mean of the data
• have a primal solution g that is a polyhedral convex function (that is, it is determined by its values at the data points Xi, and is the maximal convex function minorizing those)
• are well-defined: the minimum of the primal formulation is attained
Instance: α = 2
$$-\int f^2(y)\,dy \to \max_f! \qquad f = \frac{d(P_n - G)}{dy},\ G \in K^*. \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \frac{1}{2}\int g^2\,dx \to \min_g! \qquad g \in K \tag{P}$$
Minimum Pearson χ², maximum Renyi-Simpson-Gini entropy
Extremal relation: f = −g
g required convex → f concave
That yields a class more restrictive than log-concave
- and thus is not of interest for us!
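For the record, a reconstruction of the conjugate pair (the slide shows only the results; I take ψ(x) = ½x² on the relevant domain, suitably extended to remain nonincreasing):
$$\psi^*(y) = \tfrac{1}{2}y^2 \ \ (y \le 0), \qquad f = -\psi'(g) = -g,$$
so the dual objective is −½∫f²; the constant factor ½ does not affect the maximizer, matching the displayed (D).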
But perhaps for others...
Replacing g by −f gives
$$-\frac{1}{n}\sum_{i=1}^{n} f(X_i) + \frac{1}{2}\int f^2\,dx \to \min_f! \qquad \text{subject to } -f \in K$$
the objective function of the "least squares estimator" of Groeneboom, Jongbloed, and Wellner (2001)
A folk tune (in the penalized context): Aidu and Vapnik (1989), Terrell (1990)
... and more generally, the primal form for α > 1 is equivalent to the objective function of the "minimum density power divergence estimators" introduced by Basu, Harris, Hjort, and Jones (1998) in the context of parametric M-estimation.
De profundis: α = 0
Not explicitly a member of the Renyi family - nevertheless, a limit:
$$\int \log f\,dy \to \max_f! \qquad f = \frac{d(P_n - G)}{dy},\ G \in K^*, \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) - \int \log g\,dx \to \min_{g \in C(X)}! \qquad g \in K. \tag{P}$$
Empirical likelihood (Owen, 2001)
Extremal relation: g = 1/f
the primal thus estimates the "sparsity function"
g required convex → 1/f convex
- that would yield a very nice family of functions...
... but numerically still fragile.
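Again the conjugate pair, reconstructed (here ψ(x) = −log x on x > 0, which is convex and decreasing): for f > 0,
$$\psi^*(-f) = \sup_{x > 0}\big({-fx} + \log x\big) = -1 - \log f,$$
attained at x = 1/f, so $-\int \psi^*(-f)\,dx = \int \log f\,dx + |X|$, the empirical-likelihood objective up to a constant; the maximizing x = 1/f is the extremal relation g = 1/f.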
The hierarchy of ρ-convex functions
Hardy, Littlewood, and Polya (1934): means of order ρ
Avriel (1972): ρ-convex functions
ρ < 0: f^ρ convex
ρ = 0: log-concave
ρ > 0: f^ρ concave
The class of ρ-convex densities grows with decreasing ρ:
if ρ1 < ρ2, then every ρ2-convex density is ρ1-convex
Every ρ-convex density is quasi-concave: it has convex upper level sets
Our α corresponds to ρ = α − 1 - that is:
if we follow the estimating prescription whose dual involves the Renyi α-entropy, then the result is guaranteed to lie in the domain of (α − 1)-convex functions
So the winner is: α = 1/2
"Moderate progress within the limits of law", "Hellinger selector":
$$\int \sqrt{f}\,dx \to \max_f! \qquad \text{subject to } f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \frac{1}{g}\,dx \to \min_{g \in C(X)}! \qquad g \in K \tag{P}$$
Extremal relation: f = g^{−2}
g required convex → f^{−1/2} convex (f is −1/2-convex)
- all log-concave densities
- all of the t family
the primal thus estimates f^{−1/2} (... rootosparsity)
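A numerical sanity check of the last two claims (mine, not from the talk): f^{−1/2} is convex for Student t densities of any degrees of freedom, so they are feasible here even though −log f is not convex; the grid and tolerance are arbitrary assumptions.

```python
# Sanity check: Student t densities are -1/2-convex (f^(-1/2) is convex),
# hence feasible for the Hellinger estimator, though not log-concave.
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-20, 20, 4001)

def is_convex(h, tol=1e-9):
    return bool(np.all(np.diff(h, 2) >= -tol))   # second differences >= 0

for df in (1, 2, 3, 10):
    f = t.pdf(x, df)
    print(f"t({df}):", is_convex(f ** -0.5), is_convex(-np.log(f)))
    # prints True, False: -1/2-convex but not log-concave

f = norm.pdf(x)
print("normal:", is_convex(f ** -0.5), is_convex(-np.log(f)))  # True, True
```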
Weibull, n = 200; left Shannon, right Hellinger
[Figure: two density estimates on (−4, 12)]
Another Weibull, n = 200; left Shannon, right Hellinger
[Figure: two density estimates on (−1, 3.5)]
Four points at the vertices of the square
Student data on criminal fingers
[Figure: contour plot of the bivariate estimate]
Once again, but with logarithmic contours
[Figure: the same estimate, logarithmic contours]
Simulated data: uniform distribution
[Figure: contour plot of the estimate]
A panoramic view
[Figure: perspective plot of the estimated density]
Computation
Main problem: enforcing the convexity constraint in the optimization
Easy in dimension 1; in dimension 2, the most promising way seems to be to employ a finite-difference scheme: estimate the Hessian, the matrix of second derivatives, by finite differences...
... and then constrain this matrix to be positive semidefinite
That means: semidefinite programming...
... but with a (slightly) nonlinear objective function.
In dimension two, one can express the semidefiniteness of the matrix by a rotated quadratic cone...
... and the reciprocal value can be handled by the same trick.
Thus, the Hellinger selector turns out to be computationally easier than (Shannon) maximum likelihood...
We acknowledge using Mosek, a Danish commercial implementation by Erling Andersen, and an open-source code by Michael Saunders
See also Cule, Samworth, and Stewart (2008)
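As a concrete companion to this slide, here is a minimal one-dimensional sketch of the discretized Hellinger primal, written with cvxpy as a stand-in for the Mosek/open-source setup mentioned above; the grid construction, the sample, and the solver defaults are my illustrative assumptions, not the authors' implementation.

```python
# A 1-D discretization of the Hellinger (alpha = 1/2) primal:
#   (1/n) sum_i g(X_i) + int 1/g dx -> min  over convex g > 0,
# with the density recovered via the extremal relation f = g^{-2}.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=200)            # heavy-tailed sample

m = 400                                       # grid resolution (assumption)
x = np.linspace(X.min(), X.max(), m)
dx = x[1] - x[0]
edges = np.r_[x - dx / 2, x[-1] + dx / 2]
counts, _ = np.histogram(X, bins=edges)       # data collected per grid cell

g = cp.Variable(m)                            # g plays the role of f^{-1/2}
data_term = counts @ g / len(X)               # (1/n) sum_i g(X_i), binned
penalty = cp.sum(cp.inv_pos(g)) * dx          # int 1/g dx (forces g > 0)
convexity = [g[2:] - 2 * g[1:-1] + g[:-2] >= 0]   # second differences >= 0
cp.Problem(cp.Minimize(data_term + penalty), convexity).solve()

f_hat = g.value ** -2.0                       # extremal relation f = g^{-2}
print("total mass:", f_hat.sum() * dx)        # should be close to 1
```

The 2-D version replaces the second-difference constraint by positive semidefiniteness of a finite-difference Hessian, which, as the slide notes, a conic solver can express with rotated quadratic cones.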
Summary
• We can estimate a density restricted to a domain broader than log-concave - one that also includes heavy-tailed distributions.
• Generalizing the formulation dual to maximum likelihood within the family of Renyi entropies indexed by α, we obtain an interesting family of divergence-based primal/dual estimators.
• Each yields its estimates in the corresponding ρ-convex class, in a natural way.
• Our choice is α = 1/2, which in the dual picks the feasible density closest to the uniform, on the convex hull of the data, in Hellinger distance.
• It yields −1/2-convex densities, which include all log-concave densities but also the t family, that is, algebraic tails; seemingly all practically important quasi-concave densities.
• And in dimension 2 it is computationally somewhat more convenient than the other possibilities.
Duality heuristics
Recall: penalized estimation, discretized setting
Primal:
$$-\frac{1}{n}\sum_{i=1}^{n} g(x_i) + J(-Dg) + \int \psi(g) \to \min_g!$$
where (typically) $J(-Dg) = \lambda \int |g^{(k)}|^p$
Dual:
$$-\int \psi^*(f) - J^*(h) \to \max_{f,h}! \qquad f = \frac{d(P_n + D^*h)}{dx} \ge 0$$
where ψ* is again the conjugate to ψ,
J* is the conjugate to J,
D* is the operator adjoint to D,
and strong duality yields f = ψ′(g)
Instances
Silverman (1982), Leonard (1978): p = 2, k = 3
Gu (2002), Wahba, Lin, and Leng (2002): p = 2, k = 2
Davies and Kovac (2004), Hartigan (2000), Hartigan and Hartigan (1985): p = 1, k = 1
Koenker and Mizera (2006a,b,c): p = 1, k = 1, 2, 3
Recall: the conjugate of a norm is the indicator of the unit ball in the dual norm. If $J(-Dg) = \lambda \int |g'|$, then the dual is equivalent to
$$-\int \psi^*(f) \to \max_{f,h}! \qquad f = \frac{d(P_n + D^*h)}{dx} \ge 0, \quad \|h\|_\infty \le \lambda$$
If ψ(u) = e^u (which means that ψ*(u) = u log u − u; the linear term is immaterial since ∫f = 1), then the primal is a maximum likelihood prescription penalized by
$$\int |(\log f)'| = \mathrm{TV}(\log f)$$
And the dual means: stretch h, the antiderivative of f, in the L∞ neighborhood ("tube") of Pn... (and for other α as well!)
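The norm-conjugate fact, spelled out (standard convex analysis; added for completeness): for J(h) = λ‖h‖,
$$J^*(u) = \sup_h \big(\langle u, h\rangle - \lambda \|h\|\big) = \begin{cases} 0, & \|u\|_* \le \lambda,\\ +\infty, & \text{otherwise,}\end{cases}$$
where ‖·‖_* is the dual norm; with the L¹ penalty, ‖·‖_* = ‖·‖_∞, so the term −J*(h) in the dual becomes exactly the tube constraint ‖h‖_∞ ≤ λ.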
Stretching (“tauting”) strings
[Figure: cumulative distribution function, tube with δ = 0.1]
“tube” may be somewhat ambiguous...
[Figure]
...but nevertheless, there is one that matches
[Figure]
... and the density estimate is its derivative (Koenker and Mizera 2006b).