Regularization prescriptions and convex duality: density estimation and Renyi entropies
Ivan Mizera
University of Alberta, Department of Mathematical and Statistical Sciences
Edmonton, Alberta, Canada
Linz, October 2008
joint work with Roger Koenker (University of Illinois at Urbana-Champaign)
Gratefully acknowledging the support of the
Natural Sciences and Engineering Research Council of Canada
Density estimation (say)
A useful heuristic: maximum likelihood
Given the data points X1, X2, ..., Xn, solve
$$\prod_{i=1}^{n} f(X_i) \to \max_f!$$
or equivalently
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f!$$
under the side conditions
$$f \ge 0, \qquad \int f = 1$$
Note that useful...
[Figure: a fitted density on data in (0, 25)]
Dirac catastrophe!
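Why the catastrophe occurs, in one line (a standard argument, added here for completeness): let $f_\sigma$ be a normal mixture with a kernel of width $\sigma$ at every $X_i$; then
$$f_\sigma(X_i) \;\ge\; \frac{1}{n\sigma\sqrt{2\pi}} \quad\text{for every } i, \qquad\text{so}\qquad \prod_{i=1}^{n} f_\sigma(X_i) \;\ge\; \big(n\sigma\sqrt{2\pi}\big)^{-n} \to \infty \quad\text{as } \sigma \to 0,$$
and the likelihood blows up along a sequence of densities tending to a sum of Dirac spikes at the data.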
Preventing the disaster in the general case
• Sieves (...)
• Regularization: either constrain,
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad J(f) \le \Lambda,\ f \ge 0,\ \int f = 1$$
or penalize (the Lagrangian form of the same idea),
$$-\sum_{i=1}^{n} \log f(X_i) + \lambda J(f) \to \min_f! \qquad f \ge 0,\ \int f = 1$$
J(·) - penalty (penalizing complexity, lack of smoothness, etc.)
for instance, $J(f) = \int |(\log f)''| = \mathrm{TV}((\log f)')$
or also $J(f) = \int |(\log f)'''| = \mathrm{TV}((\log f)'')$
Good (1971), Good and Gaskins (1971), Silverman (1982), Leonard (1978), Gu (2002), Wahba, Lin, and Leng (2002)
See also: Eggermont and LaRiccia (2001), Ramsay and Silverman (2006), Hartigan (2000), Hartigan and Hartigan (1985), Davies and Kovac (2004)
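To make the penalty concrete, here is a small numerical sketch (mine, not from the talk) that evaluates J(f) = TV((log f)') by finite differences; the grid and the test densities are illustrative assumptions.

```python
# Illustrative sketch: the total-variation penalty J(f) = TV((log f)')
# approximated by finite differences on a grid.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

def tv_penalty(f):
    # TV((log f)') = total variation of the score function f'/f
    d1 = np.diff(np.log(f)) / dx          # (log f)' at grid midpoints
    return np.abs(np.diff(d1)).sum()      # sum of its jumps

f_normal = norm.pdf(x)
f_mix = 0.5 * norm.pdf(x, -3) + 0.5 * norm.pdf(x, 3)
print(tv_penalty(f_normal))  # ~12: for the normal, (log f)' = -x on [-6, 6]
print(tv_penalty(f_mix))     # larger: the bimodal log-density bends more
```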
See also in particular
Roger Koenker and Ivan Mizera (2007), Density estimation by total variation regularization
Roger Koenker and Ivan Mizera (2006), The alter egos of the regularized maximum likelihood density estimators: deregularized maximum-entropy, Shannon, Renyi, Simpson, Gini, and stretched strings
Roger Koenker, Ivan Mizera, and Jungmo Yoon (200?), What do kernel density estimators optimize?
Roger Koenker and Ivan Mizera (2008), Primal and dual formulations relevant for the numerical estimation of a probability density via regularization
Roger Koenker and Ivan Mizera (200?), Quasi-concave density estimation
http://www.stat.ualberta.ca/~mizera/
http://www.econ.uiuc.edu/~roger/
Preventing the disaster for special cases
• Shape constraint: monotonicity
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad f \text{ decreasing},\ f \ge 0,\ \int f = 1$$
Grenander (1956), Jongbloed (1998), Groeneboom, Jongbloed, and Wellner (2001), ...
• Shape constraint: (strong) unimodality
$$-\sum_{i=1}^{n} \log f(X_i) \to \min_f! \qquad -\log f \text{ convex},\ f \ge 0,\ \int f = 1$$
Eggermont and LaRiccia (2000), Walther (2000), Rufibach and Dümbgen (2006), Pal, Woodroofe, and Meyer (2006)
Note
Shape constraint: no regularization parameter to be set...
... but of course, we need to believe that the shape is plausible
Regularization via TV penalty...
... vs. log-concavity shape constraint:
The differential operator is the same, only the constraint is somewhat different:
$$\int |(\log f)''| \le \Lambda, \qquad \text{in the dual} \quad |(\log f)''| \le \Lambda$$
Log-concavity: $(\log f)'' \le 0$
Only the functional analysis may be a bit more difficult...
... so let us do the shape-constrained case first
The hidden charm of log-concave distributions
A density f is called log-concave if −log f is convex.
(Usual conventions: −log 0 = ∞, convex where finite, ...)
Schoenberg 1940s, Karlin 1950s (monotone likelihood ratio)
Karlin (1968) - monograph about their mathematics
Barlow and Proschan (1975) - reliability
Flinn and Heckman (1975) - social choice
Caplin and Nalebuff (1991a,b) - voting theory
Devroye (1984) - how to simulate from them
Mizera (1994) - M-estimators
Uniform, Normal, Exponential, Logistic, Weibull, Gamma (the latter two for shape parameters ≥ 1)... - all log-concave
If f is log-concave, then
- it is unimodal ("strongly")
- the convolution with any unimodal density is unimodal
- the convolution with any log-concave density is log-concave
- f = e^{−g}, with g convex...
No heavy tails! t-distributions (finance!): not log-concave (!!)
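A one-line check of the shape-parameter caveat (my addition, not on the slide): for the Gamma(k) density,
$$f(x) \propto x^{k-1} e^{-x}, \qquad (-\log f)''(x) = \frac{k-1}{x^2},$$
which is nonnegative on (0, ∞), i.e. f is log-concave, exactly when k ≥ 1; the Weibull case is analogous.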
A convex problem
Let g = −log f; let K be the cone of convex functions.
The original problem is transformed:
$$\sum_{i=1}^{n} g(X_i) \to \min_g! \qquad g \in K,\ \int e^{-g} = 1$$
then the normalization constraint is absorbed into the objective:
$$\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g! \qquad g \in K$$
and generalized: let ψ be convex and nonincreasing (like e^{−x}):
$$\sum_{i=1}^{n} g(X_i) + \int \psi(g) \to \min_g! \qquad g \in K$$
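Why the normalization constraint can be absorbed (a standard shift argument, cf. Silverman 1982; my reconstruction, written for the 1/n-normalized objective used on the next slide): writing $F(g) = \frac{1}{n}\sum_i g(X_i) + \int e^{-g}$ and shifting g by a constant c,
$$F(g + c) = \frac{1}{n}\sum_{i=1}^{n} g(X_i) + c + e^{-c}\int e^{-g},$$
which is minimized over c exactly when $e^{-c}\int e^{-g} = 1$, i.e. when $\int e^{-(g+c)} = 1$. So every minimizer of the penalized problem automatically satisfies the constraint, while on the constraint set the two objectives differ only by the constant 1.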
Primal and dual
Recall: K is the cone of convex functions; ψ is convex and nonincreasing
The strong Fenchel dual of
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g! \qquad g \in K \tag{P}$$
is
$$-\int \psi^*(-f)\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
Extremal relation: f = −ψ′(g).
For penalized estimation, in a discretized setting: Koenker and Mizera (2007b)
Remarks
$\psi^*(y) = \sup_{x \in \operatorname{dom} \psi} (yx - \psi(x))$ is the conjugate of ψ
If primal solutions g are sought in some space, then dual solutions G are sought in a dual space:
for instance, if g ∈ C(X), and X is compact, then G ∈ C(X)*, the space of (signed) Radon measures on X.
The equality $f = \frac{d(P_n - G)}{dx}$ is thus a feasibility constraint
(for other G, the dual objective is −∞)
K* is the dual cone to K - the collection of (signed) Radon measures such that $\int g\,dG \ge 0$ for any convex g.
Dual: good for computation...
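A quick numerical illustration of the conjugate (mine, not from the talk): compute ψ*(y) = sup_x (yx − ψ(x)) over a grid for ψ(x) = e^{−x} and compare with the closed form ψ*(y) = −y log(−y) + y for y < 0; the grid bounds are arbitrary assumptions.

```python
# Numerical Fenchel conjugate of psi(x) = exp(-x), checked against the
# closed form psi*(y) = -y*log(-y) + y (valid for y < 0).
import numpy as np

x = np.linspace(-10.0, 30.0, 200001)     # grid over dom(psi)

def conjugate(y):
    return np.max(y * x - np.exp(-x))    # sup_x (y*x - psi(x)) on the grid

for y in (-0.25, -1.0, -2.5):
    print(y, conjugate(y), -y * np.log(-y) + y)
# the two columns agree: the sup is attained at x = -log(-y)
```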
Dual: good not only for computation
Couldn't we have heavy-tailed distributions here too?
... possibly going beyond log-concavity?
Recall: the strong Fenchel dual of
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \psi(g)\,dx \to \min_g! \qquad g \in K \tag{P}$$
is
$$-\int \psi^*(-f)\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
Extremal relation: f = −ψ′(g).
Instance: maximum likelihood, α = 1
For ψ(x) = e^{−x}, we have
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int e^{-g} \to \min_g! \qquad g \in K \tag{P}$$
$$-\int f \log f\,dx \to \max_f! \qquad f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
... a maximum entropy formulation
Extremal relation: f = e^{−g}
g required convex → f log-concave
How about entropies alternative to Shannon entropy?
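The conjugate calculation behind this instance (a reconstruction; the slide states only the result): for ψ(x) = e^{−x} and f > 0,
$$\psi^*(-f) = \sup_x \big({-fx} - e^{-x}\big) = f \log f - f,$$
attained at x = −log f (the extremal relation), so
$$-\int \psi^*(-f)\,dx = -\int f \log f\,dx + \int f\,dx = -\int f \log f\,dx + 1$$
on the feasible set: the Shannon-entropy objective, up to a constant.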
Renyi system
Renyi (1961, 1965): entropies defined with the help of
$$\frac{1}{1-\alpha} \log\Big(\int f^{\alpha}(x)\,dx\Big),$$
with Shannon entropy being the limiting form as α → 1.
Various entropies correspond to various known divergences:
α = 1: Shannon entropy, Kullback-Leibler divergence
α = 2: Renyi-Simpson-Gini entropy, Pearson's χ²
α = 1/2: Hellinger distance
α = 0: reversed Kullback-Leibler
New heuristics: MLE → Shannon dual → Renyi duals → ? primals
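The α → 1 limit, spelled out (standard; added for completeness): since ∫f = 1 makes log∫f^α vanish at α = 1, L'Hôpital's rule gives
$$\lim_{\alpha \to 1} \frac{\log \int f^{\alpha}}{1-\alpha} = \lim_{\alpha \to 1} \frac{\int f^{\alpha} \log f \,\big/ \int f^{\alpha}}{-1} = -\int f \log f\,dx,$$
the Shannon entropy.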
ψ and ψ* for various α
[Figure: graphs of ψ (left) and ψ* (right) for α = 2, 1, 1/2, 0]
Some properties for all α
The density estimators with Renyi entropies, as defined above:
• are supported by the convex hull of the data
• have expected value equal to the sample mean of the data
• have a primal solution g that is a polyhedral convex function (that is, it is determined by its values at the data points Xi, and is the maximal convex function minorizing those)
• are well-defined: the minimum of the primal formulation is attained
Instance: α = 2
$$-\int f^2(y)\,dy \to \max_f! \qquad f = \frac{d(P_n - G)}{dy},\ G \in K^*. \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \frac{1}{2}\int g^2\,dx \to \min_g! \qquad g \in K \tag{P}$$
Minimum Pearson χ², maximum Renyi-Simpson-Gini entropy
Extremal relation: f = −g
g required convex → f concave
That yields a class more restrictive than log-concave
- and thus is not of interest for us!
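For the record, a reconstruction of the conjugate pair (the slide shows only the results; I take ψ(x) = ½x² on the relevant domain, suitably extended to remain nonincreasing):
$$\psi^*(y) = \tfrac{1}{2}y^2 \ \ (y \le 0), \qquad f = -\psi'(g) = -g,$$
so the dual objective is −½∫f²; the constant factor ½ does not affect the maximizer, matching the displayed (D).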
But perhaps for others...
Replacing g by −f gives
$$-\frac{1}{n}\sum_{i=1}^{n} f(X_i) + \frac{1}{2}\int f^2\,dx \to \min_f! \qquad \text{subject to } -f \in K$$
the objective function of the "least squares estimator" of Groeneboom, Jongbloed, and Wellner (2001)
A folk tune (in the penalized context): Aidu and Vapnik (1989), Terrell (1990)
... and more generally, the primal form for α > 1 is equivalent to the objective function of the "minimum density power divergence estimators" introduced by Basu, Harris, Hjort, and Jones (1998) in the context of parametric M-estimation.
De profundis: α = 0
Not explicitly a member of the Renyi family - nevertheless, a limit:
$$\int \log f\,dy \to \max_f! \qquad f = \frac{d(P_n - G)}{dy},\ G \in K^*, \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) - \int \log g\,dx \to \min_{g \in C(X)}! \qquad g \in K. \tag{P}$$
Empirical likelihood (Owen, 2001)
Extremal relation: g = 1/f
the primal thus estimates the "sparsity function"
g required convex → 1/f convex
- that would yield a very nice family of functions...
... but numerically still fragile.
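Again the conjugate pair, reconstructed (here ψ(x) = −log x on x > 0, which is convex and decreasing): for f > 0,
$$\psi^*(-f) = \sup_{x > 0}\big({-fx} + \log x\big) = -1 - \log f,$$
attained at x = 1/f, so $-\int \psi^*(-f)\,dx = \int \log f\,dx + |X|$, the empirical-likelihood objective up to a constant; the maximizing x = 1/f is the extremal relation g = 1/f.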
The hierarchy of ρ-convex functions
Hardy, Littlewood, and Polya (1934): means of order ρ
Avriel (1972): ρ-convex functions
ρ < 0: f^ρ convex
ρ = 0: log-concave
ρ > 0: f^ρ concave
The class of ρ-convex densities grows with decreasing ρ:
if ρ1 < ρ2, then every ρ2-convex density is ρ1-convex
Every ρ-convex density is quasi-concave: it has convex upper level sets
Our α corresponds to ρ = α − 1 - that is:
if we follow the estimating prescription whose dual involves the Renyi α-entropy, then the result is guaranteed to lie in the domain of (α − 1)-convex functions
So the winner is: α = 1/2
"Moderate progress within the limits of law", "Hellinger selector":
$$\int \sqrt{f}\,dx \to \max_f! \qquad \text{subject to } f = \frac{d(P_n - G)}{dx},\ G \in K^* \tag{D}$$
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) + \int \frac{1}{g}\,dx \to \min_{g \in C(X)}! \qquad g \in K \tag{P}$$
Extremal relation: f = g^{−2}
g required convex → f^{−1/2} convex (f is −1/2-convex)
- all log-concave densities
- all of the t family
the primal thus estimates f^{−1/2} (... rootosparsity)
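A numerical sanity check of the last two claims (mine, not from the talk): f^{−1/2} is convex for Student t densities of any degrees of freedom, so they are feasible here even though −log f is not convex; the grid and tolerance are arbitrary assumptions.

```python
# Sanity check: Student t densities are -1/2-convex (f^(-1/2) is convex),
# hence feasible for the Hellinger estimator, though not log-concave.
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-20, 20, 4001)

def is_convex(h, tol=1e-9):
    return bool(np.all(np.diff(h, 2) >= -tol))   # second differences >= 0

for df in (1, 2, 3, 10):
    f = t.pdf(x, df)
    print(f"t({df}):", is_convex(f ** -0.5), is_convex(-np.log(f)))
    # prints True, False: -1/2-convex but not log-concave

f = norm.pdf(x)
print("normal:", is_convex(f ** -0.5), is_convex(-np.log(f)))  # True, True
```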
Weibull, n = 200; left Shannon, right Hellinger
[Figure: two density estimates on (−4, 12)]
Another Weibull, n = 200; left Shannon, right Hellinger
[Figure: two density estimates on (−1, 3.5)]
Four points at the vertices of the square
Student data on criminal fingers
[Figure: contour plot of the bivariate estimate]
Once again, but with logarithmic contours
[Figure: the same estimate, logarithmic contours]
Simulated data: uniform distribution
[Figure: contour plot of the estimate]
A panoramic view
[Figure: perspective plot of the estimated density]
Computation
Main problem: enforcing the convexity constraint in the optimization
Easy in dimension 1; in dimension 2, the most promising way seems to be to employ a finite-difference scheme: estimate the Hessian, the matrix of second derivatives, by finite differences...
... and then constrain this matrix to be positive semidefinite
That means: semidefinite programming...
... but with a (slightly) nonlinear objective function.
In dimension two, one can express the semidefiniteness of the matrix by a rotated quadratic cone...
... and the reciprocal value can be handled by the same trick.
Thus, the Hellinger selector turns out to be computationally easier than (Shannon) maximum likelihood...
We acknowledge using Mosek, a Danish commercial implementation by Erling Andersen, and an open-source code by Michael Saunders
See also Cule, Samworth, and Stewart (2008)
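As a concrete companion to this slide, here is a minimal one-dimensional sketch of the discretized Hellinger primal, written with cvxpy as a stand-in for the Mosek/open-source setup mentioned above; the grid construction, the sample, and the solver defaults are my illustrative assumptions, not the authors' implementation.

```python
# A 1-D discretization of the Hellinger (alpha = 1/2) primal:
#   (1/n) sum_i g(X_i) + int 1/g dx -> min  over convex g > 0,
# with the density recovered via the extremal relation f = g^{-2}.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=200)            # heavy-tailed sample

m = 400                                       # grid resolution (assumption)
x = np.linspace(X.min(), X.max(), m)
dx = x[1] - x[0]
edges = np.r_[x - dx / 2, x[-1] + dx / 2]
counts, _ = np.histogram(X, bins=edges)       # data collected per grid cell

g = cp.Variable(m)                            # g plays the role of f^{-1/2}
data_term = counts @ g / len(X)               # (1/n) sum_i g(X_i), binned
penalty = cp.sum(cp.inv_pos(g)) * dx          # int 1/g dx (forces g > 0)
convexity = [g[2:] - 2 * g[1:-1] + g[:-2] >= 0]   # second differences >= 0
cp.Problem(cp.Minimize(data_term + penalty), convexity).solve()

f_hat = g.value ** -2.0                       # extremal relation f = g^{-2}
print("total mass:", f_hat.sum() * dx)        # should be close to 1
```

The 2-D version replaces the second-difference constraint by positive semidefiniteness of a finite-difference Hessian, which, as the slide notes, a conic solver can express with rotated quadratic cones.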
Summary
• We can estimate a density restricted to a domain broader than log-concave - one that also includes heavy-tailed distributions.
• Generalizing the formulation dual to maximum likelihood within the family of Renyi entropies indexed by α, we obtain an interesting family of divergence-based primal/dual estimators.
• Each yields its estimates in the corresponding ρ-convex class, in a natural way.
• Our choice is α = 1/2, which in the dual picks the feasible density closest to the uniform, on the convex hull of the data, in Hellinger distance.
• It yields −1/2-convex densities, which include all log-concave densities but also the t family, that is, algebraic tails; seemingly all practically important quasi-concave densities.
• And in dimension 2 it is computationally somewhat more convenient than the other possibilities.
Duality heuristics
Recall: penalized estimation, discretized setting
Primal:
$$-\frac{1}{n}\sum_{i=1}^{n} g(x_i) + J(-Dg) + \int \psi(g) \to \min_g!$$
where (typically) $J(-Dg) = \lambda \int |g^{(k)}|^p$
Dual:
$$-\int \psi^*(f) - J^*(h) \to \max_{f,h}! \qquad f = \frac{d(P_n + D^*h)}{dx} \ge 0$$
where ψ* is again the conjugate to ψ,
J* is the conjugate to J,
D* is the operator adjoint to D,
and strong duality yields f = ψ′(g)
Instances
Silverman (1982), Leonard (1978): p = 2, k = 3
Gu (2002), Wahba, Lin, and Leng (2002): p = 2, k = 2
Davies and Kovac (2004), Hartigan (2000), Hartigan and Hartigan (1985): p = 1, k = 1
Koenker and Mizera (2006a,b,c): p = 1, k = 1, 2, 3
Recall: the conjugate of a norm is the indicator of the unit ball in the dual norm. If $J(-Dg) = \lambda \int |g'|$, then the dual is equivalent to
$$-\int \psi^*(f) \to \max_{f,h}! \qquad f = \frac{d(P_n + D^*h)}{dx} \ge 0, \quad \|h\|_\infty \le \lambda$$
If ψ(u) = e^u (which means that ψ*(u) = u log u − u; the linear term is immaterial since ∫f = 1), then the primal is a maximum likelihood prescription penalized by
$$\int |(\log f)'| = \mathrm{TV}(\log f)$$
And the dual means: stretch h, the antiderivative of f, in the L∞ neighborhood ("tube") of Pn... (and for other α as well!)
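The norm-conjugate fact, spelled out (standard convex analysis; added for completeness): for J(h) = λ‖h‖,
$$J^*(u) = \sup_h \big(\langle u, h\rangle - \lambda \|h\|\big) = \begin{cases} 0, & \|u\|_* \le \lambda,\\ +\infty, & \text{otherwise,}\end{cases}$$
where ‖·‖_* is the dual norm; with the L¹ penalty, ‖·‖_* = ‖·‖_∞, so the term −J*(h) in the dual becomes exactly the tube constraint ‖h‖_∞ ≤ λ.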
Stretching (“tauting”) strings
[Figure: cumulative distribution function, tube with δ = 0.1]
“tube” may be somewhat ambiguous...
[Figure]
...but nevertheless, there is one that matches
[Figure]
... and the density estimate is its derivative (Koenker and Mizera 2006b).