minimax rates for homology inference
DESCRIPTION
Often, high dimensional data lie close to a low-dimensional submanifold and it is of interest to understand the geometry of these submanifolds. The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and sometimes the dimension of the manifold. In this paper, we consider the statistical problem of estimating the homology of a manifold from noisy samples under several different noise models. We derive upper and lower bounds on the minimax risk for this problem. Our upper bounds are based on estimators which are constructed from a union of balls of appropriate radius around carefully selected points. In each case we establish complementary lower bounds using Le Cam's lemma.TRANSCRIPT
Minimax Rates for
Homology Inference
Don Sheehy
Joint work with Sivaraman Balakrishan,
Alessandro Rinaldo, Aarti Singh, and
Larry Wasserman
Something like a joke.
Something like a joke.
What is topological inference?
Something like a joke.
What is topological inference?
It’s when you infer the topology of a space given only a finite subset.
What is topological inference?
It’s when you infer the topology of a space given only a finite subset.
Something like a joke.
We add geometric and statistical hypotheses to make the problem well-posed.
Geometric Assumption: The underlying space is a smooth manifold M.
Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.
We add geometric and statistical hypotheses to make the problem well-posed.
Geometric Assumption: The underlying space is a smooth manifold M.
Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.
We add geometric and statistical hypotheses to make the problem well-posed.
Geometric Assumption: The underlying space is a smooth manifold M.
Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.
We add geometric and statistical hypotheses to make the problem well-posed.
Geometric Assumption: The underlying space is a smooth manifold M.
Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.
Input: n points from a d-manifold M in D-dimensions.
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
sampled i.i.d.
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
sampled i.i.d. distributionsupported on
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
sampled i.i.d. distributionsupported on
with noise
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
sampled i.i.d. distributionsupported on
an estimate of with noise
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
sampled i.i.d. distributionsupported on
an estimate of with noise
probability of giving
a wrong answer
Input: n points from a d-manifold M in D-dimensions.
Output: The homology of M.
Upper bound: What is the worst case complexity?
Lower Bound: What is the worst case complexity of the best possible algorithm?
The Goal: Matching Bounds
(asymptotically)
sampled i.i.d. distributionsupported on
an estimate of with noise
probability of giving
a wrong answer
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
the best estimator
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
the best estimator the hardest
distribution
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
the best estimator the hardest
distributionproduct distribution
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
the best estimator the hardest
distributionproduct distribution the true
homology
Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: Rn = infH
supQ!Q
Qn(H != H(M))
Sample Complexity: n(!) = min{n : Rn ! !}
the best estimator the hardest
distributionproduct distribution the true
homology
We assume manifolds without boundary of bounded volume and reach.
We assume manifolds without boundary of bounded volume and reach.
Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that
We assume manifolds without boundary of bounded volume and reach.
Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that
M ! ballD(0, 1)1
We assume manifolds without boundary of bounded volume and reach.
Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that
M ! ballD(0, 1)1vol(M) ! cd2
We assume manifolds without boundary of bounded volume and reach.
Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that
M ! ballD(0, 1)1vol(M) ! cd2The reach of M is at most ! .3
We assume manifolds without boundary of bounded volume and reach.
Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that
Let P be the set of probability distributionssupported over M ! M with densities boundedfrom below by a constant a.
M ! ballD(0, 1)1vol(M) ! cd2The reach of M is at most ! .3
We consider 4 different noise models.Noiseless Clutter
Tubular Additive
We consider 4 different noise models.Noiseless Clutter
Tubular Additive
Q = P
We consider 4 different noise models.Noiseless Clutter
Tubular Additive
Q = PQ = (1! !)U + !P
P ! P
U is uniformon ball(0, 1)
We consider 4 different noise models.Noiseless Clutter
Tubular Additive
Q = PQ = (1! !)U + !P
P ! P
U is uniformon ball(0, 1)
Let QM,! be
uniform on M!.
Q = {QM,! : M ! M}
We consider 4 different noise models.Noiseless Clutter
Tubular Additive
Q = PQ = (1! !)U + !P
P ! P
U is uniformon ball(0, 1)
Let QM,! be
uniform on M!.
Q = {QM,! : M ! M}
Q = {P ! ! : P ! P}
! is Gaussian
with ! ! "
or ! has Fourier transformbounded away from 0and ! is fixed.
Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.
Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.
Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,
inf!
supQ!Q
EQn
!
"(!, !(Q))"
"1
8"(!(Q1), !(Q2))(1#TV(Q1, Q2))
2n
Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.
Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,
inf!
supQ!Q
EQn
!
"(!, !(Q))"
"1
8"(!(Q1), !(Q2))(1#TV(Q1, Q2))
2n
For homology, use the trivial metric. !(x, y) =
!
0 if x = y
1 if x != y
Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.
Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,
inf!
supQ!Q
EQn
!
"(!, !(Q))"
"1
8"(!(Q1), !(Q2))(1#TV(Q1, Q2))
2n
For homology, use the trivial metric. !(x, y) =
!
0 if x = y
1 if x != y
infH
supQ!Q
Qn(H != H(M)) "1
8(1# TV(Q1, Q2))
2n
Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.
Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,
inf!
supQ!Q
EQn
!
"(!, !(Q))"
"1
8"(!(Q1), !(Q2))(1#TV(Q1, Q2))
2n
For homology, use the trivial metric. !(x, y) =
!
0 if x = y
1 if x != y
infH
supQ!Q
Qn(H != H(M)) "1
8(1# TV(Q1, Q2))
2nRn =
The lower bound requires two manifolds that are geometrically close but topologically distinct.
The lower bound requires two manifolds that are geometrically close but topologically distinct.
B = balld(0, 1! !) A = B \ balld(0, 2!)
The lower bound requires two manifolds that are geometrically close but topologically distinct.
B = balld(0, 1! !) A = B \ balld(0, 2!)
M1 = !(B! ) M2 = !(A! )
The lower bound requires two manifolds that are geometrically close but topologically distinct.
B = balld(0, 1! !) A = B \ balld(0, 2!)
The overlap
M1 = !(B! ) M2 = !(A! )
It suffices to bound the total variation distance.
It suffices to bound the total variation distance.Total Variation Distance:
TV(Q1, Q2) = supA
|Q1(A)!Q2(A)|
" amax{vol(M1 \M2), vol(M2 \M1)}
" Cda!d
It suffices to bound the total variation distance.
Minimax Risk:
Rn !1
8(1" TV(Q1, Q2)
2n !1
8(1" Cda!
d)2n !1
8e!2Cda!
dn
Total Variation Distance:
TV(Q1, Q2) = supA
|Q1(A)!Q2(A)|
" amax{vol(M1 \M2), vol(M2 \M1)}
" Cda!d
It suffices to bound the total variation distance.
Minimax Risk:
Rn !1
8(1" TV(Q1, Q2)
2n !1
8(1" Cda!
d)2n !1
8e!2Cda!
dn
Sampling Rate:n(!) !
!
1
"
"d
log1
#
Total Variation Distance:
TV(Q1, Q2) = supA
|Q1(A)!Q2(A)|
" amax{vol(M1 \M2), vol(M2 \M1)}
" Cda!d
The upper bound uses a union of balls to estimate the homology of M.
The upper bound uses a union of balls to estimate the homology of M.
The upper bound uses a union of balls to estimate the homology of M.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
The upper bound uses a union of balls to estimate the homology of M.
Take a union of balls.1Compute the homology of the resulting Cech complex.2
0 Denoise the data.
To prove: The density is bounded from below near M and from above far from M.
Many fundamental problems are still open.
Many fundamental problems are still open.
1 Is the reach the right parameter?
Many fundamental problems are still open.
1 Is the reach the right parameter?
2 What about manifolds with boundary?
Many fundamental problems are still open.
1 Is the reach the right parameter?
2 What about manifolds with boundary?
3 Homotopy equivalence?
Many fundamental problems are still open.
1 Is the reach the right parameter?
2 What about manifolds with boundary?
3 Homotopy equivalence?
4 How to choose parameters?
Many fundamental problems are still open.
1 Is the reach the right parameter?
2 What about manifolds with boundary?
3 Homotopy equivalence?
4 How to choose parameters?
5 Are there efficient algorithms?
Thank you.