minimax rates for homology inference

70
Minimax Rates for Homology Inference Don Sheehy Joint work with Sivaraman Balakrishan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman

Upload: don-sheehy

Post on 11-May-2015

180 views

Category:

Documents


0 download

DESCRIPTION

Often, high dimensional data lie close to a low-dimensional submanifold and it is of interest to understand the geometry of these submanifolds. The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance, about the connected components, holes, tunnels and sometimes the dimension of the manifold. In this paper, we consider the statistical problem of estimating the homology of a manifold from noisy samples under several different noise models. We derive upper and lower bounds on the minimax risk for this problem. Our upper bounds are based on estimators which are constructed from a union of balls of appropriate radius around carefully selected points. In each case we establish complementary lower bounds using Le Cam's lemma.

TRANSCRIPT

Page 1: Minimax Rates for Homology Inference

Minimax Rates for

Homology Inference

Don Sheehy

Joint work with Sivaraman Balakrishan,

Alessandro Rinaldo, Aarti Singh, and

Larry Wasserman

Page 2: Minimax Rates for Homology Inference

Something like a joke.

Page 3: Minimax Rates for Homology Inference

Something like a joke.

What is topological inference?

Page 4: Minimax Rates for Homology Inference

Something like a joke.

What is topological inference?

It’s when you infer the topology of a space given only a finite subset.

Page 5: Minimax Rates for Homology Inference

What is topological inference?

It’s when you infer the topology of a space given only a finite subset.

Something like a joke.

Page 6: Minimax Rates for Homology Inference

We add geometric and statistical hypotheses to make the problem well-posed.

Geometric Assumption: The underlying space is a smooth manifold M.

Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.

Page 7: Minimax Rates for Homology Inference

We add geometric and statistical hypotheses to make the problem well-posed.

Geometric Assumption: The underlying space is a smooth manifold M.

Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.

Page 8: Minimax Rates for Homology Inference

We add geometric and statistical hypotheses to make the problem well-posed.

Geometric Assumption: The underlying space is a smooth manifold M.

Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.

Page 9: Minimax Rates for Homology Inference

We add geometric and statistical hypotheses to make the problem well-posed.

Geometric Assumption: The underlying space is a smooth manifold M.

Statistical Assumption:The points are drawn i.i.d. from a distribution derived from M.

Page 10: Minimax Rates for Homology Inference
Page 11: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Page 12: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Page 13: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Page 14: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

Page 15: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

sampled i.i.d.

Page 16: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

sampled i.i.d. distributionsupported on

Page 17: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

sampled i.i.d. distributionsupported on

with noise

Page 18: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

sampled i.i.d. distributionsupported on

an estimate of with noise

Page 19: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

sampled i.i.d. distributionsupported on

an estimate of with noise

probability of giving

a wrong answer

Page 20: Minimax Rates for Homology Inference

Input: n points from a d-manifold M in D-dimensions.

Output: The homology of M.

Upper bound: What is the worst case complexity?

Lower Bound: What is the worst case complexity of the best possible algorithm?

The Goal: Matching Bounds

(asymptotically)

sampled i.i.d. distributionsupported on

an estimate of with noise

probability of giving

a wrong answer

Page 21: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Page 22: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

Page 23: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

the best estimator

Page 24: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

the best estimator the hardest

distribution

Page 25: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

the best estimator the hardest

distributionproduct distribution

Page 26: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

the best estimator the hardest

distributionproduct distribution the true

homology

Page 27: Minimax Rates for Homology Inference

Minimax risk is the error probability of the best estimator on the hardest examples.

Minimax Risk: Rn = infH

supQ!Q

Qn(H != H(M))

Sample Complexity: n(!) = min{n : Rn ! !}

the best estimator the hardest

distributionproduct distribution the true

homology

Page 28: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Page 29: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that

Page 30: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that

M ! ballD(0, 1)1

Page 31: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that

M ! ballD(0, 1)1vol(M) ! cd2

Page 32: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that

M ! ballD(0, 1)1vol(M) ! cd2The reach of M is at most ! .3

Page 33: Minimax Rates for Homology Inference

We assume manifolds without boundary of bounded volume and reach.

Let M be the set of compact d-dimensionalRiemannian manifolds without boundary such that

Let P be the set of probability distributionssupported over M ! M with densities boundedfrom below by a constant a.

M ! ballD(0, 1)1vol(M) ! cd2The reach of M is at most ! .3

Page 34: Minimax Rates for Homology Inference

We consider 4 different noise models.Noiseless Clutter

Tubular Additive

Page 35: Minimax Rates for Homology Inference

We consider 4 different noise models.Noiseless Clutter

Tubular Additive

Q = P

Page 36: Minimax Rates for Homology Inference

We consider 4 different noise models.Noiseless Clutter

Tubular Additive

Q = PQ = (1! !)U + !P

P ! P

U is uniformon ball(0, 1)

Page 37: Minimax Rates for Homology Inference

We consider 4 different noise models.Noiseless Clutter

Tubular Additive

Q = PQ = (1! !)U + !P

P ! P

U is uniformon ball(0, 1)

Let QM,! be

uniform on M!.

Q = {QM,! : M ! M}

Page 38: Minimax Rates for Homology Inference

We consider 4 different noise models.Noiseless Clutter

Tubular Additive

Q = PQ = (1! !)U + !P

P ! P

U is uniformon ball(0, 1)

Let QM,! be

uniform on M!.

Q = {QM,! : M ! M}

Q = {P ! ! : P ! P}

! is Gaussian

with ! ! "

or ! has Fourier transformbounded away from 0and ! is fixed.

Page 39: Minimax Rates for Homology Inference

Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.

Page 40: Minimax Rates for Homology Inference

Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.

Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,

inf!

supQ!Q

EQn

!

"(!, !(Q))"

"1

8"(!(Q1), !(Q2))(1#TV(Q1, Q2))

2n

Page 41: Minimax Rates for Homology Inference

Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.

Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,

inf!

supQ!Q

EQn

!

"(!, !(Q))"

"1

8"(!(Q1), !(Q2))(1#TV(Q1, Q2))

2n

For homology, use the trivial metric. !(x, y) =

!

0 if x = y

1 if x != y

Page 42: Minimax Rates for Homology Inference

Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.

Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,

inf!

supQ!Q

EQn

!

"(!, !(Q))"

"1

8"(!(Q1), !(Q2))(1#TV(Q1, Q2))

2n

For homology, use the trivial metric. !(x, y) =

!

0 if x = y

1 if x != y

infH

supQ!Q

Qn(H != H(M)) "1

8(1# TV(Q1, Q2))

2n

Page 43: Minimax Rates for Homology Inference

Le Cam’s Lemma is a powerful tool for proving minimax lower bounds.

Lemma. Let Q be a set of distributions. Let !(Q) take valuesin a metric space (X, ") for Q ! Q. For any Q1, Q2 ! Q,

inf!

supQ!Q

EQn

!

"(!, !(Q))"

"1

8"(!(Q1), !(Q2))(1#TV(Q1, Q2))

2n

For homology, use the trivial metric. !(x, y) =

!

0 if x = y

1 if x != y

infH

supQ!Q

Qn(H != H(M)) "1

8(1# TV(Q1, Q2))

2nRn =

Page 44: Minimax Rates for Homology Inference

The lower bound requires two manifolds that are geometrically close but topologically distinct.

Page 45: Minimax Rates for Homology Inference

The lower bound requires two manifolds that are geometrically close but topologically distinct.

B = balld(0, 1! !) A = B \ balld(0, 2!)

Page 46: Minimax Rates for Homology Inference

The lower bound requires two manifolds that are geometrically close but topologically distinct.

B = balld(0, 1! !) A = B \ balld(0, 2!)

M1 = !(B! ) M2 = !(A! )

Page 47: Minimax Rates for Homology Inference

The lower bound requires two manifolds that are geometrically close but topologically distinct.

B = balld(0, 1! !) A = B \ balld(0, 2!)

The overlap

M1 = !(B! ) M2 = !(A! )

Page 48: Minimax Rates for Homology Inference

It suffices to bound the total variation distance.

Page 49: Minimax Rates for Homology Inference

It suffices to bound the total variation distance.Total Variation Distance:

TV(Q1, Q2) = supA

|Q1(A)!Q2(A)|

" amax{vol(M1 \M2), vol(M2 \M1)}

" Cda!d

Page 50: Minimax Rates for Homology Inference

It suffices to bound the total variation distance.

Minimax Risk:

Rn !1

8(1" TV(Q1, Q2)

2n !1

8(1" Cda!

d)2n !1

8e!2Cda!

dn

Total Variation Distance:

TV(Q1, Q2) = supA

|Q1(A)!Q2(A)|

" amax{vol(M1 \M2), vol(M2 \M1)}

" Cda!d

Page 51: Minimax Rates for Homology Inference

It suffices to bound the total variation distance.

Minimax Risk:

Rn !1

8(1" TV(Q1, Q2)

2n !1

8(1" Cda!

d)2n !1

8e!2Cda!

dn

Sampling Rate:n(!) !

!

1

"

"d

log1

#

Total Variation Distance:

TV(Q1, Q2) = supA

|Q1(A)!Q2(A)|

" amax{vol(M1 \M2), vol(M2 \M1)}

" Cda!d

Page 52: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Page 53: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Page 54: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Page 55: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1

Page 56: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

Page 57: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

Page 58: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

Page 59: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

Page 60: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

Page 61: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

Page 62: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

Page 63: Minimax Rates for Homology Inference

The upper bound uses a union of balls to estimate the homology of M.

Take a union of balls.1Compute the homology of the resulting Cech complex.2

0 Denoise the data.

To prove: The density is bounded from below near M and from above far from M.

Page 64: Minimax Rates for Homology Inference

Many fundamental problems are still open.

Page 65: Minimax Rates for Homology Inference

Many fundamental problems are still open.

1 Is the reach the right parameter?

Page 66: Minimax Rates for Homology Inference

Many fundamental problems are still open.

1 Is the reach the right parameter?

2 What about manifolds with boundary?

Page 67: Minimax Rates for Homology Inference

Many fundamental problems are still open.

1 Is the reach the right parameter?

2 What about manifolds with boundary?

3 Homotopy equivalence?

Page 68: Minimax Rates for Homology Inference

Many fundamental problems are still open.

1 Is the reach the right parameter?

2 What about manifolds with boundary?

3 Homotopy equivalence?

4 How to choose parameters?

Page 69: Minimax Rates for Homology Inference

Many fundamental problems are still open.

1 Is the reach the right parameter?

2 What about manifolds with boundary?

3 Homotopy equivalence?

4 How to choose parameters?

5 Are there efficient algorithms?

Page 70: Minimax Rates for Homology Inference

Thank you.