Distinct Sampling on Streaming Data with Near-Duplicates
Jiecao Chen (IUB), joint work with Qin Zhang (IUB)
Workshop on Data Summarisation, Mar/2018




Real-world data is often noisy

• images, videos, music ... after compression, resizing, etc.
• queries of the same meaning sent to Google: "data summarization", "summarization of data", "the summarization of data", "summarization data"


Models of Computation

data stream
• high-speed online data
• want space/time efficient algorithms

sliding window: only consider the most recent w items.


Magic hash function

Does there exist a magic hash function that
• maps all similar items to one id
• can be described succinctly?

Unlikely to have one ...

So we cannot apply existing streaming algorithms directly; we need some new ideas.


Robust ℓ0-sampling

• data: data points in R^d
• ℓ0-sampling: each distinct element is sampled with prob. 1/F0

Robust ℓ0-sampling: treat all close points as a group; each group is sampled with prob. 1/F0.

(Figure: example point set with F0 = 6.)


Formally ...
• S ⊂ R^d is (α, β)-sparse: either d(u, v) ≤ α or d(u, v) > β for all u, v ∈ S.
• when β/α > 2,

G(v) = { u ∈ S | d(u, v) ≤ α }

forms the group of v
• the dataset S is well-shaped
• a natural partition exists for a well-shaped dataset
• F0 is the number of groups
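The definitions above can be checked offline on a small point set. The following is a minimal sketch, not the streaming algorithm of the talk: the function names `is_sparse` and `natural_partition` and the example points are made up for illustration. It relies on the fact that when β/α > 2, two points within α of a common point are at distance at most 2α < β of each other, and hence within α.

```python
import math
from itertools import combinations

def is_sparse(points, alpha, beta):
    """True if every pair is either within alpha or farther apart than beta."""
    return all(
        math.dist(u, v) <= alpha or math.dist(u, v) > beta
        for u, v in combinations(points, 2)
    )

def natural_partition(points, alpha):
    """Group points by G(v) = {u : d(u, v) <= alpha}.

    Assumes the input is well-shaped, so comparing against the first
    point of each group is enough to decide membership.
    """
    groups = []  # groups[i][0] is the first point seen of group i
    for p in points:
        for g in groups:
            if math.dist(p, g[0]) <= alpha:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

# Hypothetical example data: three well-separated clusters in R^2.
points = [(0.0, 0.0), (0.3, 0.1), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
alpha, beta = 1.0, 3.0
groups = natural_partition(points, alpha)
f0 = len(groups)  # number of groups
```

This offline version stores every point, which is exactly what a streaming algorithm cannot afford; it only serves to make F0 and the natural partition concrete.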


Our goal:
• S is well-shaped, fed as a data stream
• G = {G_1, G_2, ..., G_F0} is the natural partition
• Goal: output a point u such that

∀i ∈ [F0], Pr[u ∈ G_i] = 1/F0

Focus on the R^2 case in this talk; it can be extended to general d.

Our algorithm also works with general datasets in R^O(1) (discussed later).


Infinite Window (R^2)

Basic idea
• over the data stream, mark one representative point for each group, e.g., pick the first arrived point of each group
• only sample from representative points

(Figure: points forming different groups.)


Question: Can we identify (not necessarily store) the first arrived point of each group space-efficiently?

Unfortunately, Ω(F0) space is required to identify the representative points in a noisy dataset.

Solution: sample in advance
• how to sample in advance?
  – place a random grid (side length α/2) in R^2, sample cells before we see the data stream
• how to decide the sample rate?
  – decrease it when we see more groups
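One way to "sample cells before seeing the stream" is to derive each cell's membership from a hash of its coordinates, so that nothing is stored per cell. The sketch below is an assumption-laden illustration, not the construction from the talk: SHA-256 stands in for whatever hash family the analysis requires, and `make_grid`, `cell_of` and `cell_is_sampled` are hypothetical names. With side length α/2 a cell has diagonal α/√2 < α, so in a well-shaped dataset all points in one cell belong to the same group.

```python
import hashlib
import math
import random

def make_grid(alpha, seed=0):
    """Return (side, offset): a grid of side alpha/2 with a random shift."""
    rng = random.Random(seed)
    side = alpha / 2
    offset = (rng.uniform(0, side), rng.uniform(0, side))
    return side, offset

def cell_of(p, side, offset):
    """Integer coordinates of the grid cell containing point p."""
    return (math.floor((p[0] - offset[0]) / side),
            math.floor((p[1] - offset[1]) / side))

def cell_is_sampled(cell, level, seed=0):
    """Sample a cell with prob. about 1/2**level, decided by a hash of its id.

    Because the test is 'hash divisible by 2**level', a cell sampled at
    level k+1 is also sampled at level k: lowering the rate only removes
    cells, it never adds new ones.
    """
    key = f"{seed}:{cell[0]}:{cell[1]}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return h % (1 << level) == 0
```

The nesting property is what makes "decrease the rate when we see more groups" workable in this sketch: after halving the rate, the sampled cells are a subset of the cells sampled before.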


(Figure: random grid over the point set.)
• blue cells: sampled cells
• red points: first arrived point of each group

Three types of groups:
• accepted: the first arrived point falls into a sampled cell
• ignored: no point falls into a sampled cell
• rejected: some point falls into a sampled cell, but not the first arrived point
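The three cases can be spelled out as a small offline classifier. This is only an illustration of the definitions: it assumes each group is given as a list of points in arrival order and uses an unshifted grid for brevity; `classify_groups` and the example data are hypothetical.

```python
import math

def classify_groups(groups, sampled_cells, side):
    """Label each group (a list of points in arrival order)."""
    def cell_of(p):
        # Unshifted grid, for illustration only.
        return (math.floor(p[0] / side), math.floor(p[1] / side))

    labels = []
    for g in groups:
        hits = [cell_of(p) in sampled_cells for p in g]
        if hits[0]:
            labels.append("accepted")   # first arrived point is in a sampled cell
        elif any(hits):
            labels.append("rejected")   # a later point is, but not the first
        else:
            labels.append("ignored")    # no point is in a sampled cell
    return labels

# Hypothetical example with alpha = 1, so side = 0.5.
example_groups = [
    [(0.1, 0.1), (0.6, 0.1)],
    [(5.6, 4.9), (5.2, 4.9)],
    [(10.0, 0.0)],
]
labels = classify_groups(example_groups, {(0, 0), (10, 9)}, 0.5)
```

Only accepted groups are eligible to be returned as samples; the difficulty addressed next is telling accepted and rejected groups apart without storing every point.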


How to maintain accepted groups?

• keep all first points of accepted groups

(Figure: a first point inside a sampled cell: keep it! A first point outside any sampled cell: discard?)

If we discard the first arrived point of a rejected group, then when a later point of that group falls into a sampled cell: is this the first arrived point? We don't know!

So we need to keep the first arrived point of a rejected group!

ADJ(p) = cells with distance smaller than α from p

If p is not in a sampled cell and ADJ(p) contains a sampled cell, keep p!
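ADJ(p) can be enumerated directly. The sketch below assumes an unshifted grid of side α/2 and measures the distance from p to a cell as the distance to the nearest point of that closed square; the names `point_to_cell_distance` and `adj` are illustrative. Since a cell three columns (or rows) away is at distance at least α, scanning offsets −2..2 is enough under these assumptions.

```python
import math

def point_to_cell_distance(p, cell, side):
    """Euclidean distance from p to the closed square occupied by cell."""
    x0, y0 = cell[0] * side, cell[1] * side
    dx = max(x0 - p[0], 0.0, p[0] - (x0 + side))
    dy = max(y0 - p[1], 0.0, p[1] - (y0 + side))
    return math.hypot(dx, dy)

def adj(p, alpha):
    """Cells at distance smaller than alpha from p (grid side alpha/2)."""
    side = alpha / 2
    ci, cj = math.floor(p[0] / side), math.floor(p[1] / side)
    return {
        (i, j)
        for i in range(ci - 2, ci + 3)
        for j in range(cj - 2, cj + 3)
        if point_to_cell_distance(p, (i, j), side) < alpha
    }

cells = adj((0.25, 0.25), 1.0)
```

The set has constant size in R^2, so checking whether ADJ(p) contains a sampled cell costs a constant number of hash evaluations per point.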


Now we keep two sets:
• S_acc: first arrived points of accepted groups
• S_rej: first points p ∉ S_acc such that ADJ(p) has a sampled cell

In R^2, |S_rej| = O(|S_acc|) w.h.p.

How to decide the sample rate?
• if |S_acc| > κ log m, re-sample each sampled cell with prob. 1/2
• roughly, S_acc drops to half of its size
• S_acc is not empty w.h.p.
• space usage O(log m)
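Putting the pieces together, the infinite-window procedure might look roughly as follows. This is a reading of the slides rather than the authors' implementation: the class name, the hash-based cell sampling (SHA-256 as a stand-in), the unshifted grid, and the `threshold` parameter standing in for κ log m are all assumptions made for illustration.

```python
import hashlib
import math
import random

class InfiniteWindowSampler:
    def __init__(self, alpha, threshold, seed=0):
        self.alpha = alpha
        self.side = alpha / 2
        self.threshold = threshold  # stands in for kappa * log m
        self.seed = seed
        self.level = 0              # cells are sampled with prob. 1/2**level
        self.acc = []               # S_acc: first points of accepted groups
        self.rej = []               # S_rej: kept points of non-accepted groups
        self.rng = random.Random(seed)

    def _cell(self, p):
        return (math.floor(p[0] / self.side), math.floor(p[1] / self.side))

    def _sampled(self, cell):
        key = f"{self.seed}:{cell[0]}:{cell[1]}".encode()
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return h % (1 << self.level) == 0

    def _adj_has_sampled(self, p):
        ci, cj = self._cell(p)
        for i in range(ci - 2, ci + 3):
            for j in range(cj - 2, cj + 3):
                x0, y0 = i * self.side, j * self.side
                dx = max(x0 - p[0], 0.0, p[0] - (x0 + self.side))
                dy = max(y0 - p[1], 0.0, p[1] - (y0 + self.side))
                if math.hypot(dx, dy) < self.alpha and self._sampled((i, j)):
                    return True
        return False

    def _refilter(self):
        """Re-apply the (lower) sample rate to both sets."""
        kept = [p for p in self.acc if self._sampled(self._cell(p))]
        dropped = [p for p in self.acc if not self._sampled(self._cell(p))]
        self.acc = kept
        self.rej = [p for p in self.rej + dropped if self._adj_has_sampled(p)]

    def insert(self, p):
        if any(math.dist(p, q) <= self.alpha for q in self.acc + self.rej):
            return  # p belongs to a group that is already tracked
        if self._sampled(self._cell(p)):
            self.acc.append(p)
        elif self._adj_has_sampled(p):
            self.rej.append(p)
        while len(self.acc) > self.threshold:
            self.level += 1  # halve the sample rate
            self._refilter()

    def sample(self):
        """Return a uniformly random accepted representative, or None."""
        return self.rng.choice(self.acc) if self.acc else None
```

In this sketch halving the rate never requires revisiting old stream items, because the hash test makes the sampled cells at a lower rate a subset of those at a higher rate.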


Extending to the sliding window is non-trivial ...

What is the main difficulty?

(Figure: the stream split into expired items and the current window.)
• the representative point expires
• but the group is still valid!

In S_acc, maintain pairs (u, p), where u is the representative point and p is the latest point in G(u).
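A minimal sketch of the (u, p) bookkeeping follows, assuming integer arrival times and a window that covers the w most recent items. The dictionary layout and the names `observe` and `expire` are illustrative, not taken from the talk.

```python
import math

def observe(pairs, p, t, alpha):
    """Record point p arriving at time t.

    pairs maps a representative point u to (latest point in G(u), its time).
    Returns True if p refreshed an existing group.
    """
    for u in pairs:
        if math.dist(u, p) <= alpha:
            pairs[u] = (p, t)
            return True
    return False

def expire(pairs, now, w):
    """Drop groups that have no point among the w most recent items."""
    for u in [u for u, (_, t) in pairs.items() if t <= now - w]:
        del pairs[u]

# The representative arrived at time 1; a later group member arrives at time 5.
pairs = {(0.0, 0.0): ((0.0, 0.0), 1)}
refreshed = observe(pairs, (0.2, 0.0), 5, alpha=1.0)
expire(pairs, now=6, w=5)   # u itself is outside the window, the group is not
alive_at_6 = (0.0, 0.0) in pairs
expire(pairs, now=10, w=5)  # now the latest point has expired as well
alive_at_10 = (0.0, 0.0) in pairs
```

Keeping the latest point alongside the representative is what lets a group outlive its representative: expiry is decided by p, not by u.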


Any other issues?

• pairs in S_acc expire (unlike in the infinite window, IW)
• cannot keep decreasing the sample rate

Can we increase the sample rate? No, previous information is lost.

Solution? Maintain a sampler (Level) for each possible sample rate: 1, 1/2, 1/2^2, 1/2^3, 1/2^4, ...
• Level i samples cells with prob. 1/2^i
• discard expired groups


Any other issues?

• a level may sample > κ · log m
• cannot do re-sampling because of the fixed rate

(Figure: Level i and Level i+1, with S_rej and S_acc; annotation: re-sample cells with prob. 1/2.)

• Level i+1 may now have > κ · log m
• do the re-sampling again in Level i+1
• this process may cascade to the top level
• each level actually samples from a disjoint subwindow

Invariant during the split/merge process: the last point must be in S_acc.


How to generate a sample at the point of query?

• Level T is the top non-empty level
• all groups in Level T − i are re-sampled with prob. 1/2^i
• take the union of all sampled groups
• then return a sample uniformly at random
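The query step can be sketched as below. It assumes `levels[i]` is a list of the accepted groups currently held by Level i and uses independent coin flips for the re-sampling; the talk may instead re-sample via the grid cells, so treat the function `query` as an illustration of the rate equalisation only.

```python
import random

def query(levels, rng):
    """Return one group, bringing every level down to the top level's rate.

    levels[i] holds the accepted groups of Level i (cell rate 1/2**i).
    """
    nonempty = [i for i, groups in enumerate(levels) if groups]
    if not nonempty:
        return None
    top = max(nonempty)  # Level T
    pool = []
    for i, groups in enumerate(levels[: top + 1]):
        keep_prob = 1.0 / (1 << (top - i))  # Level T - j is kept with prob. 1/2**j
        pool.extend(g for g in groups if rng.random() < keep_prob)
    # Groups of Level T are kept with prob. 1, so the pool is never empty here.
    return rng.choice(pool)

rng = random.Random(0)
picked = query([["a", "b"], [], ["c"]], rng)
```

After the re-sampling every surviving group has been subjected to the same overall rate 1/2^T, which is why a uniform choice from the union is reasonable.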


Theorems

For well-shaped datasets in R^O(1):
• infinite window: O(log m) space, O(log m) processing time
• sliding window: O(log m log w) space, O(log m log w) amortized processing time

High dimension R^d?

If β/α > d^1.5:
• infinite window: O(d log m) space and O(d log m) processing time
• sliding window: O(d log m log w) space and O(d log m log w) amortized processing time

Applying dimension reduction reduces d to O(log m).


General datasets ...

Even the partition of S is not well-defined. What is our goal?

Definition (F0): the size of a minimum-cardinality partition of S, G_1, ..., G_F0, such that all points in the same group have distance < α.

Goal: robust ℓ0-sampling returns q such that

∀p ∈ S, Pr[q ∈ Ball(p, α) ∩ S] = Θ(1/F0).

This is consistent with the well-shaped case.

Our algorithms for well-shaped datasets can achieve this goal in R^O(1).


Open Questions

• How to weaken the requirement β/α > d^1.5 in R^d?
• Other formulations for robust ℓ0-sampling on general data sets?
• How to extend to other metric spaces?


Questions?

Thank you!