Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu. Georgia Tech. VLDB 2009

Post on 31-Mar-2015


TRANSCRIPT

1

Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu

Randomized Multi-pass Streaming Skyline Algorithm

Georgia Tech

VLDB 2009

2

In one sentence ….

5

“We develop a streaming algorithm for skyline problem with near-optimal worst-case guarantee.”

6

What is skyline?

7

Hotel                   Price   Distance
Athena                  $97     2.9 km
Park & Suites           $124    3.6 km
Hotel du Helder         $76     3.8 km
de la Cité Concorde     $220    0.67 km
Mercure Carlton Lyon    $163    3.0 km

I want a cheap hotel nearby

8

Hotel                   Price   Distance
Athena                  $97     2.9 km
Park & Suites           $124    3.6 km
Hotel du Helder         $76     3.8 km
de la Cité Concorde     $220    0.67 km
Mercure Carlton Lyon    $163    3.0 km

I want a cheap hotel nearby

[Annotation on the table: dominates]

9

Hotel                   Price   Distance
Athena                  $97     2.9 km
Park & Suites           $124    3.6 km
Hotel du Helder         $76     3.8 km
de la Cité Concorde     $220    0.67 km
Mercure Carlton Lyon    $163    2.9 km

I want a cheap hotel nearby

[Annotation on the table: dominates]

10

[Figure: the five hotels plotted by Price and Distance. Points: de la Cité Concorde, Park & Suites, Hotel du Helder, Athena, Mercure Carlton Lyon]


12

Problem definition

• Given distinct d-dimensional points
• (a1, …, ad) dominates (b1, …, bd) if ai ≤ bi for all i and ai’ < bi’ for some i’
• Skyline = set of undominated points

Example: (1, 3), (5, 2), (3, 2)

(3, 2) dominates (5, 2)

Skyline = { (1, 3), (3, 2) }
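The definition can be checked directly on the slide's example. The following is a minimal sketch of the definition only, not the algorithm from the talk; the function names `dominates` and `skyline` are illustrative assumptions, and smaller values are treated as better in every coordinate.

```python
def dominates(a, b):
    """Return True if a dominates b: a <= b in every coordinate, a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Naive O(n^2) in-memory skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 3), (5, 2), (3, 2)]
print(dominates((3, 2), (5, 2)))  # True
print(skyline(points))            # [(1, 3), (3, 2)]
```

This quadratic version keeps all points in memory, which is exactly what the streaming setting in the rest of the talk tries to avoid.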

13

Skyline algorithms

RAM
DD&C (Kung et al., FOCS’75), LD&C (Bentley et al., JACM’78), FLET (Bentley et al., SODA’90)

Disk (External), Preprocessing
BBS (Papadias et al., SIGMOD’03), NN (Kossmann et al., VLDB’02)

Disk (External), Non-preprocessing
SD&C (Borzsonyi et al., ICDE’01), BNL (Borzsonyi et al., ICDE’01), SFS (Chomicki et al., ICDE’03), LESS (Godfrey et al., VLDB’05)

14

Our Goal

“Non-preprocessing external algorithm with worst-case guarantee”

What is the model of external algorithms?

15

Models for external algorithms

• CPU processing ≠ I/O
• Sequential I/O ≠ Random I/O

Multi-pass Streaming Model

# of random I/O’s = # of passes

The streaming model naturally forces us to minimize the number of random I/O’s
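To make the cost model concrete, here is a small sketch of a multi-pass stream. The class `MultiPassStream` and its `passes` counter are hypothetical names introduced for illustration; an in-memory list stands in for the data on disk.

```python
class MultiPassStream:
    """Data that can only be read front to back, one full pass at a time."""

    def __init__(self, items):
        self._items = list(items)  # stands in for the file on disk
        self.passes = 0            # one pass = one seek to the start = one random I/O

    def scan(self):
        """Yield the items in order; the pass is counted when iteration starts."""
        self.passes += 1
        for item in self._items:
            yield item


stream = MultiPassStream([(1, 2), (3, 7), (5, 3), (2, 5), (4, 1), (9, 9)])
smallest_x = min(p[0] for p in stream.scan())                   # pass 1
ties = sum(1 for p in stream.scan() if p[0] == smallest_x)      # pass 2
print(smallest_x, ties, stream.passes)  # 1 1 2
```

Everything between two calls to `scan` is sequential I/O, so an algorithm's pass count is a direct proxy for its random I/O's in this model.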

16

What is multi-pass stream?

17

(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)

Small RAM

Huge Harddisk


21

(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)

Small RAM

Huge Harddisk

2nd pass

22

(1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)

Small RAM

Huge Harddisk

3rd pass

23

Our Goal

“Non-preprocessing streaming algorithm with worst-case guarantee” (“streaming” replaces “external”)

24

Main results

Theory

RAND: Almost optimal multi-pass streaming algorithm for skyline

• RAND uses O(log n) passes & O(m) space
• Every algorithm that uses 1 pass needs Ω(n) space

n = # of points and m = skyline size

Next: RAND algorithm
Later: Experimental results

25

RAND algorithm

26

Algorithms: Main Idea

Suppose m is known.

Theorem: In 3 passes and m space, we can find skyline points that “dominate” at least n/2 points, with high probability

27

Eliminate-Points algorithm

Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

Sampled point: (4, 4)

1. Sample x = 2m ln(mn log n) points p1, p2, …, px
2. Go through the stream, replace each pi by a point dominating it
3. For each pi, delete pi and all points it dominates

Output p1, p2, …, px and repeat

28–40

Eliminate-Points algorithm (same three steps, animated on the example stream)

(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

Sampled point as the stream is read: (4, 4) → (3, 4) → (3, 3)
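One round of the three steps above can be sketched as follows. This is an illustrative in-memory simulation under stated assumptions, not the authors' implementation: the names `improve` and `eliminate_points` are hypothetical, a Python list stands in for the stream, sampling is done without replacement via `random.sample`, and "log n" is taken as base 2 because the slides do not give the base.

```python
import math
import random


def dominates(a, b):
    """a dominates b: a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def improve(p, stream):
    """Step 2 for a single sample: replace p whenever a streamed point dominates it."""
    for q in stream:
        if dominates(q, p):
            p = q
    return p


def eliminate_points(stream, m, rng):
    """One round: returns (skyline points found, points that survive the round)."""
    n = len(stream)
    # Step 1: sample x = 2m ln(mn log n) points, capped at n (log base 2 assumed).
    inner = max(2.0, m * n * math.log2(max(n, 2)))
    x = min(n, math.ceil(2 * m * math.log(inner)))
    sample = rng.sample(stream, x)
    # Step 2: one pass over the stream, updating every sample at once.
    for q in stream:
        sample = [q if dominates(q, p) else p for p in sample]
    found = set(sample)
    # Step 3: one more pass, dropping the found points and everything they dominate.
    rest = [q for q in stream
            if q not in found and not any(dominates(p, q) for p in found)]
    return found, rest


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
print(improve((4, 4), stream))  # (3, 3)
found, rest = eliminate_points(stream, m=2, rng=random.Random(0))
print(sorted(found), rest)      # [(1, 5), (3, 3)] []
```

On this six-point example the sample size formula exceeds n, so every point is sampled and a single round already returns the full skyline; on larger inputs a round is only expected to remove a constant fraction of the points.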

41

Analysis

Theorem: Eliminate-Points algorithm deletes at least n/2 points with high probability

42

Analysis

• Draw trees: Each point points to its first dominating point

(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

    (1, 5)          (3, 3)
                   /      \
              (3, 4)      (4, 3)
              /    \
         (4, 4)    (4, 5)

Note: There will be m trees, each rooted by a skyline point

Sampled point: (4, 4), which ends at the root (3, 3)

47

Analysis

• Claim: Any tree from which some element is sampled will be deleted

(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

Example: the sample (4, 4) ends at the root (3, 3), and the root dominates every other point in its tree

48

Analysis

• There are m trees, each rooted by a skyline point

[Figure: the m trees, labeled 1, 2, …, m-1, m]

50

Analysis

• A big tree has a bigger chance of being sampled… and deleted

51

Analysis

• If enough points are sampled, every tree that is “big enough” will be deleted

52

Analysis

Lemma: With high probability, all trees of size at least n/(2m) are deleted

• We delete at least n/2 points in total: each surviving tree has fewer than n/(2m) points and there are at most m trees, so fewer than n/2 points survive
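The sample size x = 2m ln(mn log n) is what makes the lemma go through. The calculation below is a sketch under the assumption that the x samples are drawn independently and uniformly at random with replacement (the slides do not spell out the sampling scheme); T denotes any fixed tree with at least n/(2m) points.

```latex
\begin{align*}
\Pr[\text{no sample lands in } T]
  &\le \left(1 - \frac{1}{2m}\right)^{x}
   \le e^{-x/(2m)}
   = \frac{1}{mn\log n}
   \qquad \text{for } x = 2m \ln(mn \log n), \\
\Pr[\text{some tree of size} \ge n/(2m) \text{ receives no sample}]
  &\le m \cdot \frac{1}{mn\log n}
   = \frac{1}{n \log n}.
\end{align*}
```

The second line is a union bound over the at most m trees, so under these assumptions a round fails with probability at most 1/(n log n).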

53

Extending to RAND

• Recall: If we know m then we can delete n/2 points in 3 passes
• If m is known, we can find skyline in O(log n) passes with high probability
  – We delete n/2 points every 3 passes
• m is not known
  – Guess m by “doubling trick”
  – Additional O(log m) passes
• Fixed-window case
  – Memory space is limited
• Random I/O’s, Sequential I/O’s and Number of comparisons have to be analyzed separately
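The bullets above can be put together into a rough end-to-end sketch. The slides do not say how the doubling trick decides when to double, so the rule used here (double the guess whenever a round removes fewer than half of the remaining points) is an assumption, as are the names `eliminate_round` and `rand_skyline`, the base-2 logarithm, and sampling without replacement. Lists stand in for the stream, and the fixed-window case is not modeled.

```python
import math
import random


def dominates(a, b):
    """a dominates b: a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eliminate_round(points, m_guess, rng):
    """One 3-pass round with the current guess for m."""
    n = len(points)
    inner = max(2.0, m_guess * n * math.log2(max(n, 2)))
    x = min(n, math.ceil(2 * m_guess * math.log(inner)))
    sample = rng.sample(points, x)                         # pass 1: sample
    for q in points:                                       # pass 2: climb to skyline points
        sample = [q if dominates(q, p) else p for p in sample]
    found = set(sample)
    rest = [q for q in points                              # pass 3: delete
            if q not in found and not any(dominates(p, q) for p in found)]
    return found, rest


def rand_skyline(points, rng=None):
    """Repeat rounds until no points remain; assumes the input points are distinct."""
    rng = rng or random.Random()
    result, remaining, m_guess = set(), list(points), 1
    while remaining:
        found, rest = eliminate_round(remaining, m_guess, rng)
        result |= found
        if len(rest) > len(remaining) / 2:  # assumed doubling rule: too little progress
            m_guess *= 2
        remaining = rest
    return result


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
print(sorted(rand_skyline(stream, random.Random(1))))  # [(1, 5), (3, 3)]
```

The output is always the exact skyline because each round removes only skyline points and points they dominate; the randomness affects only how many rounds are needed.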


55

Main results

Theory

RAND: Almost optimal multi-pass streaming algorithm for skyline

• O(log n) passes & O(m) space
• 1 pass needs Ω(n) space

n = # of points and m = skyline size

Algorithms comparison (w = window (memory) size)

Algorithm   Random I/O’s         Sequential I/O’s     Comparisons
BNL(w)      Θ(min{w, n/w})       Θ(min{w, n²/w})      Θ(d·min{wmn, n²})
LESS(w)     Θ(n log_w (n/w))     Θ(mn/w)              Θ(dmn + n log n)
RAND(w)     O(m log (n/w))       O(mn/w)              O(dmn)

56

Main results

Experiment: RAND vs BNL & LESS

Average case

Worst case

We try several datasets from the literature:

Correlated, Anti-correlated, Independent, Island, House, NBA, Color

57

Experimental Results

Average case
- No clear winner between BNL and LESS
- RAND is always close to the winner

[Chart legend: RAND, BNL & LESS]

58

Experimental Results

“Worse”: After sorting by decreasing first coordinate
- RAND is the most robust and usually fastest

[Chart legend: RAND, BNL & LESS]

59

Experimental Results

“Even Worse”: After sorting by “entropy”

[Chart legend: RAND, BNL & LESS]

60

Summary

• Disk as a stream: (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9)
• Random sampling over the trees 1, 2, …, m-1, m gives RAND
• Experiment: RAND vs BNL & LESS, average case and worst case

61

Summary

Extensions
• Distributed skyline algorithm
• Derandomize the algorithm for 2D case
• Skyline for partially ordered sets (posets)

Open problems
• Develop algorithm on Parallel Disk Model (PDM) and Cache Oblivious model
• Extend the techniques to pre-processing algorithm
• Is O(log n) passes the best possible?

62

Thank you

63

Appendix

64

Charts for average case

65

66

The lower bound

Theorem: Any randomized one-pass algorithm with space at most n/2 succeeds with probability at most 1/2

Proof
- Random unique survivor
- 2 points come at the end
- If space ≤ n/2, the algorithm fails whenever it did not store the survivor in memory

67

Proof of Claim

68

Proof of Claim

• Claim: Any tree from which some element is sampled will be deleted

(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

    (1, 5)          (3, 3)
                   /      \
              (3, 4)      (4, 3)
              /    \
         (4, 4)    (4, 5)

Sampled point: (4, 4). Final point: (3, 3)

69

Analysis

• Draw trees: Each point points to its first dominating point

(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

Sampled point as the stream is read: (4, 4) → (3, 4) → (3, 3), moving from (4, 4) up to the root (3, 3) of its tree
