A New Point Access Method based on Wavelet Trees

Nieves R. Brisaboa,

Miguel R. Luaces,

Diego Seco

Database Laboratory, University of A Coruña, A Coruña, Spain

Gonzalo Navarro, Department of Computer Science, University of Chile, Santiago, Chile

Gramado - SeCoGIS 2009 2 11th November, 2009

Outline

- Motivation
- Compressed Data Structures
- PW-Tree
- Experiments
- Conclusions and Future Work

Motivation

- Spatial indexes are a key component in GIS
  - Large collections of geographic data
  - Geographic operations are very complex
  - Sequential search is not feasible
- Spatial index classification (by indexable objects)
  - Point Access Methods (PAMs), e.g. the K-d-tree family
  - Spatial Access Methods (SAMs), e.g. the R-tree family

Motivation

- Typical requirements of spatial indexes:
  - Dynamic operations: inserts, deletes, updates, …
  - Secondary storage management
  - Space consumption is a less important issue
- Nowadays, some of these requirements have changed
  - Static data collections are useful in many domains
  - Memory hierarchy evolution
    - Reduction of the main memory cost
    - New levels (flash memory)
- Our goal is a new point access method
  - Static geographic data collections
  - Main memory: compact
  - Efficiency similar to classical indexes

Compressed Data Structures

- Same features as classical data structures, with a low storage cost
- Based on two very efficient bit vector operations: rank and select
- Rank: returns the number of times bit b appears in the prefix B[1, i]

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
B =   0  1  0  0  1  1  0  1  1  0  0  0  0  0  0  1  0  0  0  0  0

rank1(B, 6) = 3
rank0(B, 16) = 10

Compressed Data Structures

- Select: returns the position i of the j-th appearance of bit b in B[1, n]

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
B =   0  1  0  0  1  1  0  1  1  0  0  0  0  0  0  1  0  0  0  0  0

select1(B, 2) = 5
select0(B, 9) = 14
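The two operations can be checked against the slide values with a minimal Python sketch. It scans the bit vector linearly, which is only for illustration: compressed data structures normally answer rank and select with small auxiliary tables instead of a scan. The function names and the 1-based convention are assumptions chosen to match the slide notation.

```python
# Bit vector from the slides, stored 0-based; the slides index it from 1.
B = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


def rank(bits, b, i):
    """Number of occurrences of bit b in the prefix bits[1..i] (1-based, inclusive)."""
    return sum(1 for x in bits[:i] if x == b)


def select(bits, b, j):
    """1-based position of the j-th occurrence of bit b, or None if there are fewer than j."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == j:
                return pos
    return None


print(rank(B, 1, 6))    # 3
print(rank(B, 0, 16))   # 10
print(select(B, 1, 2))  # 5
print(select(B, 0, 9))  # 14
```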

PW-tree

Abstraction
- N points distributed in a two-dimensional space
- Construction of an N x N matrix
- One point for each row i and one for each column j

Example with N = 16, listing the row of the point in each column:

Column   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Row     15  1  4 11 16 12 10 13  8  7  3  5  2 14  6  9
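One way to obtain such a matrix from real coordinates is to replace each coordinate by its rank in sorted order. The sketch below is an assumption about how this step could be done, not a description of the authors' code: the function name `to_rank_space`, the sample points and the tie-breaking rule are all hypothetical.

```python
def to_rank_space(points):
    """Map N points (x, y) to a list rows where rows[c - 1] is the row of column c.

    Columns follow the order of x and rows the order of y, both 1-based.
    Ties are broken by the other coordinate and then by input order (an assumption).
    """
    n = len(points)
    by_x = sorted(range(n), key=lambda k: (points[k][0], points[k][1], k))
    by_y = sorted(range(n), key=lambda k: (points[k][1], points[k][0], k))
    row_of_point = {k: r for r, k in enumerate(by_y, start=1)}
    return [row_of_point[k] for k in by_x]


# Hypothetical sample coordinates, not taken from the slides.
points = [(2.5, 7.0), (0.1, 3.2), (9.4, 0.5), (4.8, 9.9)]
print(to_rank_space(points))  # [2, 3, 4, 1]
```

The result is a permutation of 1..N, which is the shape of the Column/Row table used in the example.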

PW-tree

Wavelet tree construction
- The root covers the row range [1, 16] and holds the rows in column order
- At the root, rows in [1, 8] → 0 and rows in [9, 16] → 1
- Rows marked 0 go to the left child and rows marked 1 to the right child, keeping their relative order; each child splits its own range in the same way

Node       Rows (in node order)                      Bitmap
[1, 16]    15 1 4 11 16 12 10 13 8 7 3 5 2 14 6 9    1 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1
[1, 8]     1 4 8 7 3 5 2 6                           0 0 1 1 0 1 0 1
[9, 16]    15 11 16 12 10 13 14 9                    1 0 1 0 0 1 1 0
[1, 4]     1 4 3 2                                   0 1 1 0
[5, 8]     8 7 5 6                                   1 1 0 0
[9, 12]    11 12 10 9                                1 1 0 0
[13, 16]   15 16 13 14                               1 1 0 0
[1, 2]     1 2                                       0 1
[3, 4]     4 3                                       1 0
[5, 6]     5 6                                       0 1
[7, 8]     8 7                                       1 0
[9, 10]    10 9                                      1 0
[11, 12]   11 12                                     0 1
[13, 14]   13 14                                     0 1
[15, 16]   15 16                                     0 1
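The construction can be sketched recursively in Python. This is a minimal illustration under stated assumptions: node bitmaps are kept as plain lists in a dictionary keyed by the row range of the node, and the names `build` and `ROWS` are hypothetical. A compact implementation would store the bits in a rank/select bit vector instead.

```python
# Row of the point in each column (columns 1..16), as in the example.
ROWS = [15, 1, 4, 11, 16, 12, 10, 13, 8, 7, 3, 5, 2, 14, 6, 9]


def build(seq, lo, hi, nodes=None):
    """Store in nodes[(lo, hi)] the bitmap of the node covering rows lo..hi."""
    if nodes is None:
        nodes = {}
    if lo == hi:
        return nodes  # a leaf holds a single row and needs no bitmap
    mid = (lo + hi) // 2
    # 0 marks rows in the lower half of the range, 1 rows in the upper half.
    nodes[(lo, hi)] = [0 if v <= mid else 1 for v in seq]
    build([v for v in seq if v <= mid], lo, mid, nodes)
    build([v for v in seq if v > mid], mid + 1, hi, nodes)
    return nodes


tree = build(ROWS, 1, 16)
print(tree[(1, 16)])   # [1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(tree[(1, 8)])    # [0, 0, 1, 1, 0, 1, 0, 1]
print(tree[(9, 16)])   # [1, 0, 1, 0, 0, 1, 1, 0]
```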

PW-tree

Obtain the row of the point that is in column 8

Bitmaps of the nodes visited from the root down to the leaf:

[1, 16]    B    = 1 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1
[9, 16]    B'   = 1 0 1 0 0 1 1 0
[13, 16]   B''  = 1 1 0 0
[13, 14]   B''' = 0 1

rank1(B, 8) = 6
rank1(B', 6) = 3
rank0(B'', 3) = 1
rank0(B''', 1) = 1

Result: row 13
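The descent can be written as a short loop: read the bit at the current position, apply rank for that bit to get the position in the child, and narrow the row range. The sketch below is a self-contained illustration with hypothetical names (`row_of_column`, `build`, `rank`) and a naive linear-time rank; it reproduces the four rank calls of the example.

```python
ROWS = [15, 1, 4, 11, 16, 12, 10, 13, 8, 7, 3, 5, 2, 14, 6, 9]


def rank(bits, b, i):
    """Occurrences of bit b in bits[1..i] (1-based, inclusive); naive scan."""
    return sum(1 for x in bits[:i] if x == b)


def build(seq, lo, hi, nodes):
    """Store in nodes[(lo, hi)] the bitmap of the node covering rows lo..hi."""
    if lo == hi:
        return
    mid = (lo + hi) // 2
    nodes[(lo, hi)] = [0 if v <= mid else 1 for v in seq]
    build([v for v in seq if v <= mid], lo, mid, nodes)
    build([v for v in seq if v > mid], mid + 1, hi, nodes)


def row_of_column(nodes, col, lo, hi):
    """Descend from the root; return the row of the point stored in column col."""
    pos = col
    while lo < hi:
        bits = nodes[(lo, hi)]
        mid = (lo + hi) // 2
        b = bits[pos - 1]          # which child holds this point
        pos = rank(bits, b, pos)   # its position inside that child
        if b == 0:
            hi = mid
        else:
            lo = mid + 1
    return lo


tree = {}
build(ROWS, 1, 16, tree)
print(row_of_column(tree, 8, 1, 16))  # 13
```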

PW-tree

Obtain the column of the point that is in row 6

Bitmaps of the nodes visited from the leaf up to the root:

[5, 6]     B''' = 0 1
[5, 8]     B''  = 1 1 0 0
[1, 8]     B'   = 0 0 1 1 0 1 0 1
[1, 16]    B    = 1 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1

select1(B''', 1) = 2
select0(B'', 2) = 4
select1(B', 4) = 8
select0(B, 8) = 15

Result: column 15
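The ascending traversal is the mirror image of the descent: the path to the leaf of the row is determined by the row value alone, and select is applied at each node on the way back to the root. The following self-contained sketch uses hypothetical names (`column_of_row`, `build`, `select`) and a naive select; it reproduces the four select calls of the example.

```python
ROWS = [15, 1, 4, 11, 16, 12, 10, 13, 8, 7, 3, 5, 2, 14, 6, 9]


def select(bits, b, j):
    """1-based position of the j-th occurrence of bit b; naive scan."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == j:
                return pos
    return None


def build(seq, lo, hi, nodes):
    """Store in nodes[(lo, hi)] the bitmap of the node covering rows lo..hi."""
    if lo == hi:
        return
    mid = (lo + hi) // 2
    nodes[(lo, hi)] = [0 if v <= mid else 1 for v in seq]
    build([v for v in seq if v <= mid], lo, mid, nodes)
    build([v for v in seq if v > mid], mid + 1, hi, nodes)


def column_of_row(nodes, row, lo, hi):
    """Climb from the leaf of `row` to the root; return the column of its point."""
    path = []  # (node range, branch bit) from the root down to the leaf
    while lo < hi:
        mid = (lo + hi) // 2
        b = 0 if row <= mid else 1
        path.append(((lo, hi), b))
        if b == 0:
            hi = mid
        else:
            lo = mid + 1
    pos = 1  # each leaf holds exactly one point
    for node, b in reversed(path):
        pos = select(nodes[node], b, pos)
    return pos


tree = {}
build(ROWS, 1, 16, tree)
print(column_of_row(tree, 6, 1, 16))  # 15
```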

PW-tree

Solve the range query q:{r[12,16], c[6,10]}

The column interval [6, 10] is mapped to each child with rank; children whose row range is disjoint from [12, 16] are discarded.

Root [1, 16], B = 1 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1, positions [6, 10]
- Left child [1, 8]: disjoint from [12, 16], discarded
- Right child [9, 16]: rank1(B, 6-1)+1 = 4, rank1(B, 10) = 6 → positions [4, 6]

Node [9, 16], B' = 1 0 1 0 0 1 1 0, positions [4, 6]
- Left child [9, 12]: rank0(B', 4-1)+1 = 2, rank0(B', 6) = 3 → positions [2, 3]
- Right child [13, 16]: rank1(B', 4-1)+1 = 3, rank1(B', 6) = 3 → position 3

Node [9, 12], positions [2, 3]
- Left child [9, 10]: disjoint from [12, 16], discarded
- Right child [11, 12]: leads to row 12

Node [13, 16], B'' = 1 1 0 0, position 3: rank0(B'', 3) = 1
Node [13, 14], B''' = 0 1, position 1: rank0(B''', 1) = 1, leads to row 13

Result q, as (row, column): (12, 6) and (13, 8)
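The range query combines both traversals. A minimal Python sketch, assuming dictionary-based node bitmaps and naive rank and select: the descent maps the column interval with rank and prunes nodes whose row range misses the query, and each reported row is then mapped back to its column with select. All function names are hypothetical, and the ascent shown here is only one possible way to recover the columns.

```python
ROWS = [15, 1, 4, 11, 16, 12, 10, 13, 8, 7, 3, 5, 2, 14, 6, 9]


def rank(bits, b, i):
    """Occurrences of bit b in bits[1..i] (1-based, inclusive); naive scan."""
    return sum(1 for x in bits[:i] if x == b)


def select(bits, b, j):
    """1-based position of the j-th occurrence of bit b (assumed to exist)."""
    return [p for p, x in enumerate(bits, start=1) if x == b][j - 1]


def build(seq, lo, hi, nodes):
    """Store in nodes[(lo, hi)] the bitmap of the node covering rows lo..hi."""
    if lo == hi:
        return
    mid = (lo + hi) // 2
    nodes[(lo, hi)] = [0 if v <= mid else 1 for v in seq]
    build([v for v in seq if v <= mid], lo, mid, nodes)
    build([v for v in seq if v > mid], mid + 1, hi, nodes)


def rows_in_range(nodes, lo, hi, c1, c2, r1, r2, out):
    """Append to out the rows in [r1, r2] found at positions c1..c2 of node [lo, hi]."""
    if c1 > c2 or hi < r1 or lo > r2:
        return  # empty interval, or node range disjoint from the query rows
    if lo == hi:
        out.append(lo)
        return
    bits = nodes[(lo, hi)]
    mid = (lo + hi) // 2
    rows_in_range(nodes, lo, mid, rank(bits, 0, c1 - 1) + 1, rank(bits, 0, c2), r1, r2, out)
    rows_in_range(nodes, mid + 1, hi, rank(bits, 1, c1 - 1) + 1, rank(bits, 1, c2), r1, r2, out)


def column_of_row(nodes, row, lo, hi):
    """Ascending traversal with select, from the leaf of `row` to the root."""
    path = []
    while lo < hi:
        mid = (lo + hi) // 2
        b = 0 if row <= mid else 1
        path.append(((lo, hi), b))
        lo, hi = (lo, mid) if b == 0 else (mid + 1, hi)
    pos = 1
    for node, b in reversed(path):
        pos = select(nodes[node], b, pos)
    return pos


tree = {}
build(ROWS, 1, 16, tree)
rows = []
rows_in_range(tree, 1, 16, 6, 10, 12, 16, rows)
result = [(r, column_of_row(tree, r, 1, 16)) for r in rows]
print(result)  # [(12, 6), (13, 8)]
```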

PW-tree

Solve the range query q:{r[12,16], c[6,10]}
- Point identifiers must be returned
- An ordered array stores the relation between rows (or columns) and identifiers
- The wavelet tree solutions are used to access this ordered array and obtain the identifiers

Column   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Id      65 45 43 34 78 86 98 10 44 12 14 24 28 99 84 20

Wavelet tree solution: (12, 6) and (13, 8)
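With the identifiers ordered by column, as in the table, the last step is a plain array lookup. A short sketch, where the variable names are hypothetical and each solution is read as a (row, column) pair:

```python
# Identifier of the point in each column (columns 1..16), as in the table.
IDS = [65, 45, 43, 34, 78, 86, 98, 10, 44, 12, 14, 24, 28, 99, 84, 20]

# Solutions of the range query as (row, column) pairs.
solutions = [(12, 6), (13, 8)]

# Columns are 1-based on the slides, Python lists are 0-based.
result_ids = [IDS[col - 1] for _row, col in solutions]
print(result_ids)  # [86, 10]
```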

PW-tree

Two variants of this structure:
- DPW-tree
  - Point identifiers are stored in the same order as the tree leaves
  - The algorithm always needs to reach these leaves
- UPW-tree
  - Point identifiers are stored in the same order as the root node
  - The first tree traversal can be stopped without reaching the leaves
  - A second ascending traversal is necessary

Experiments (space)

Structure   Total (bytes)                            Bytes per point
PW-tree     20N + (N lg N x 1.375)/8                 23.69
R-tree      20N + 36N/(M-1)                          21.24
K-d-tree    20N + 16(2^(h-1) + (N mod 2^⌊lg N⌋))     36.00

Notes:
- R-tree: M = 30 (best experimental performance)
- K-d-tree: h = ⌈lg N⌉
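The per-point figures can be reproduced by evaluating the three formulas and dividing by N. The sketch below rests on two assumptions: the K-d-tree term is read as 16(2^(h-1) + (N mod 2^⌊lg N⌋)), which is one reading of the flattened slide text, and N = 3,000,000 is a hypothetical collection size, since the slide does not state the N behind the PW-tree figure.

```python
import math


def pw_tree_bytes(n):
    """20N + (N lg N x 1.375) / 8"""
    return 20 * n + (n * math.log2(n) * 1.375) / 8


def r_tree_bytes(n, m=30):
    """20N + 36N / (M - 1), with M = 30 as in the slide notes."""
    return 20 * n + 36 * n / (m - 1)


def kd_tree_bytes(n):
    """20N + 16(2^(h-1) + (N mod 2^floor(lg N))), with h = ceil(lg N)."""
    h = math.ceil(math.log2(n))
    return 20 * n + 16 * (2 ** (h - 1) + n % 2 ** math.floor(math.log2(n)))


n = 3_000_000  # hypothetical size, not taken from the slides
print(round(r_tree_bytes(n) / n, 2))   # 21.24
print(round(kd_tree_bytes(n) / n, 2))  # 36.0
print(round(pw_tree_bytes(n) / n, 2))  # depends on N
```

The R-tree figure does not depend on N, and the K-d-tree figure is 36 bytes per point for any N that is not a power of two under this reading. Only the PW-tree figure grows with lg N.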

Results (time)

Uniform distribution

[Figure: four log-scale plots of time (ms) against number of points (x 10^6), one per selectivity: 0.01%, 0.1%, 1% and 10%. Series: UPW-tree, DPW-tree, R*-tree, STR R-tree, K-d-tree.]

Results (time)

Zipf distribution

[Figure: time (ms) against selectivity (0.001%, 0.01%, 0.1%, 1%, 10%). Series: UPW-tree, DPW-tree, R*-tree, STR R-tree, K-d-tree.]

Results (time)

Gauss distribution

[Figure: time (ms) against selectivity (0.001%, 0.01%, 0.1%, 1%, 10%). Series: UPW-tree, DPW-tree, R*-tree, STR R-tree, K-d-tree.]

Results (time)

North East dataset (123,593 postal addresses)

[Figure: NE dataset, time (ms) against selectivity (0.001%, 0.01%, 0.1%, 1%, 10%). Series: UPW-tree, DPW-tree, R*-tree, STR R-tree, K-d-tree.]

Results (time)

Geonames gazetteer (2,693,569 populated places)

[Figure: Geonames, time (ms) against selectivity (0.001%, 0.01%, 0.1%, 1%, 10%). Series: UPW-tree, DPW-tree, R*-tree, STR R-tree, K-d-tree.]

Conclusions and Future Work

Conclusions:
- A new PAM based on compressed data structures (wavelet tree, rank, select)
- Two variants (DPW-tree, UPW-tree)
- Good experimental performance

Future Work:
- Algorithms to solve other queries (k-NN, spatial join)
- Support for dynamic operations
- New spatial compressed data structures:
  - Spatial access methods based on wavelet trees
  - Balanced representation of a K-d-tree

A New Point Access Method based on Wavelet Trees

Contact: Diego Seco (udc.es)