Indexing Multidimensional Feature Spaces: Overview of Multidimensional Index Structure; Hybrid Tree, Chakrabarti et al., ICDE 1999; Local Dimensionality Reduction, Chakrabarti et al., VLDB 2000

Upload: dina-miller

Post on 24-Dec-2015


Indexing Multidimensional Feature Spaces

Overview of Multidimensional Index Structure

Hybrid Tree, Chakrabarti et al., ICDE 1999

Local Dimensionality Reduction, Chakrabarti et al., VLDB 2000

Queries over Feature Spaces

• Consider a d-dimensional feature space
  – color histogram, texture, …
• Nature of queries
  – Range queries: objects that reside within the region specified in the query
  – K-nearest neighbor queries: objects that are closest to a query object based on a distance metric
  – Approx. nearest neighbor queries: the retrieved object's distance is within a factor (1 + epsilon) of the distance to the real nearest neighbor.
  – All-pair (similarity join) queries: retrieve all pairs of objects within an epsilon threshold of each other.
• A search algorithm may return:
  – false positives: objects that do not meet the query condition, but are retrieved anyway. We try to minimize false positives.
  – false negatives: objects that meet the query condition but are not returned. Usually, approaches avoid false negatives.
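As a point of reference for the query types above, here is a minimal brute-force sketch in Python. It scans every point, so it has neither false positives nor false negatives; an index tries to return the same answers while touching far fewer points. The function names and the sample points are illustrative assumptions, not part of the slides.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two d-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(points, low, high):
    """Return the points lying inside the axis-aligned box [low, high]."""
    return [p for p in points
            if all(lo <= x <= hi for x, lo, hi in zip(p, low, high))]

def knn_query(points, q, k, dist=euclidean):
    """Return the k points closest to q under the given distance function."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, q))

# Hypothetical 2-d feature vectors.
points = [(1, 1), (2, 5), (4, 4), (7, 2), (8, 8)]
in_box = range_query(points, (0, 0), (5, 5))
nearest_two = knn_query(points, (3, 3), 2)
```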

Approach: Utilize Single Dimensional Index

• Index on attributes independently

• Project the query range onto each attribute to determine pointers.

• Intersect the pointers.

• Go to the database and retrieve the objects in the intersection.

May result in very high I/O cost.
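The steps above can be sketched as follows, assuming each single-attribute index is simply a sorted list of (value, record id) pairs searched with `bisect` (a stand-in for a B-tree; all names here are hypothetical). Note that each per-attribute pointer set can be much larger than the final intersection, which is where the high cost comes from.

```python
from bisect import bisect_left, bisect_right

def build_index(records, attr):
    """Sorted list of (value, record id) pairs for one attribute."""
    return sorted((rec[attr], rid) for rid, rec in enumerate(records))

def ids_in_range(index, lo, hi):
    """Record ids whose indexed value lies in [lo, hi]."""
    start = bisect_left(index, (lo, -1))              # before any (lo, rid)
    end = bisect_right(index, (hi, float("inf")))     # after any (hi, rid)
    return {rid for _, rid in index[start:end]}

def range_search(records, indexes, query):
    """query holds one (lo, hi) interval per attribute."""
    candidates = None
    for attr, (lo, hi) in enumerate(query):
        ids = ids_in_range(indexes[attr], lo, hi)     # project onto one attribute
        candidates = ids if candidates is None else candidates & ids
    return [records[rid] for rid in sorted(candidates)]   # fetch the intersection

records = [(1, 9), (2, 3), (5, 4), (6, 8), (9, 1)]
indexes = [build_index(records, a) for a in range(2)]
result = range_search(records, indexes, [(1, 6), (3, 8)])
```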

Multiple Key Index

• Index on one attribute provides pointers to an index on the other

[Figure: an index on the first attribute whose entries point to indexes on the second attribute]

• Cannot support partial match queries on the second attribute

• Performance of range search is not much better than with the independent attribute approach

• The secondary indices may be of different sizes -- specifically, some of them may be very small

R-tree Data Structure

• Extension of B-tree to multidimensional space.

• Paginated, balanced, guaranteed storage utilization.

• Can support both point data and data with spatial extent

• Groups objects into possibly overlapping clusters (rectangles in our case)

• Search for range query proceeds along all paths that overlap with the query.

R-tree Insert Object E

• Step I1
  – ChooseLeaf L to insert E /* find position to insert */
• Step I2
  – If L has room, install E
  – Else SplitNode(L)
• Step I3
  – AdjustTree /* propagate changes */
• Step I4
  – If the node split propagates to the root, adjust the height of the tree

ChooseLeaf

• Step CL1:
  – Set N to be the root
• Step CL2:
  – If N is a leaf, return N
• Step CL3:
  – If N is not a leaf, let F be the entry whose rectangle needs the least enlargement to include the object
    • break ties by choosing the smaller rectangle
• Step CL4:
  – Set N to be the child node pointed to by entry F
  – go to Step CL2
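A minimal Python sketch of these steps, assuming a rectangle is a `(low, high)` pair of coordinate tuples and a node is a small class holding `(rect, child_or_object)` entries. The `Node` class and helper names are hypothetical, chosen only for illustration.

```python
def area(rect):
    """Area (volume) of a rectangle given as (low, high) coordinate tuples."""
    lo, hi = rect
    result = 1
    for l, h in zip(lo, hi):
        result *= (h - l)
    return result

def enlarge(rect, other):
    """Smallest rectangle covering both rect and other."""
    (lo, hi), (olo, ohi) = rect, other
    return (tuple(min(a, b) for a, b in zip(lo, olo)),
            tuple(max(a, b) for a, b in zip(hi, ohi)))

class Node:
    def __init__(self, is_leaf, entries=None):
        self.is_leaf = is_leaf
        self.entries = entries or []   # list of (rect, child_or_object)

def choose_leaf(root, rect):
    node = root                                        # CL1
    while not node.is_leaf:                            # CL2
        # CL3: least enlargement first, smaller rectangle breaks ties
        _, child = min(
            node.entries,
            key=lambda e: (area(enlarge(e[0], rect)) - area(e[0]), area(e[0])))
        node = child                                   # CL4
    return node

leaf_a, leaf_b = Node(True), Node(True)
root = Node(False, [(((0, 0), (2, 2)), leaf_a), (((5, 5), (9, 9)), leaf_b)])
```

Here a point at (3, 3) enlarges the first rectangle by 5 units of area and the second by 20, so the descent picks `leaf_a`.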

Split Node

• Given a node, split it into two nodes that are each at least half full
• Multiple objectives:
  – minimize overlap
  – minimize covered area
• R-tree minimizes covered area
• What is an optimal criterion?

[Figure: two example splits, one minimizing overlap and one minimizing covered area]

Minimizing Covered Area

• Group objects into 2 parts such that the covered area is minimized

• NP-hard!

• Hence use heuristics

• Two heuristics explored
  – quadratic and linear

Basic Split Strategy

• /* Divide the set of M+1 entries into 2 groups G1 and G2 */

• PickSeeds for G1 and G2

• Invoke PickNext repeatedly to assign one object at a time to a group, until either all objects are assigned or one of the groups becomes half full.

• If one group becomes half full, assign the rest of the objects to the other group.

Quadratic Split

• PickSeed:
  – for each pair of entries E1 and E2, compose a rectangle J including E1.rect and E2.rect
    • let d = area(J) - area(E1.rect) - area(E2.rect) /* d is wasted space */
  – Choose the most wasteful pair, with the largest d, as seeds for groups G1 and G2.
• PickNext /* select next entry to put in a group */
  – Determine the cost of putting each entry in groups G1 and G2
    • for each unassigned entry calculate
    • d1 = area increase required in the covering rectangle of group G1 to include the entry
    • d2 = area increase required in the covering rectangle of group G2 to include the entry.
  – Select the entry with the greatest preference for a group
    • choose any entry with the maximum difference between d1 and d2
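The two routines above can be sketched in Python as follows. Rectangles are assumed to be `(low, high)` tuples and entries are referred to by their index in a list; those representation choices, and the sample rectangles, are illustrative assumptions.

```python
from itertools import combinations

def area(rect):
    lo, hi = rect
    result = 1
    for l, h in zip(lo, hi):
        result *= (h - l)
    return result

def union(a, b):
    """Smallest rectangle J covering both a and b."""
    return (tuple(min(x, y) for x, y in zip(a[0], b[0])),
            tuple(max(x, y) for x, y in zip(a[1], b[1])))

def pick_seeds(rects):
    """Indices of the pair with the largest wasted space d."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)   # O(n^2) pairs

def pick_next(rects, unassigned, cover1, cover2):
    """Unassigned index with the largest |d1 - d2|."""
    def preference(i):
        d1 = area(union(cover1, rects[i])) - area(cover1)
        d2 = area(union(cover2, rects[i])) - area(cover2)
        return abs(d1 - d2)
    return max(unassigned, key=preference)

rects = [((0, 0), (1, 1)), ((1, 0), (2, 1)), ((8, 8), (9, 9)), ((0, 1), (1, 3))]
seeds = pick_seeds(rects)
chosen = pick_next(rects, [1, 3], rects[seeds[0]], rects[seeds[1]])
```

The pairwise scan in `pick_seeds` is what makes this heuristic quadratic in the number of entries.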

Linear Split

• PickSeed
  – find extreme rectangles along each dimension
    • find the entries with the highest low side and the lowest high side
  – record the separation
  – normalize the separation by the width of the extent along the dimension
  – choose as seeds the pair that has the greatest normalized separation along any dimension
• PickNext
  – randomly choose an entry to assign

R-tree Search (Range Search on range S)

• Start from root

• If node T is not a leaf
  – check entries E in T to determine if E.rectangle overlaps S
  – for all overlapping entries, invoke search recursively
• If T is a leaf
  – check each entry to see if it satisfies the range query
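A short recursive sketch of this search in Python, assuming rectangles are `(low, high)` tuples and a hypothetical `Node` class whose entries are `(rect, child)` pairs in interior nodes and `(rect, object)` pairs in leaves.

```python
def overlaps(a, b):
    """True if rectangles a and b intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # (rect, child) or (rect, object)

def search(node, s):
    """Return all objects whose rectangle overlaps the query range s."""
    if node.is_leaf:
        return [obj for rect, obj in node.entries if overlaps(rect, s)]
    results = []
    for rect, child in node.entries:
        if overlaps(rect, s):          # may follow several paths
            results.extend(search(child, s))
    return results

leaf1 = Node(True, [(((1, 1), (1, 1)), "a"), (((2, 2), (2, 2)), "b")])
leaf2 = Node(True, [(((8, 8), (8, 8)), "c")])
root = Node(False, [(((1, 1), (2, 2)), leaf1), (((8, 8), (8, 8)), leaf2)])
```

Because sibling rectangles may overlap, a query can descend into more than one child at each level, unlike a B-tree lookup.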

R-tree Delete

• Step D1
  – find the object and delete its entry
• Step D2
  – Condense Tree
• Step D3
  – if the root has only 1 child, shorten the tree height

Condense Tree

• If a node is underfull
  – delete its entry from the parent and add the node to a set Q
• Adjust the bounding rectangle of the parent
• Do the above recursively for all levels
• Reinsert all the orphaned entries
  – insert entries at the same level they were deleted from.

Other Multidimensional Data Structures

• Many generalizations of R-tree
  – different splitting criteria
  – different shapes of clusters (e.g., d-dimensional spheres)
  – adding redundancy to reduce search cost:
    • store objects in multiple rectangles instead of a single rectangle to reduce the cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap, causing search performance to deteriorate.
• Space partitioning data structures
  – unlike R-trees, which group objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions.
  – E.g., KD tree, quad tree, grid files, KD-Btree, HB-tree, hybrid tree.
• Space filling curves
  – superimpose an ordering on multidimensional space that preserves proximity in multidimensional space (Z-ordering, Hilbert ordering)
  – use a B-tree as an index on that ordering

KD-tree

• A main memory data structure based on binary search trees
  – can be adapted to block model of storage (KD-Btree)

• Levels rotate among the dimensions, partitioning the space based on a value for that dimension

• KD-tree is not necessarily balanced.

KD-Tree Example

[Figure: a KD-tree with root split x=5, lower-level splits y=5, y=6, x=3, y=2, x=8, and x=7, shown next to the corresponding partition of the plane]

KD-Tree Operations

• Search:
  – straightforward: just descend the tree as in binary search trees.
• Insertion:
  – look up the record to be inserted, reaching the appropriate leaf.
  – If there is room on the leaf, insert in the leaf block
  – Else, find a suitable value for the appropriate dimension and split the leaf block
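The operations above can be sketched with a small bucket KD-tree in Python. This is only an illustration under stated assumptions: leaf blocks hold at most `CAPACITY` points, the split dimension rotates with depth, and the "suitable value" is taken to be the median along that dimension. The class and function names are hypothetical.

```python
CAPACITY = 2   # hypothetical leaf block capacity

class KDNode:
    def __init__(self, points=None):
        self.points = points if points is not None else []   # leaf block
        self.dim = None       # split dimension (interior nodes only)
        self.value = None     # split value (interior nodes only)
        self.left = self.right = None

    def is_leaf(self):
        return self.dim is None

def find_leaf(node, p):
    """Descend like a binary search tree; return the leaf and its depth."""
    depth = 0
    while not node.is_leaf():
        node = node.left if p[node.dim] < node.value else node.right
        depth += 1
    return node, depth

def insert(root, p):
    leaf, depth = find_leaf(root, p)
    leaf.points.append(p)
    if len(leaf.points) > CAPACITY:
        dim = depth % len(p)                       # levels rotate among dimensions
        values = sorted(q[dim] for q in leaf.points)
        value = values[len(values) // 2]           # median as the split value
        left = [q for q in leaf.points if q[dim] < value]
        right = [q for q in leaf.points if q[dim] >= value]
        if left and right:   # if all values coincide, the block stays overfull
            leaf.dim, leaf.value = dim, value
            leaf.left, leaf.right = KDNode(left), KDNode(right)
            leaf.points = []

def contains(root, p):
    leaf, _ = find_leaf(root, p)
    return p in leaf.points

root = KDNode()
for pt in [(5, 5), (3, 2), (8, 6), (7, 1), (2, 9)]:
    insert(root, pt)
```

Nothing here rebalances the tree, which matches the earlier remark that a KD-tree is not necessarily balanced.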

Adapting KD Tree to Block Model

• Similar to B-tree, tree nodes split many ways instead of two ways
  – Risk:
    • insertion becomes quite complex and expensive.
    • No storage utilization guarantee, since when a higher level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
• Pack many interior nodes (forming a subtree) into a block.
  – Risk:
    • it may not be feasible to group nodes at lower levels into a block productively.
    • Many interesting papers on how to optimally pack nodes into blocks have recently been published.

Quad Tree

• Nodes split along all dimensions simultaneously

• Division fixed: by quadrants

• As with KD-tree we cannot make quadtree levels uniform

Quad Tree Example

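[Figure: quad tree example, with the plane divided into SW, SE, NE, and NW quadrants and axis marks at X=3, X=5, X=7, and X=8]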

Quad Tree Operations

• Insert:
  – Find the leaf node to which the point belongs
  – If room, put it there
  – Else, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
• Search:
  – straightforward: just descend the right subtree
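A minimal 2-d sketch of these operations in Python. It assumes a fixed bounding box for the root, a leaf capacity of `CAPACITY` points, and distinct points (many identical points would split forever); the `QuadNode` class is hypothetical.

```python
CAPACITY = 2   # hypothetical number of points a leaf can hold

class QuadNode:
    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)
        self.points = []        # used while the node is a leaf
        self.children = None    # [SW, SE, NW, NE] once interior

    def child_for(self, p):
        """Quadrant child containing p; the division is fixed at the box centre."""
        x0, y0, x1, y1 = self.box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(p[0] >= cx) + 2 * (p[1] >= cy)]

    def find_leaf(self, p):
        node = self
        while node.children is not None:
            node = node.child_for(p)
        return node

    def insert(self, p):
        leaf = self.find_leaf(p)
        leaf.points.append(p)
        if len(leaf.points) > CAPACITY:
            leaf.split()

    def split(self):
        """Turn this leaf into an interior node with one leaf per quadrant."""
        x0, y0, x1, y1 = self.box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadNode(x0, y0, cx, cy), QuadNode(cx, y0, x1, cy),
                         QuadNode(x0, cy, cx, y1), QuadNode(cx, cy, x1, y1)]
        points, self.points = self.points, []
        for p in points:
            self.child_for(p).insert(p)

    def contains(self, p):
        return p in self.find_leaf(p).points

tree = QuadNode(0, 0, 10, 10)
for pt in [(1, 1), (2, 2), (8, 8), (9, 1)]:
    tree.insert(pt)
```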

Grid Files

• Space Partitioning strategy but different from a tree.

• Select dividers along each dimension. Partition space into cells

• Unlike KD-tree dividers cut all the way.

• Each cell corresponds to 1 disk page.

• Many cells can point to the same page.

• Cell directory potentially exponential in the number of dimensions

Grid File Implementation

• Maintain linear scales for each dimension that contain split positions for the dimension

• Cell directory implemented as a multidimensional array.
  – /* can be large and may not fit in memory */

Grid File Search

• Exact match search: at most 2 I/Os, assuming linear scales fit in memory.
  – first use linear scales to determine the index into the cell directory
  – access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory)
  – access the appropriate bucket (1 I/O)
• Range queries:
  – use linear scales to determine the index into the cell directory.
  – access the cell directory to retrieve the bucket addresses of the buckets to visit.
  – access the buckets.
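The exact-match path can be sketched as below. For brevity the cell directory is a Python dict keyed by cell index rather than the multidimensional array described earlier, and buckets are in-memory lists; both are stand-ins, and the `GridFile` class is hypothetical.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, scales, directory, buckets):
        self.scales = scales          # per-dimension sorted split positions
        self.directory = directory    # cell index tuple -> bucket id
        self.buckets = buckets        # bucket id -> list of points

    def cell_of(self, p):
        """Use the linear scales to compute the cell index of point p."""
        return tuple(bisect_right(s, x) for s, x in zip(self.scales, p))

    def exact_match(self, p):
        bucket_id = self.directory[self.cell_of(p)]   # directory access
        return p in self.buckets[bucket_id]           # bucket access

grid = GridFile(
    scales=[[5], [4]],                                # one split per dimension
    directory={(0, 0): "b0", (0, 1): "b0",            # two cells share one page
               (1, 0): "b1", (1, 1): "b2"},
    buckets={"b0": [(1, 1), (2, 6)], "b1": [(7, 2)], "b2": [(8, 9)]},
)
```

The two dictionary lookups in `exact_match` correspond to the at most two I/Os counted above.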

Grid File Insert

• Determine the bucket into which insertion must occur.

• If space in bucket, insert.

• Else, split bucket
  – how to choose a good dimension to split?

• If the bucket split causes the cell directory to split, do so and adjust the linear scales.

• /* notice that a cell directory split results in p^(d-1) new entries being created in the cell directory */

• Insertion of these new entries potentially requires a complete reorganization of the cell directory -- expensive!

Grid File Insert

• Inserting a new split position will require the cell directory to increase by 1 column. In d-dim space, it will cause p^(d-1) new entries to be created.

Space Filling Curve

• Assumption: finite precision in representing each coordinate.

[Figure: a 4 x 4 grid with coordinates 00, 01, 10, 11 along each axis, containing objects A, B, and C]

Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5

Z(B) = 11 = 3 (common prefix to all its blocks)

Z(C1) = 0010 = 2

Z(C2) = 1000 = 8
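The shuffle used for Z(A) is a bit interleaving of the two coordinates. A minimal Python sketch, assuming the x bit comes first in each pair (which reproduces shuffle(00, 11) = 0101 from the example) and that the number of bits per coordinate is passed explicitly:

```python
def shuffle(x, y, bits):
    """Interleave the bits of x and y, most significant first, x bit leading."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

z_a = shuffle(0b00, 0b11, 2)
z_a_bits = format(z_a, "04b")
```

With 2 bits per coordinate, `z_a` is 5 and `z_a_bits` is "0101", as in the slide.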

Deriving Z-Values for a Region

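• Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until blocks are homogeneous.

[Figure: quad-tree decomposition of an object, with quadrants labeled 00, 01, 10, 11 and sub-blocks labeled 0001 and 0011]

Object's representation is 0001, 0011, 01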

Disk Based Storage

• For disk storage, represent each object based on its Z-value

• Use a B-tree index.

• Range query:
  – translate the query range to Z-values
  – search the B-tree with Z-values of data regions for matches

Nearest Neighbor Search

• Retrieve the nearest neighbor of query point Q
• Simple strategy:
  – convert the nearest neighbor search to a range search.
  – Guess a range around Q that contains at least one object, say O
    • if the current guess does not include any answers, increase the range size until an object is found.
  – Compute the distance d' between Q and O
  – Re-execute the range query with the distance d' around Q.
  – Compute the distance of Q from each retrieved object. The object at minimum distance is the nearest neighbor! Why?
  – Issues: how to guess the range; the retrieval may be sub-optimal if an incorrect range is guessed. Becomes a problem in high dimensional spaces.
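The strategy can be sketched in Python as follows. The range search is a brute-force box query standing in for an index lookup, the range is doubled when a guess comes back empty (the slides only say to increase it), and the point set must be non-empty. All names are illustrative. The answer to "Why?" is that any object closer to Q than O lies within distance d' of Q, so the second query cannot miss it.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def box_query(points, q, r):
    """Stand-in for an index range search: box of half-width r around q."""
    return [p for p in points if all(abs(x - c) <= r for x, c in zip(p, q))]

def nearest_neighbor(points, q, guess=1.0):
    r = guess
    hits = box_query(points, q, r)
    while not hits:                       # grow the guess until an object is found
        r *= 2
        hits = box_query(points, q, r)
    d = dist(q, hits[0])                  # distance d' to some object O
    candidates = box_query(points, q, d)  # re-execute with range d'
    return min(candidates, key=lambda p: dist(q, p))

points = [(0.9, 0.9), (1.2, 0.0), (5.0, 5.0)]
nn = nearest_neighbor(points, (0.0, 0.0))
```

In this example the first query finds (0.9, 0.9) at distance about 1.27, and the second query then uncovers the true nearest neighbor (1.2, 0.0).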

Nearest Neighbor Search using Range Searches

[Figure: query point Q and object A, showing the initial range search, the distance between Q and A, and the revised range search]

An optimal strategy that results in the minimum possible number of I/Os is possible using priority queues.

Alternative Strategy to Evaluating K-NN

• Let Q be the query point.

• Traverse nodes in the data structure in the order of MINDIST(Q,N), where
  – MINDIST(Q,N) = dist(Q,N), if N is an object.
  – MINDIST(Q,N) = minimum distance between Q and any object in N, if N is an interior node.

[Figure: query point Q and nodes A, B, and C, with Mindist(Q,A), Mindist(Q,B), and Mindist(Q,C) marked]
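A best-first traversal in MINDIST order can be sketched with a priority queue. The sketch assumes a hypothetical `Node` class whose interior entries are `(rect, child)` pairs and whose leaf entries are points, and it uses squared Euclidean distance to a bounding rectangle as MINDIST for a node.

```python
import heapq
import itertools

def mindist(q, rect):
    """Squared distance from q to the nearest point of rect = (low, high)."""
    lo, hi = rect
    total = 0.0
    for x, l, h in zip(q, lo, hi):
        nearest = l if x < l else h if x > h else x
        total += (x - nearest) ** 2
    return total

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # interior: (rect, child); leaf: points

def knn(root, q, k):
    counter = itertools.count()            # tie-breaker: heap never compares nodes
    heap = [(0.0, next(counter), root)]
    visited, result = [], []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            visited.append(item)
            for entry in item.entries:
                if item.is_leaf:           # a point is a degenerate rectangle
                    key, target = mindist(q, (entry, entry)), entry
                else:
                    key, target = mindist(q, entry[0]), entry[1]
                heapq.heappush(heap, (key, next(counter), target))
        else:
            result.append(item)            # objects pop in increasing distance
    return result, visited

leaf1 = Node(True, [(1, 1), (2, 2)])
leaf2 = Node(True, [(8, 8), (9, 9)])
root = Node(False, [(((1, 1), (2, 2)), leaf1), (((8, 8), (9, 9)), leaf2)])
answers, visited = knn(root, (3, 3), 2)
```

In the example, `leaf2` is never opened because both answers are popped before its MINDIST reaches the front of the queue.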

MINDIST Between Rectangle and Point

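Let $Q = \langle Q_1, Q_2, \ldots, Q_d \rangle$ and $N = [S, T]$, where $S = \langle S_1, S_2, \ldots, S_d \rangle$ and $T = \langle T_1, T_2, \ldots, T_d \rangle$. Then

$$\mathrm{MINDIST}(Q, N) = \sum_{i=1}^{d} |Q_i - n_i|^2, \quad \text{where} \quad n_i = \begin{cases} S_i & \text{if } Q_i < S_i \\ T_i & \text{if } Q_i > T_i \\ Q_i & \text{else} \end{cases}$$

[Figure: a rectangle with corners S and T and query points Q at several positions around it]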

Generalized Search Trees

• Motivation:
  – disparate applications require different data structures and access methods.
  – Requires separate code for each data structure to be integrated with the database code
    • too much effort.
    • Vendors will not spend time and energy unless the application is very important or the data structure has general applicability.
• Generalized search trees abstract the notion of data structure into a template.
  – Basic observation: most data structures are similar, and a lot of bookkeeping and implementation details are the same.
  – Different data structures can be seen as refinements of the basic GiST structure. Refinements are specified by registering a set of functions per data structure with the GiST.

GiST supports extensibility both in terms of data types and queries

• GiST is like a “template”: it defines its interface in terms of an ADT rather than physical elements (like nodes, pointers, etc.)

• The access method (AM) implementer can customize GiST by defining their own ADT class, i.e., once you define the ADT class, you have your access method implemented!

• No concern about search/insertion/deletion or structural modifications like node splits.

Integrating Multidimensional Index Structures as AMs in DBMS

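[Figure: Generalized Search Trees (GiSTs). Interior nodes hold predicates over x and y (for example, x>5 and y>4, x+y>12, x>6) that partition the plane along lines such as x=3, x=4, x=5, x=6, y=3, y=4, y=5, and x+y=12, above data nodes containing points]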

Problems with Existing Approaches for Feature Indexing

• Very high dimensionality of feature spaces -- e.g., shape may define a 100-D space.
  – Traditional multidim. data structures perform worse than linear scan at such high dimensionality (dimensionality curse).
• Arbitrary distance functions -- e.g., distance functions may change across iterations of relevance feedback.
  – Traditional multidim. data structures support a fixed distance measure -- usually Euclidean (L2) or Lmax.
• No support for multi-point queries -- as in query expansion.
  – Executing K-NN for each query point and merging results to generate K-NN for the multi-point query is very expensive.
• No support for refinement
  – Queries in later iterations do not diverge greatly from the query in previous iterations. Effort spent in previous iterations should be exploited for evaluating K-NN in future iterations.

High Dimensional Feature Indexing

Dimensionality Reduction
• transform points in high dim. space to low dim. space
• works well when data correlated into a few dimensions only
• difficult to manage in dynamic environments

Multidim. Data Structures
• design data structures that scale to high dim. spaces
• Existing proposals perform worse than linear scan over >= 10 dim. spaces [Weber, et al., VLDB 98]
• Fundamental limitation: dimensionality beyond which linear scan wins over indexing! (approx. 610)

Classification of Multidimensional Index Structures

• Data Partitioning (DP)
  – Bounding Region (BR) based, e.g., R-tree, X-tree, SS-tree, SR-tree, M-tree
  – All k dim. used to represent partitioning
  – Poor scalability to dimensionality due to high degree of overlap and low fanout at high dimensions
  – seq. scan wins for > 10D
• Space Partitioning (SP)
  – Based on disjoint partitioning of space, e.g., KDB-tree, hB-tree, LSDh-tree, VP tree, MVP tree
  – no overlap and fanout independent of dimensions
  – Poor scalability to dimensionality due to either poor storage utilization or redundant information storage requirements.

Hybrid Tree: Space Partitioning (SP) instead of Data Partitioning (DP)

[Figure: data partitioning, with bounding rectangles R1 to R4 over the data points, compared with space partitioning of the (Dim1, Dim2) space into regions A to D by the kd-tree splits dim=2 pos=3, dim=1 pos=3, and dim=2 pos=2]

Non-leaf nodes of hybrid tree organized as kd-tree

Splitting of Non-Leaf Nodes (Easy case)

[Figure: regions A to F partitioned by the splits dim=1 pos=4, dim=2 pos=3, dim=1 pos=2, dim=1 pos=6, and dim=2 pos=4, with the kd-tree shown before and after the node is split at dim=1 pos=4]

Clean split possible without violating node utilization

Splitting of Non-Leaf Nodes (Difficult case)

Clean split not possible without violating node utilization. Options:

• Always clean split; downward cascading splits (empty nodes)
• Complex splits (space overhead; tree becomes large)
• Allow overlap (avoid by relaxing node utilization, otherwise minimize overlap) (Hybrid Tree)

[Figure: kd-tree over regions A to I with splits dim=1 pos=4, dim=2 pos=3, dim=1 pos=2, dim=1 pos=6, dim=2 pos=2, dim=2 pos=5, dim=1 pos=7, and dim=2 pos=4]

Splitting of Non-Leaf Nodes (Difficult case)

[Figure: regions A to I; the kd-tree before the split, with splits dim=1 pos=4,4; dim=2 pos=3,3; dim=1 pos=2,2; dim=1 pos=6,6; dim=2 pos=2,2; dim=2 pos=5,5; dim=1 pos=7,7; dim=2 pos=4,4, and after the split, with a new root dim=2 pos=3,4 over two subtrees]

Choosing Split dimension and position: EDA (Expected Disk Accesses) Analysis

• Consider a range (cube) query with side length r along each dimension
• Prob. of range query accessing node (assuming (0,1) space and uniform query distribution)
• Prob. of range query accessing both nodes after split (increase in EDA)
• Choose the split dimension and position that minimize the increase in EDA

[Figure: node BR, and node BR expanded by (r/2) on each side along each dimension (Minkowski sum); the node is split along one dimension, with widths w and w+r marked]

Choosing Split dimension and position (based on EDA analysis)

• Data Node Splitting
  – Split dimension: split along maximum spread dimension
  – Split position: split as close to the middle as possible (without violating node utilization)
• Index Node Splitting:
  – Split dimension: argmin_j ∫ P(r) (w_j + r)/(s_j + r) dr
    • depends on the distribution of the query size
    • argmin_j (w_j + R)/(s_j + R) when all queries are cubes with side length R
  – Split position: avoid overlap if possible, else minimize overlap as much as possible without violating utilization constraints
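For the fixed-size query case, the index node rule reduces to a one-line argmin. The sketch below assumes, as a reading of the slides rather than something they state outright, that s_j is the extent of the node's BR along dimension j and w_j is the width of the overlap between the two halves if the node is split along j. The numbers are made up.

```python
def choose_split_dimension(w, s, R):
    """Return argmin_j (w[j] + R) / (s[j] + R) for cube queries of side R."""
    return min(range(len(w)), key=lambda j: (w[j] + R) / (s[j] + R))

# Hypothetical node: overlap widths w and BR extents s for three dimensions.
w = [0.0, 0.1, 0.0]
s = [0.5, 0.9, 0.2]
best = choose_split_dimension(w, s, R=0.1)
```

With these values the three ratios are about 0.167, 0.2, and 0.333, so dimension 0 is chosen: a clean split (w = 0) along a long extent adds the least expected disk access.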

Dead Space Elimination

[Figure: three panels over data regions A to D and bounding rectangles R1 to R4]

• Data Partitioning (R-tree): No dead space

• Space Partitioning (Hybrid tree): Without dead space elimination

• Space Partitioning (Hybrid tree): With dead space elimination

Dead Space Elimination

[Figure: an 8 x 8 grid with cells labeled 000 to 111 along each axis]

• Live space encoding using 3 bit precision (ELSPRECISION=3)

• Encoded Live Space (ELS) BR = (001,001,101,111)

• Bits required = 2*numdims*ELSPRECISION

• Compression = ELSPRECISION/32

• Only applied to leaf nodes
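A rough sketch of how such an encoding could be computed. The rounding rule is an assumption (lower bounds rounded down, upper bounds rounded up to whole grid cells, so the encoded rectangle never cuts off live space), and the coordinates in the example are made up; the function names are hypothetical.

```python
import math

def encode_live_space(live_lo, live_hi, part_lo, part_hi, precision):
    """Quantize a live-space BR onto a 2**precision grid over the partition."""
    cells = 1 << precision
    code_lo, code_hi = [], []
    for l, h, pl, ph in zip(live_lo, live_hi, part_lo, part_hi):
        scale = cells / (ph - pl)
        lo_cell = int(math.floor((l - pl) * scale))        # round down
        hi_cell = int(math.ceil((h - pl) * scale)) - 1     # round up, as a cell index
        code_lo.append(min(cells - 1, lo_cell))
        code_hi.append(min(cells - 1, max(0, hi_cell)))
    return code_lo, code_hi

def els_bits(numdims, precision):
    """Bits required = 2 * numdims * ELSPRECISION."""
    return 2 * numdims * precision

lo_codes, hi_codes = encode_live_space((1.2, 1.5), (5.5, 7.9), (0, 0), (8, 8), 3)
as_bits = [format(c, "03b") for c in lo_codes + hi_codes]
```

For this made-up input the codes come out as 001, 001, 101, 111, using 12 bits in total, compared with 2 * 2 * 32 = 128 bits for four 32-bit coordinates.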

Tree operations

• Search:
  – Point, range, NN-search, distance-based search as in DP techniques
  – Reason: BR representation can be derived from kd-tree representation
  – Exploit tree organization (pruning) for fast intra-node search
• Insertion:
  – recursively choose the space partition that contains the point
  – break ties arbitrarily
  – no volume computation (otherwise floating point exception at high dims)
• Deletion:
  – details in thesis

Mapping of kd-tree representation to Bounding Rectangle (BR) representation

[Figure: kd-tree with root dim=2 pos=3,4 over regions A to I, mapped to the corresponding bounding rectangles]

Search algorithms developed for R-tree can be used directly

Other Queries (Lp metrics and weights)

[Figure: range queries and k-NN queries under Euclidean distance, weighted Euclidean distance, and weighted Manhattan distance]

Advantages of Hybrid Tree

• More scalable to high dimensionalities than:
  – DP techniques (R-tree like index structures)
    • Fanout independent of dimensionality: high fanout even at high dims
    • Faster intranode search due to kd-tree-based organization
    • No overlap at lowest level, low overlap at higher levels
  – SP techniques
    • Guaranteed node utilization
    • No costly cascading splits
    • EDA-optimal choice of splits
• Supports arbitrary distance functions

Experiments

• Effect of ELS encoding
• Test scalability of hybrid tree to high dimensionalities
  – Compare performance of hybrid tree with SR-tree (data partitioning), hB-tree (space partitioning) and sequential scan
• Data sets
  – Fourier data set (16-d Fourier vectors, 1.2 million)
  – Color histograms for COREL images (64-d color histograms from 70K images)

Experimental Results

[Figure: Effect of ELS optimization. Random disk accesses vs. #bits used (0, 4, 8, 16) for 16-d, 32-d, and 64-d data]

[Figure: Comparison of various techniques in I/O cost. Random disk accesses vs. # dimensions (16, 32, 64) for Hybrid Tree, hB-tree, SR-tree, and Linear Scan]

[Figure: Comparison of various techniques in CPU time. CPU time (sec) vs. # dimensions (16, 32, 64) for Hybrid Tree, hB-tree, SR-tree, and Linear Scan]

Factor of sequential I/O to random I/O accounted for

Summary of Results

• Hybrid Tree scales well to high dimensionalities

– Outperforms linear scan even at 64-d (mainly due to significantly lower CPU cost)

• Order of magnitude better than SR-tree (DP) and hB-tree (SP) both in terms

of I/O and CPU costs at all dimensionalities

– Performance gap increases with the increase in dimensionality

• Efficiently supports arbitrary distance functions

Page 58

Exploiting Correlation in Data

[Figure (expected graph, hand drawn): Dimensionality Curse. Random disk accesses (0 to 3000) vs. # dimensions (16, 32, 64, 128, 256), for Hybrid Tree, Linear Scan and DR+Hybrid Tree]

• Dimensionality curse persists

• To achieve further scalability, dimensionality reduction (DR) is commonly used in conjunction with index structures

• Exploit correlations in high dimensional data

Page 59

Dimensionality Reduction

• First perform Principal Component Analysis (PCA), then build index on reduced space

• Distances in reduced space lower bound distances in original space

• Range queries:

– map the query point to the reduced space, run the range query with the same range, then eliminate false positives

– k-NN query (a bit more complex)

• DR increases efficiency, not quality of answers

[Figure labels: First Principal Component (PC); range r in the original space; range r in the reduced space]
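The range-query recipe on this slide can be sketched as a filter step in the reduced space followed by a refinement step in the original space. This is only an illustrative sketch: the function names are hypothetical, the retained axis is an assumed unit vector rather than the output of a real PCA, and a linear pass over the points stands in for the index on the reduced space.

```python
import math

def project(point, mean, axes):
    """Coordinates of (point - mean) along each retained orthonormal axis."""
    centered = [p - m for p, m in zip(point, mean)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(points, mean, axes, query, r):
    """Return (candidates, answers) for a range query of radius r."""
    q_red = project(query, mean, axes)
    # Filter: use the same radius r in the reduced space. Projection onto
    # orthonormal axes never increases Euclidean distance, so every true
    # answer survives this step (no false negatives).
    candidates = [p for p in points if dist(project(p, mean, axes), q_red) <= r]
    # Refine: eliminate false positives with original-space distances.
    answers = [p for p in candidates if dist(p, query) <= r]
    return candidates, answers

# Toy 2-d data; only one axis is retained.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (1.5, 0.5), (3.0, 3.0)]
mean = tuple(sum(c) / len(points) for c in zip(*points))
first_pc = (1 / math.sqrt(2), 1 / math.sqrt(2))  # assumed unit vector, not computed by PCA
candidates, answers = range_query(points, mean, [first_pc], (1.0, 1.0), 0.5)
print(candidates)
print(answers)
```

In this toy run the point (1.5, 0.5) is a false positive: it projects onto the same position as the query along the assumed axis but lies about 0.71 away in the original space, so the refinement step drops it.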

Page 60

Global Dimensionality Reduction (GDR)

[Figure labels: First Principal Component (PC); First PC]

• works well only when data is globally correlated

• otherwise too many false positives result in high query cost

• solution: find local correlations instead of global correlation

Page 61

Local Dimensionality Reduction (LDR)

[Figure: GDR (a single First PC) vs. LDR (Cluster1 with First PC of Cluster1, Cluster2 with First PC of Cluster2)]

Page 62

Overview of LDR Technique

• Identify Correlated Clusters in dataset

– Definition of correlated clusters

– Bounding loss of information

– Clustering Algorithm

• Indexing the Clusters

– Index Structure

– Point Search, Range search and k-NN search

– Insertion and deletion

Page 63

Correlated Cluster

[Figure labels: First PC (retained dim.); Second PC (eliminated dim.); Mean of all points in cluster; Centroid of cluster (projection of mean on eliminated dim)]

A set of locally correlated points = <PCs, subspace dim, centroid, points>

Page 64

Reconstruction Distance

[Figure labels: Centroid of cluster; First PC (retained dim); Second PC (eliminated dim); Point Q; Projection of Q on eliminated dim; ReconstructionDistance(Q,S)]
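Read off the figure labels, ReconstructionDistance(Q,S) is the distance between the projection of Q on the eliminated dimensions and the cluster centroid. A minimal sketch under that reading follows; the function name `recon_dist` and the representation of a cluster as a mean plus a list of orthonormal eliminated PCs are assumptions made for illustration.

```python
import math

def recon_dist(q, mean, eliminated_pcs):
    """Length of (q - mean) measured along the eliminated PCs only.

    q, mean: vectors in the original space.
    eliminated_pcs: orthonormal direction vectors of the dropped dimensions.
    """
    centered = [x - m for x, m in zip(q, mean)]
    # Coordinate of q along each eliminated PC, relative to the cluster mean.
    coords = [sum(c * a for c, a in zip(centered, pc)) for pc in eliminated_pcs]
    return math.sqrt(sum(c * c for c in coords))

# 2-d cluster with mean (1, 1): the x-axis is retained, the y-axis eliminated.
d = recon_dist((4.0, 3.0), (1.0, 1.0), [(0.0, 1.0)])
print(d)
```

A point lying exactly in the retained subspace through the mean has reconstruction distance 0, which is why this quantity serves as a measure of the information lost by dropping dimensions.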

Page 65

Reconstruction Distance Bound

[Figure labels: Centroid; First PC (retained dim); Second PC (eliminated dim); MaxReconDist (marked twice)]

ReconDist(P, S) ≤ MaxReconDist, ∀ P in S

Page 66

Other constraints

• Dimensionality bound: A cluster must not retain any more dimensions than necessary, and subspace dimensionality ≤ MaxDim

• Size bound: number of points in the cluster ≥ MinSize

Page 67

Clustering Algorithm Step 1: Construct Spatial Clusters

• Choose a set of well-scattered points as centroids (piercing set) from random sample

• Group each point P in the dataset with its closest centroid C if the Dist(P,C)

Page 68

Clustering Algorithm Step 2: Choose PCs for each cluster

• Compute PCs

Page 69

Clustering Algorithm Step 3: Compute Subspace Dimensionality

[Figure: Frac points obeying recons. bound (0 to 1) vs. #dims retained (0 to 16)]

• Assign each point to cluster that needs min dim. to accommodate it

• Subspace dim. for each cluster is the min # dims to retain to keep most points
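One way to read "the min # dims to retain to keep most points" is as a scan over the curve in the figure: increase the number of retained PCs until a chosen fraction of the cluster's points obey the reconstruction bound. The sketch below assumes that reading; `subspace_dim` and the `min_frac` threshold are hypothetical names, and the slides do not state the actual cutoff.

```python
import math

def subspace_dim(pc_coords, max_recon_dist, min_frac=0.9):
    """Smallest number of leading PCs to retain for one cluster.

    pc_coords: one tuple per point, holding its mean-centred coordinates
    along all PCs, ordered by decreasing variance.
    min_frac: assumed fraction of points that must obey the bound.
    """
    n, total_dims = len(pc_coords), len(pc_coords[0])
    for d in range(total_dims + 1):
        # With the first d PCs retained, ReconDist is the norm of the
        # coordinates along the remaining (eliminated) PCs.
        obeying = sum(
            1 for c in pc_coords
            if math.sqrt(sum(x * x for x in c[d:])) <= max_recon_dist
        )
        if obeying / n >= min_frac:
            return d
    return total_dims

coords = [(5.0, 0.1, 0.0), (-4.0, 0.2, 0.05), (3.0, -0.1, 0.0), (-2.0, 2.0, 0.0)]
print(subspace_dim(coords, max_recon_dist=0.5, min_frac=0.75))
print(subspace_dim(coords, max_recon_dist=0.5, min_frac=1.0))
```

Lowering `min_frac` trades a smaller subspace for more points that violate the bound; in the toy data the fourth point is the one that forces a second dimension when every point must fit.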

Page 70

Clustering Algorithm Step 4: Recluster points

• Assign each point P to the cluster S such that ReconDist(P,S) ≤ MaxReconDist

• If multiple such clusters, assign to first cluster (overcomes “splitting” problem)

[Figure label: Empty clusters]
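The reclustering rule can be sketched as a first-fit assignment: each point goes to the first cluster whose reconstruction bound it satisfies, and points that fit no cluster are set aside as outliers. The cluster representation (a mean plus orthonormal eliminated PCs) and the function names are assumptions for illustration only.

```python
import math

def recon_dist(p, mean, eliminated_pcs):
    """Length of (p - mean) measured along the eliminated PCs."""
    centered = [x - m for x, m in zip(p, mean)]
    coords = [sum(c * a for c, a in zip(centered, pc)) for pc in eliminated_pcs]
    return math.sqrt(sum(c * c for c in coords))

def recluster(points, clusters, max_recon_dist):
    """clusters: list of (mean, eliminated_pcs). Returns (members, outliers)."""
    members = [[] for _ in clusters]
    outliers = []
    for p in points:
        for i, (mean, elim) in enumerate(clusters):
            if recon_dist(p, mean, elim) <= max_recon_dist:
                members[i].append(p)  # first qualifying cluster wins
                break
        else:
            outliers.append(p)  # no cluster satisfies the bound
    return members, outliers

clusters = [
    ((0.0, 0.0), [(0.0, 1.0)]),  # retains the x direction
    ((5.0, 5.0), [(1.0, 0.0)]),  # retains the y direction
]
points = [(3.0, 0.1), (5.2, 9.0), (5.0, 0.0), (2.0, 3.0)]
members, outliers = recluster(points, clusters, max_recon_dist=0.5)
print(members)
print(outliers)
```

The point (5.0, 0.0) satisfies the bound for both clusters and lands in the first one, which mirrors the slide's rule for ties. Clusters that attract no points simply end up with an empty member list.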

Page 71

Clustering Algorithm Step 5: Map points

• Eliminate small clusters

• Map each point to subspace (also store reconstruction dist.)

[Figure label: Map]

Page 72

Clustering Algorithm Step 6: Iterate

• Iterate for more clusters as long as new clusters are being found among outliers

• Overall Complexity: 3 passes, O(ND^2K)

Page 73

Experiments (Part 1)

• Precision Experiments:

– Compare information loss in GDR and LDR for same reduced dimensionality

– Precision = |Orig. Space Result|/|Reduced Space Result| (for range queries)

– Note: precision measures efficiency, not answer quality
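As a small worked example of the precision formula for range queries, the sketch below divides the size of the original-space result by the size of the reduced-space result. The function name and the handling of an empty reduced-space result are assumptions, not something the slides specify.

```python
def precision(original_result, reduced_result):
    """Precision = |original-space result| / |reduced-space result|."""
    reduced = set(reduced_result)
    if not reduced:
        # Assumption: an empty candidate set contains no false positives.
        return 1.0
    return len(set(original_result)) / len(reduced)

# Three true answers, six candidates returned from the reduced space.
print(precision({1, 2, 3}, {1, 2, 3, 4, 5, 6}))
```

Because reduced-space distances lower bound original-space distances, the reduced-space result is a superset of the true answer, so the value stays between 0 and 1. A lower value means more false positives to eliminate, which is why it reflects query cost rather than answer quality.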

Page 74

Datasets

• Synthetic dataset:

– 64-d data, 100,000 points, generates clusters in different subspaces (cluster sizes and subspace dimensionalities follow Zipf distribution), contains noise

• Real dataset:

– 64-d data (8X8 color histograms extracted from 70,000 images in Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures

Page 75

Precision Experiments (1)

[Figure: Sensitivity of prec. to skew. Prec. (0 to 1) vs. skew in cluster size (0, 0.5, 1, 2), for LDR and GDR]

[Figure: Sensitivity of prec. to num clus. Prec. (0 to 1) vs. number of clusters (1, 2, 5, 10), for LDR and GDR]

Page 76

Precision Experiments (2)

[Figure: Sensitivity of prec. to correlation. Prec. (0 to 1) vs. degree of correlation (0, 0.02, 0.05, 0.1, 0.2), for LDR and GDR]

[Figure: Sensitivity of prec. to reduced dim. Prec. (0 to 1) vs. reduced dim (7, 10, 12, 14, 23, 42), for LDR and GDR]

Page 77

Index structure

[Figure: Root containing pointers to root of each cluster index (also stores PCs and subspace dim.); below it, Index on Cluster 1 through Index on Cluster K, and the set of outliers (no index: sequential scan)]

Properties:

(1) disk based

(2) height ≤ 1 + height(original space index)

(3) almost balanced

Page 78

Experiments (Part 2)

• Cost Experiments:

– Compare linear scan, Original Space Index (OSI), GDR and LDR in terms of I/O and CPU costs. We used hybrid tree index structure for OSI, GDR and LDR.

• Cost Formulae:

– Linear Scan: I/O cost (#rand accesses) = file_size/10, CPU cost

– OSI: I/O cost = num index nodes visited, CPU cost

– GDR: I/O cost = index cost + post processing cost (to eliminate false positives), CPU cost

– LDR: I/O cost = index cost + post processing cost + outlier_file_size/10, CPU cost
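The I/O cost formulae above can be written out as small functions, which makes the role of the divisor 10 explicit: sequentially scanned data is charged at one tenth of a random access. The function names, and the assumption that file sizes are counted in disk pages, are illustrative and not taken from the slides.

```python
SEQ_TO_RANDOM = 10  # divisor from the slide: sequential reads charged at 1/10

def linear_scan_io(file_size_pages):
    """Linear scan: the whole file is read sequentially."""
    return file_size_pages / SEQ_TO_RANDOM

def osi_io(index_nodes_visited):
    """Original space index: one random access per index node visited."""
    return index_nodes_visited

def gdr_io(index_cost, post_processing_cost):
    """GDR: index traversal plus accesses to eliminate false positives."""
    return index_cost + post_processing_cost

def ldr_io(index_cost, post_processing_cost, outlier_file_size_pages):
    """LDR: as GDR, plus a sequential scan of the outlier set."""
    return index_cost + post_processing_cost + outlier_file_size_pages / SEQ_TO_RANDOM

print(linear_scan_io(20000))
print(ldr_io(120, 30, 500))
```

The numbers passed in are made-up inputs chosen only to exercise the formulae; they are not measurements from the experiments.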

Page 79

I/O Cost (#random disk accesses)

[Figure: I/O cost comparison. #rand disk acc (0 to 3000) vs. reduced dim (7, 10, 12, 14, 23, 42, 50, 60), for LDR, GDR, OSI and Lin Scan]

Page 80

CPU Cost (only computation time)

[Figure: CPU cost comparison. CPU cost in sec (0 to 80) vs. reduced dim (7, 10, 12, 14, 23, 42), for LDR, GDR, OSI and Lin Scan]

Page 81

Summary of LDR

• LDR is a powerful dimensionality reduction technique for high dimensional data

– reduces dimensionality with lower loss in distance information compared to GDR

– achieves significantly lower query cost compared to linear scan, original space index and GDR

• LDR is a general technique to deal with high dimensionality

– our experience shows high dimensional datasets often have local correlations; LDR is the only technique that can discover/exploit them

– applications beyond indexing: selectivity estimation, data mining etc. on high dimensional data (currently exploring)