
Page 1: Clique and sting

CLIQUE and STING

Dr S. Natarajan, Professor and Key Resource Person

Department of Information Science and Engineering

PES Institute of Technology, Bengaluru

[email protected]

Page 2: Clique and sting

High-dimensional integration

• High-dimensional integrals in statistics, ML, physics

• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning

• Curse of dimensionality: quadrature involves a weighted sum over an exponential number of items (e.g., units of volume)

[Figure: an n-dimensional hypercube with L intervals per side contains L^n units of volume (L, L^2, L^3, L^4, ..., L^n)]

Page 3: Clique and sting

High Dimensional Indexing Techniques

• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)
  – Sequential scan is better at high dimensionality (the dimensionality curse)

• Dimensionality reduction (e.g., Principal Component Analysis (PCA)), then build the index on the reduced space

Page 4: Clique and sting

Datasets

• Synthetic dataset: 64-d data, 100,000 points; generates clusters in different subspaces (cluster sizes and subspace dimensionalities follow a Zipf distribution); contains noise

• Real dataset: 64-d data (8×8 color histograms extracted from 70,000 images in the Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures

Page 5: Clique and sting


Preliminaries – Nearest Neighbor Search

• Given a collection of data points and a query point in m-dimensional metric space, find the data point that is closest to the query point

• Variation: k-nearest neighbor

• Relevant to clustering and similarity search

• Applications: Geographical Information Systems, similarity search in multimedia databases
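A minimal brute-force sketch of the nearest-neighbor problem just described (illustrative only; the function name and random data are hypothetical, not from the slides):

```python
import numpy as np

def knn(data: np.ndarray, query: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k points in `data` closest to `query` (Euclidean)."""
    dists = np.linalg.norm(data - query, axis=1)  # distance to every point
    return np.argsort(dists)[:k]                  # indices of the k smallest

rng = np.random.default_rng(0)
points = rng.random((1000, 64))   # 1000 points in a 64-d space
print(knn(points, rng.random(64), k=5))
```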

Page 6: Clique and sting

NN Search (Cont.)

Source: [2]

Page 7: Clique and sting


Problems with High Dimensional Data

• A point’s nearest neighbor (NN) loses meaning

Source: [2]

Page 8: Clique and sting

Problems (Cont.)

• NN query cost degrades – there are more strong candidates to compare with

• In as few as 10 dimensions, a linear scan outperforms some multidimensional indexing structures (e.g., the SS-tree, R*-tree, SR-tree)

• Biology and genomic data can have dimensions in the 1000s

Page 9: Clique and sting

Problems (Cont.)

• The presence of irrelevant attributes decreases the tendency of clusters to form

• Points in high-dimensional space have a high degree of freedom; they can be so scattered that they appear uniformly distributed

Page 10: Clique and sting

Problems (Cont.)

• In which cluster does the query point fall?

Page 11: Clique and sting

The Curse

• Refers to the decrease in performance of query processing when the dimensionality increases

• The focus of this talk is on quality issues of NN search, not on performance issues

• In particular, under certain conditions, the distance between the nearest point and the query point approaches the distance between the farthest point and the query point as the dimensionality approaches infinity
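This vanishing contrast is easy to observe empirically. A small sketch (assuming uniformly distributed points; all names are illustrative) that prints the farthest-to-nearest distance ratio as the dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(1)
for dim in (2, 10, 100, 1000):
    data = rng.random((10_000, dim))      # uniform points in the unit hypercube
    d = np.linalg.norm(data - rng.random(dim), axis=1)
    print(f"dim={dim:4d}  DMAX/DMIN = {d.max() / d.min():.2f}")  # shrinks toward 1
```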

Page 12: Clique and sting

Curse (Cont.)

Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE Conference, 2001.

Page 13: Clique and sting

Unstable NN-Query

A nearest neighbor query is unstable for a given ε > 0 if the distance from the query point to most data points is less than (1+ε) times the distance from the query point to its nearest neighbor

Source: [2]

Page 14: Clique and sting

Theorem (Cont.)

Source: [2]

Page 15: Clique and sting

Theorem (Cont.)

Source: [1]

Page 16: Clique and sting


Rate of Convergence

• At what dimensionality do NN queries become unstable? This is not easy to answer, so experiments were performed on real and synthetic data.

• If the conditions of the theorem are met, DMAX_m/DMIN_m should decrease with increasing dimensionality

Page 17: Clique and sting

Conclusions

• Make sure there is enough contrast between the query and the data points. If the distance to the NN is not much different from the average distance, the NN may not be meaningful

• When evaluating high-dimensional indexing techniques, one should use data that do not satisfy Theorem 1, and should compare against a linear scan

• Meaningfulness also depends on how you describe the object that is represented by the data point (i.e., the feature vector)

Page 18: Clique and sting


Other Issues

• After selecting relevant attributes, the dimensionality could still be high

• Cases where the data does not yield any meaningful nearest neighbor (i.e., indistinctive nearest neighbors) should be reported

Page 19: Clique and sting

Sudoku

• How many ways are there to fill a valid sudoku square?

• Sum over 9^81 ≈ 10^77 possible squares (items); w(x)=1 if x is a valid square, w(x)=0 otherwise

• Accurate solution within seconds: 1.634×10^21 vs 6.671×10^21

Page 20: Clique and sting

MDL

Page 21: Clique and sting

Minimum Description Length Principle

Occam's razor: prefer the simplest hypothesis

Simplest hypothesis ⇒ the hypothesis with the shortest description length

Minimum description length ⇒ prefer the shortest hypothesis

L_C(x) is the description length for message x under coding scheme C

$$h_{MDL} = \arg\min_{h \in H} \; L_{C_1}(h) + L_{C_2}(D \mid h)$$

where $L_{C_1}(h)$ is the number of bits to encode hypothesis h (the complexity of the model) and $L_{C_2}(D \mid h)$ is the number of bits to encode data D given h (the number of mistakes).

Page 22: Clique and sting

MDL: Interpretation of –log P(D|H) + K(H)

K(H) is the minimum description length of H. –log P(D|H) is the minimum description length of D (the experimental data) given H. That is, if H perfectly explains D, then P(D|H) = 1 and this term is 0. If not perfect, this term is interpreted as the number of bits needed to encode the errors.

MDL: Minimum Description Length principle (J. Rissanen): given data D, the best theory for D is the theory H which minimizes the sum of the length of encoding H and the length of encoding D based on H (encoding errors).
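A toy sketch of MDL-style model selection; the bit counts below are made-up numbers, purely for illustration of the trade-off:

```python
def mdl_score(model_bits: float, errors: int, bits_per_error: float = 8.0) -> float:
    """L(h) + L(D|h): bits to encode the hypothesis plus bits to encode its errors."""
    return model_bits + errors * bits_per_error

# Hypothetical: a simple model costs 50 bits but makes 20 mistakes;
# a complex model costs 400 bits but makes only 2 mistakes.
simple, complex_ = mdl_score(50, 20), mdl_score(400, 2)
print("prefer simple" if simple < complex_ else "prefer complex")  # -> prefer simple
```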

Page 23: Clique and sting

CLIQUE: A Dimension-Growth Subspace Clustering Method

The first dimension-growth subspace clustering algorithm. Clustering starts in single-dimensional subspaces and moves upwards towards higher-dimensional subspaces.

This algorithm can be viewed as the integration of density-based and grid-based algorithms.

Page 24: Clique and sting

CLIQUE (CLustering In QUEst)

• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based:
  – It partitions each dimension into the same number of equal-length intervals
  – It partitions an m-dimensional data space into non-overlapping rectangular units
  – A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  – A cluster is a maximal set of connected dense units within a subspace

Page 25: Clique and sting

Definitions That Need to Be Known

Unit: after forming a grid structure on the space, each rectangular cell is called a unit.

Dense: a unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter.

Cluster: a cluster is defined as a maximal set of connected dense units.

Page 26: Clique and sting

Informal problem statement

Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the data set.

A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.

In CLIQUE, a cluster is defined as a maximal set of connected dense units.

Page 27: Clique and sting

Formal Problem Statement

Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space. We will refer to A1, . . . , Ad as the dimensions (attributes) of S.

The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid). The j-th component of vi is drawn from domain Aj.

Page 28: Clique and sting

The CLIQUE Algorithm (cont.)

3. Minimal description of clusters

The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions. For example:

• A ∪ B is the minimal cluster description of the shaded region.
• C ∪ D ∪ E is a non-minimal cluster description of the same region.

Page 29: Clique and sting

Clique Working: 2-Step Process

1st step – Partition the d-dimensional data space.

2nd step – Generate the minimal description of each cluster.

Page 30: Clique and sting

1st step – Partitioning

Partitioning is done for each dimension, as sketched below.
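A minimal sketch of this first step, assuming ξ equal-length intervals per dimension and a density threshold τ (the function name and the NumPy representation are my own, not from the slides):

```python
import numpy as np
from collections import Counter

def dense_units_1d(data: np.ndarray, xi: int = 10, tau: float = 0.05):
    """Partition each dimension into xi equal intervals; return the
    (dimension, interval) pairs holding more than a tau fraction of points."""
    n, d = data.shape
    dense = []
    for dim in range(d):
        col = data[:, dim]
        lo, hi = col.min(), col.max()
        width = (hi - lo) / xi if hi > lo else 1.0
        idx = np.minimum(((col - lo) / width).astype(int), xi - 1)  # interval index
        counts = Counter(idx.tolist())
        dense += [(dim, i) for i, c in counts.items() if c / n > tau]
    return dense
```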

Page 31: Clique and sting

Example continue….

Page 32: Clique and sting

continue….

The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.

This approach of selecting candidates is quite similar to the Apriori-Gen process of generating candidates.

Here it is expected that if something is dense in a higher-dimensional space, it cannot be sparse in the lower-dimensional projections.

Page 33: Clique and sting

More formally: if a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space.

Given a k-dimensional candidate dense unit, if any of its (k−1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.

So we can generate candidate dense units in k-dimensional space from the dense units found in (k−1)-dimensional space, as sketched below. The resulting space searched is much smaller than the original space.

The dense units are then examined in order to determine the clusters.
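A sketch of this Apriori-style candidate generation, where each dense unit is represented as a tuple of (dimension, interval) pairs sorted by dimension (the representation is my own choice):

```python
def candidates(prev_dense):
    """Join (k-1)-dim dense units sharing their first k-2 dimensions, then
    prune candidates that have any non-dense (k-1)-dim projection."""
    prev = set(prev_dense)
    cands = set()
    for a in prev:                                   # join step
        for b in prev:
            if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:
                cands.add(a + (b[-1],))
    return [c for c in cands                         # prune step
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))]

# e.g. two 1-d dense units on different dimensions join into a 2-d candidate:
print(candidates([((0, 3),), ((1, 7),)]))   # -> [((0, 3), (1, 7))]
```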

Page 34: Clique and sting

Intersection

Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.

Page 35: Clique and sting

2nd stage – Minimal Description

For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.

It then determines a minimal cover (logic description) for each cluster.

Page 36: Clique and sting

Effectiveness of CLIQUE

CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.

It is insensitive to the order of input objects. It scales linearly with the size of the input and is easily scalable with the number of dimensions in the data.

Page 37: Clique and sting

GRID-BASED CLUSTERING METHODS

This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

So, for example, assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters.

Page 38: Clique and sting

[Figure: our "space" is the gridded age–salary plane; age axis 20–60, salary axis 0–8 (×10,000)]

Page 39: Clique and sting

Techniques for Grid-Based Clustering

The following are some techniques that are used to perform grid-based clustering:

• CLIQUE (CLustering In QUEst)
• STING (STatistical Information Grid)
• WaveCluster

Page 40: Clique and sting

Looking at CLIQUE as an Example

CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.

CLIQUE identifies the dense units in the subspaces of high dimensional data space, and uses these subspaces to provide more efficient clustering.

Page 41: Clique and sting

How Does CLIQUE Work?

Let us say that we have a set of records that we would like to cluster in terms of n attributes. So, we are dealing with an n-dimensional space.

MAJOR STEPS:

CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.

Using this as its basis, it partitions the n-dimensional data space into non-overlapping rectangular units.

Page 42: Clique and sting

CLIQUE: Major Steps (Cont.)

Now CLIQUE's goal is to identify the dense n-dimensional units. It does this in the following way:

CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces. So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).

It then intersects the extension of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.

Page 43: Clique and sting

CLIQUE: Major Steps. (Cont.)

Each maximal set of connected dense units is considered a cluster.

Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.

The information of the subspaces is then used to find clusters in the n-dimensional space.

It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.

Page 44: Clique and sting

Example for CLIQUE

Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation and age.

The data space for this data would be 3-dimensional.

[Figure: 3-D data space with axes age, salary and vacation]

Page 45: Clique and sting

Example (Cont.)

After plotting the data objects, each dimension, (i.e., salary, vacation and age) is split into intervals of equal length.

Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.

Now, our goal is to find the dense 3-D rectangular units.

Page 46: Clique and sting

Example (Cont.)

To do this, we find the dense units of the subspaces of this 3-d space.

So, we find the dense units with respect to age for salary. This means that we look at the salary–age plane and find all the 2-D rectangular units that are dense.

We also find the dense 2-D rectangular units for the vacation–age plane.

Page 47: Clique and sting

Example 1

[Figure: histograms of dense units in the salary–age plane (salary in units of 10,000) and the vacation–age plane (vacation in weeks); age axis 20–60]

Page 48: Clique and sting

Example (Cont.)

Now let us try to visualize the dense units of the two planes on the following 3-d figure :

[Figure: the dense units of the salary–age and vacation–age planes shown on a 3-D (age, vacation, salary) grid; salary interval 30–50]

Page 49: Clique and sting

Example (Cont.)

We can extend the dense areas in the vacation-age plane inwards.

We can extend the dense areas in the salary-age plane upwards.

The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist.

We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.

Page 50: Clique and sting

Example (Cont.)

Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary–vacation plane, in order to get all the 3-d dense units.

So, what was the main idea? We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.

After finding the dense units, it is very easy to find clusters.

Page 51: Clique and sting

Reflecting upon CLIQUE

Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?

Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned.

The property for CLIQUE says that if a k-dimensional unit is dense then so are its projections in the (k-1) dimensional space.

Page 52: Clique and sting

Strength and Weakness of CLIQUE

Strength
• It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
• It is quite efficient.
• It is insensitive to the order of records in the input and does not presume some canonical data distribution.
• It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.

Weakness
• The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

Page 53: Clique and sting

CLIQUE: The Major Steps

• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle.
• Identify clusters:
  – Determine dense units in all subspaces of interest
  – Determine connected dense units in all subspaces of interest (see the sketch below)
• Generate a minimal description for the clusters:
  – Determine maximal regions that cover a cluster of connected dense units, for each cluster
  – Determine a minimal cover for each cluster
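A sketch of the cluster-identification step, grouping dense units of the same subspace into maximal connected sets via breadth-first search (unit representation as in the earlier sketches; illustrative only):

```python
from collections import deque

def find_clusters(dense_units):
    """Return maximal connected sets of dense units; two units are connected
    if they share a face (same subspace, one interval apart in one dimension)."""
    def adjacent(u, v):
        if [d for d, _ in u] != [d for d, _ in v]:   # must be in the same subspace
            return False
        return sum(abs(a[1] - b[1]) for a, b in zip(u, v)) == 1
    unvisited, out = set(dense_units), []
    while unvisited:
        seed = unvisited.pop()
        comp, queue = {seed}, deque([seed])
        while queue:                                  # BFS over face-neighbors
            u = queue.popleft()
            for v in [w for w in unvisited if adjacent(u, w)]:
                unvisited.remove(v)
                comp.add(v)
                queue.append(v)
        out.append(comp)
    return out
```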

Page 54: Clique and sting

[Figure: the salary–age and vacation–age histograms and the 3-D grid from the earlier example, repeated]

Page 55: Clique and sting

Strength and Weakness of CLIQUE

• Strength
  – It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  – It is insensitive to the order of records in the input and does not presume some canonical data distribution
  – It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases

• Weakness
  – The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Page 56: Clique and sting

Global Dimensionality Reduction (GDR)

[Figure: the first principal component (PC) fitted to globally correlated vs. uncorrelated data]

• Works well only when the data is globally correlated
• Otherwise too many false positives result in high query cost
• Solution: find local correlations instead of a global correlation

Page 57: Clique and sting

Local Dimensionality Reduction (LDR)

[Figure: GDR fits one first PC to all data; LDR fits a separate first PC to Cluster1 and Cluster2]

Page 58: Clique and sting

Correlated Cluster

[Figure: a cluster with its first PC (retained dim.), second PC (eliminated dim.), the mean of all points in the cluster, and the centroid of the cluster (projection of the mean on the eliminated dim.)]

A set of locally correlated points = <PCs, subspace dim, centroid, points>

Page 59: Clique and sting

Reconstruction Distance

[Figure: a point Q, its projection on the eliminated dimension (second PC), and ReconstructionDistance(Q, S) measured from the cluster centroid along the eliminated dimension]

Page 60: Clique and sting

Reconstruction Distance Bound

[Figure: all points of the cluster lie within MaxReconDist of the retained-dimension (first PC) subspace]

ReconDist(P, S) ≤ MaxReconDist, ∀ P in S
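A minimal sketch of how the reconstruction distance could be computed, assuming `pcs` holds all d unit-length principal components of the cluster as rows and `keep` of them are retained (names are mine, not from the paper):

```python
import numpy as np

def recon_dist(point: np.ndarray, centroid: np.ndarray,
               pcs: np.ndarray, keep: int) -> float:
    """Distance lost when `point` is projected onto the first `keep` PCs."""
    coords = pcs @ (point - centroid)            # coordinates in the PC basis
    return float(np.linalg.norm(coords[keep:]))  # energy on the eliminated dims
```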

Page 61: Clique and sting

Other constraints

• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and subspace dimensionality ≤ MaxDim

• Size bound: number of points in the cluster ≥ MinSize

Page 62: Clique and sting

Clustering Algorithm Step 1: Construct Spatial Clusters

• Choose a set of well-scattered points as centroids (piercing set) from a random sample

• Group each point P in the dataset with its closest centroid C if Dist(P, C) does not exceed a threshold

Page 63: Clique and sting

Clustering Algorithm Step 2: Choose PCs for each cluster

• Compute PCs

Page 64: Clique and sting

Clustering Algorithm Step 3: Compute Subspace Dimensionality

[Figure: fraction of points obeying the reconstruction-distance bound (0–1) vs. number of dimensions retained (0–16)]

• Assign each point to the cluster that needs the minimum number of dimensions to accommodate it

• The subspace dimensionality for each cluster is the minimum number of dimensions to retain so as to keep most points

Page 65: Clique and sting

Clustering Algorithm Step 4: Recluster points

• Assign each point P to the cluster S such that ReconDist(P, S) ≤ MaxReconDist

• If there are multiple such clusters, assign to the first cluster (this overcomes the "splitting" problem); empty clusters may result

Page 66: Clique and sting

Clustering algorithm Step 5: Map points

• Eliminate small clusters

• Map each point to its subspace (also store the reconstruction distance)

Page 67: Clique and sting

Clustering algorithm Step 6: Iterate

• Iterate for more clusters as long as new clusters are being found among the outliers

• Overall complexity: 3 passes, O(N·D²·K)

Page 68: Clique and sting

Experiments (Part 1)

• Precision experiments:
  – Compare information loss in GDR and LDR for the same reduced dimensionality
  – Precision = |Orig. Space Result| / |Reduced Space Result| (for range queries)
  – Note: precision measures efficiency, not answer quality

Page 69: Clique and sting

Datasets

• Synthetic dataset: 64-d data, 100,000 points; generates clusters in different subspaces (cluster sizes and subspace dimensionalities follow a Zipf distribution); contains noise

• Real dataset: 64-d data (8×8 color histograms extracted from 70,000 images in the Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures

Page 70: Clique and sting

Precision Experiments (1)

[Figure: sensitivity of precision to skew in cluster size (0, 0.5, 1, 2) and to the number of clusters (1, 2, 5, 10), for LDR vs. GDR]

Page 71: Clique and sting

Precision Experiments (2)

[Figure: sensitivity of precision to the degree of correlation (0–0.2) and to the reduced dimensionality (7–42), for LDR vs. GDR]

Page 72: Clique and sting

Index structure

Root containing pointers to the root of each cluster index (it also stores the PCs and subspace dim.)

Index on Cluster 1, …, Index on Cluster K

Set of outliers (no index: sequential scan)

Properties: (1) disk based, (2) height ≤ 1 + height(original space index), (3) almost balanced

Page 73: Clique and sting

Cluster Indices

• For each cluster S, a multidimensional index on the (d+1)-dimensional space instead of the d-dimensional space:
  – NewImage(P,S)[j] = projection of P along the j-th PC, for 1 ≤ j ≤ d
  – NewImage(P,S)[d+1] = ReconDist(P,S)

• Better estimate: D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))

• Correctness (Lower Bounding Lemma): D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)

Page 74: Clique and sting

Effect of Extra Dimension

[Figure: I/O cost (number of random disk accesses, 0–1000) vs. reduced dimensionality (12–34), for the d-dim and (d+1)-dim indices]

Page 75: Clique and sting

Outlier Index

• Retain all dimensions
• One may build an index; otherwise use a sequential scan (we use a sequential scan for our experiments)

Page 76: Clique and sting

Query Support• Correctness:

– Query result same as original space index

• Point query, Range Query, k-NN query– similar to algorithms in multidimensional index structures– see paper for details

• Dynamic insertions and deletions– see paper for details

Page 77: Clique and sting

Experiments (Part 2)

• Cost experiments: compare linear scan, Original Space Index (OSI), GDR and LDR in terms of I/O and CPU costs. We used the hybrid tree index structure for OSI, GDR and LDR.

• Cost formulae:
  – Linear scan: I/O cost (#rand accesses) = file_size/10, plus CPU cost
  – OSI: I/O cost = number of index nodes visited, plus CPU cost
  – GDR: I/O cost = index cost + post-processing cost (to eliminate false positives), plus CPU cost
  – LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10, plus CPU cost

Page 78: Clique and sting

I/O Cost (#random disk accesses)

[Figure: I/O cost comparison, number of random disk accesses (0–3000) vs. reduced dimensionality (7–60), for LDR, GDR, OSI and linear scan]

Page 79: Clique and sting

CPU Cost (computation time only)

[Figure: CPU cost comparison, seconds (0–80) vs. reduced dimensionality (7–42), for LDR, GDR, OSI and linear scan]

Page 80: Clique and sting

Conclusion

• LDR is a powerful dimensionality reduction technique for high-dimensional data
  – it reduces dimensionality with a lower loss in distance information compared to GDR
  – it achieves significantly lower query cost compared to linear scan, the original space index, and GDR

• LDR has applications beyond high-dimensional indexing

Page 81: Clique and sting


Motivation

An object typically has dozens of attributes, and the domain for each attribute can be large.

Earlier approaches require the user to specify the subspace for cluster analysis, and user identification of subspaces is quite error-prone.

Page 82: Clique and sting


The Contribution of CLIQUE

Automatically find subspaces with high-density clusters in high dimensional attribute space

Page 83: Clique and sting


Background

A = {A1, A2, …, Ad}, where A1, A2, …, Ad are the dimensions of S, and S = A1 × A2 × … × Ad.

Units: partition every dimension into ξ intervals of equal length; a unit is u = {u1, u2, …, ud} where ui = [li, hi).

Page 84: Clique and sting


Background(Cont.)

Selectivity: the fraction of total data points contained in the unit.

Dense unit: selectivity(u) > τ (the density threshold).

Cluster: a maximal set of connected dense units.

Page 85: Clique and sting

Example

[Figure: example grid of units]

Page 86: Clique and sting


Background (Cont.)

Region: an axis-parallel rectangular set.

R ∩ C = R: R is contained in C.

Maximal region: no proper superset of R is contained in C.

Minimal description: a non-redundant covering of the cluster with maximal regions.

Page 87: Clique and sting


Example

((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))

Page 88: Clique and sting


CLIQUE Algorithm

1. Identification of dense units

2. Identification of clusters.

3. Generation of minimal description

Page 89: Clique and sting


Identification of dense units

Bottom-up algorithm, like the Apriori algorithm.

Monotonicity: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space.

Page 90: Clique and sting


Algorithm

1. Determine 1-dimensional dense units
2. k = 2
3. Generate candidate k-dimensional units from the (k−1)-dimensional dense units
4. If the candidates are not empty:
   find the dense units among them;
   k = k + 1;
   go to step 3

Page 91: Clique and sting


Algorithm – Candidate Generation

Self-joining:

insert into Ck
select u1.[l1, h1), u1.[l2, h2), …, u1.[lk-1, hk-1), u2.[lk-1, hk-1)
from Dk-1 u1, Dk-1 u2
where u1.a1 = u2.a1, u1.l1 = u2.l1, u1.h1 = u2.h1,
      u1.a2 = u2.a2, u1.l2 = u2.l2, u1.h2 = u2.h2, …,
      u1.ak-2 = u2.ak-2, u1.lk-2 = u2.lk-2, u1.hk-2 = u2.hk-2,
      u1.ak-1 < u2.ak-1

Pruning

Page 92: Clique and sting


Page 93: Clique and sting


Prune subspaces

Objective: use only the dense units that lie in "interesting" subspaces.

MDL principle: encode the input data under a given model and select the encoding that minimizes the code length.

Page 94: Clique and sting


Prune subspaces (Cont.)

Group together the dense units in the same subspace, and compute the number of points covered by each subspace.

Sort the subspaces in descending order of their coverage, then minimize the total length of the encoding:

$$CL(i) = \log_2(\mu_S(i)) + \sum_{1 \le j \le i} \log_2\left|x_{S_j} - \mu_S(i)\right| + \log_2(\mu_P(i)) + \sum_{i < j \le n} \log_2\left|x_{S_j} - \mu_P(i)\right|$$

where the coverage of subspace $S_j$ is $x_{S_j} = \sum_{u \in S_j} \mathrm{count}(u)$, and $\mu_S(i)$, $\mu_P(i)$ are the mean coverages of the selected subspaces $\{S_1,\dots,S_i\}$ and the pruned subspaces $\{S_{i+1},\dots,S_n\}$, respectively.
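A rough sketch of this pruning idea, assuming coverage values for the subspaces and clamping log2 arguments to at least 1 to keep the code lengths finite (a simplification of the formula above, not the paper's exact procedure):

```python
import math

def prune_subspaces(coverage):
    """Sort coverages descending, pick the cut i minimizing the two-group
    code length, and return (selected, pruned)."""
    xs = sorted(coverage, reverse=True)
    def bits(group):
        if not group:
            return 0.0
        mu = sum(group) / len(group)
        return math.log2(max(mu, 1)) + sum(
            math.log2(max(abs(x - mu), 1)) for x in group)
    best = min(range(1, len(xs) + 1), key=lambda i: bits(xs[:i]) + bits(xs[i:]))
    return xs[:best], xs[best:]

print(prune_subspaces([900, 850, 800, 40, 30, 20]))  # high-coverage subspaces kept
```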

Page 95: Clique and sting


Prune subspaces (Cont.)

Partitioning of the subspaces into selected and pruned sets

Page 96: Clique and sting


Finding Clusters

Page 97: Clique and sting


Generating minimal cluster descriptions

R is a cover of C.

Finding an optimal cover is NP-hard.

Solution to the problem: greedily cover the cluster by a number of maximal regions, then discard the redundant regions.

Page 98: Clique and sting


Greedy growth

1) begin with an arbitrary dense unit u C

2) Greedily grow a maximal region covering u, add to R

3) repeat 2) with all uk C are covered by some maximal regions in R

Page 99: Clique and sting


Minimal Cover

Remove from the cover the smallest maximal region which is redundant.

Repeat the procedure until no maximal region can be removed.

Page 100: Clique and sting


Performance Experiments

Page 101: Clique and sting


Comparison with BIRCH, DBSCAN

The experiments conclude that CLIQUE performs better than BIRCH and DBSCAN.

Page 102: Clique and sting


Real data experimental results

Datasets: insurance industry (Insur1, Insur2), department store (Store), bank (Bank).

In all cases, meaningful clusters embedded in lower-dimensional subspaces were discovered.

Page 103: Clique and sting


Strength

• Automatically finds clusters in subspaces
• Insensitive to the order of records
• Does not presume some canonical data distribution
• Scales linearly with the size of the input
• Tolerant of missing values

Page 104: Clique and sting


Weakness

• Depends on some parameters that are hard to pre-select: ξ (partition threshold) and τ (density threshold)
• Some potential clusters may be lost in the dense-unit pruning procedure, so the correctness of the algorithm degrades

Page 105: Clique and sting

What or who is STING?

• A singer who was the lead singer of the band The Police, then took up a solo career and won many Grammys.
• The bite of a scorpion.
• A Statistical Information Grid approach to spatial data mining.
• All of the above.

Page 106: Clique and sting

What is Spatial Data?

There are many definitions, according to specific areas. According to GIS, spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, pollution discharge sites – in short, anything that can be mapped.

Geographic features are stored as a series of coordinate values. Each point along a road or other feature is defined by a positional coordinate value, such as longitude and latitude.

A GIS stores and manages the data not as a map but as a series of layers or, as they are sometimes called, themes. When viewed in a GIS, these layers visually appear as one graphic, but are actually still independent of each other. This allows changes to specific themes without affecting the others.

Discussion Question 1: So can you define spatial data generically?

Page 107: Clique and sting

What are Spatial Databases?

• Spatial database systems aim at storing, retrieving, manipulating, querying, and analyzing geometric data.

• Special data types are necessary to model geometry and to suitably represent geometric data in database systems. These data types are usually called spatial data types, such as point, line, and region, but also include more complex types like partitions and graphs (networks).

• Understanding these data types is a prerequisite for effective construction of important components of a spatial database system (like spatial index structures, optimizers for spatial data, spatial query languages, storage management, and graphical user interfaces) and for cooperation with extensible DBMSs providing spatial type extension packages (like spatial data blades and cartridges).

• An excellent tutorial on spatial data and data types is available at: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html

Page 108: Clique and sting
Page 109: Clique and sting

[Figure: different grid levels during query processing]

Page 111: Clique and sting

Spatial Data Mining

• Discovery of interesting characteristics and patterns that may implicitly exist in spatial databases.
• Huge amounts of data, specialized in nature.
• Clustering and region-oriented queries are common problems in this domain.
• We generally deal with high-dimensional data.
• Applications: GIS, medical imaging, etc.

Page 112: Clique and sting

Huge amounts of data, specialized in nature – what are the problems?

• Complexity
• Defining geometric patterns and region-oriented queries
• The conceptual nature of the problem
• Spatial data access

Page 113: Clique and sting

STING – An Introduction

• STING is a grid-based method to efficiently process many common region-oriented queries on a set of points.

• What defines a region? You tell me! Essentially it is a set of points satisfying some criterion.

• It is a hierarchical method. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries can be answered without referring to the individual objects.

• The complexity is hence even less than O(n) – in fact, what do you think it will be?

• Link to paper: http://citeseer.nj.nec.com/wang97sting.html

Page 114: Clique and sting

Related Work

Spatial Data Mining
• Generalization Based Knowledge Discovery
  – Spatial Data Dominant
  – Non-Spatial Data Dominant
• Clustering Based Methods
  – CLARANS
  – BIRCH
  – DBSCAN

A great comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf

Page 115: Clique and sting

Generalization Based Approaches

Two types: Spatial Data Dominant and Non-Spatial Data Dominant.

Both of these require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically. The quality of the mined data depends on the structure of the hierarchy.

Computational complexity: O(n log n).

So the onus shifted to developing algorithms which discover characteristics directly from the data. This was the motivation to move to clustering algorithms.

Page 116: Clique and sting

Clustering Based Approaches

BIRCH: already covered. Remember it? Complexity? The problem with BIRCH is that it does not work well with clusters which are not spherical.

DBSCAN: already covered. Remember it? Complexity? The global parameter Eps in DBSCAN requires human participation to determine. When the point set to be clustered is the response set of objects with some qualifications, then Eps must be determined each time, and the cost is hence higher.

Page 117: Clique and sting

Clustering Based Approaches

CLARANS: Clustering Large Applications based upon RANdomized Search.

Although claims have been made that it is linear, it is essentially quadratic: the computational complexity is at least Ω(KN²), where N is the number of data points and K is the number of clusters.

The quality of the results cannot be guaranteed when N is large, as randomized search is used (see "Optimization with Randomized Search Heuristics – The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions").

Page 118: Clique and sting

Related Work

All the approaches described in the previous slides are query-dependent approaches. The structure of the queries influences the structure of the algorithm and cannot be generalized to all queries.

As they scan all the data points, the complexity will be at least O(N).

Page 119: Clique and sting

STING: THE OVERVIEW

• The spatial area is divided into rectangular cells.
• There are different levels of cells corresponding to different resolutions, and these cells form a hierarchical structure.
• Each cell at a higher level is partitioned into a number of cells of the next lower level.
• Statistical information for each cell is calculated and stored beforehand and is used to answer queries.

Page 120: Clique and sting

GRID CELL HIERARCHY

Each cell at the (i−1)-th level has 4 children at the i-th level (this can be changed).

The size of a leaf cell is dependent on the density of objects. Generally it should hold from several dozen to thousands of objects.

Page 121: Clique and sting

GRID CELL HIERARCHY

For each cell we have attribute-dependent and attribute-independent parameters.

The attribute-independent parameter is the number of objects in the cell, n.

For the attribute-dependent parameters, it is assumed that each object's attributes have numerical values. For each numerical attribute we have the following five parameters:

Page 122: Clique and sting

GRID CELL HIERARCHY

m – mean of all values in this cell
s – standard deviation of all values in this cell
min – the minimum value of the attribute in this cell
max – the maximum value of the attribute in this cell
distribution – the type of distribution this cell follows (of enumeration type)

Page 123: Clique and sting

Parameter Generation

• The determination of the dist parameter is as follows.
• First, dist is set to the distribution type followed by most points.
• An estimate is made of the number of conflicting points, confl, according to the following rules:

1) If disti ≠ dist, m ≈ mi and s ≈ si, then confl is increased by the amount ni.
2) If disti ≠ dist, and either m ≈ mi or s ≈ si is not satisfied, then confl is set to n.
3) If disti = dist, m ≈ mi and s ≈ si, then confl is not changed.
4) If disti = dist, and either m ≈ mi or s ≈ si is not satisfied, then confl is set to n.

Finally, if confl/n is greater than a threshold (say 0.05), then dist is set to NONE; otherwise the original dist is retained.

Page 124: Clique and sting

Parameter Generation

i       1       2       3       4
ni      100     50      60      10
mi      20.1    19.7    21      20.5
si      2.3     2.2     2.4     2.1
mini    4.5     5.5     3.8     7
maxi    36      34      37      40
disti   Normal  Normal  Normal  None

The parameters of the current cell are: n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL.

This is so because there are 220 data points, of which 10 do not follow the NORMAL distribution, so confl/n = 10/220 = 0.045 < 0.05 and dist is still NORMAL.

The parameters are calculated only once, so the overall compilation time is O(N). But querying requires much less time, as we only scan the K grid cells, i.e., O(K). A sketch of merging child-cell parameters into a parent cell follows.
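A sketch of how a parent cell's n, m, s, min and max can be pooled from its children using standard mean/variance identities (the dict representation is my own, not from the paper):

```python
import math

def merge_cells(children):
    """Combine child-cell statistics into the parent cell's parameters."""
    n = sum(c["n"] for c in children)
    m = sum(c["n"] * c["m"] for c in children) / n
    # pooled E[x^2] from each child's mean and standard deviation
    ex2 = sum(c["n"] * (c["s"] ** 2 + c["m"] ** 2) for c in children) / n
    return {"n": n, "m": m, "s": math.sqrt(ex2 - m * m),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

kids = [{"n": 100, "m": 20.1, "s": 2.3, "min": 4.5, "max": 36},
        {"n": 50,  "m": 19.7, "s": 2.2, "min": 5.5, "max": 34},
        {"n": 60,  "m": 21.0, "s": 2.4, "min": 3.8, "max": 37},
        {"n": 10,  "m": 20.5, "s": 2.1, "min": 7.0, "max": 40}]
print(merge_cells(kids))  # n=220, m≈20.27, min=3.8, max=40, as in the example above
```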

Page 125: Clique and sting

Query Types

If hierarchical structure cannot answer a query then can go to underlying database

SQL like Language used to describe queries

Two types of common queries found: one is to find region specifying certain constraints and other take in a region and return some attribute of the region

Page 126: Clique and sting

Query Type: Examples

Page 127: Clique and sting

Algorithm

Top-down querying. Examine cells at a higher level and determine whether each cell is relevant to the query at some confidence level. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After obtaining the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.

After doing so for the present layer, the process is repeated for the children cells of the RELEVANT cells in the present layer only!

The procedure continues until the bottommost layer.

Find the regions formed by relevant cells and return them. If this is not satisfactory, retrieve the data that fall into the relevant cells from the database and do some further processing.

Page 128: Clique and sting

Algorithm

After all cells are labeled as relevant or not relevant, we can easily find all regions that satisfy the specified density by breadth-first search.

For a relevant cell, we examine the cells within a certain distance d from the center of the current cell to see if the average density within this small area is greater than the density specified. If yes, the cells are put into a queue.

Steps 2 and 3 are repeated for all the cells in the queue, except that cells previously examined are omitted. When the queue is empty, we have identified one region.

Page 129: Clique and sting

Algorithm

The distance is d = max(l, √(f/(cπ))), where l, c and f are the side length of a bottom-layer cell, the specified density, and a small constant number set by STING (it does not vary from query to query).

l is usually the dominant term, so we generally only have to examine the neighborhood. Only if the granularity is very small do we need to examine every cell at that distance rather than just the neighborhood.

Page 130: Clique and sting

Example

Given data: houses, one of whose attributes is price.
Query: "Find those regions with area at least A where the number of houses per unit area is at least c and at least β% of the houses have price between a and b, with (1 − α) confidence," where a < b. Here, a could be −∞ and b could be +∞. This query can be written in the SQL-like language above.

We begin from the top level, working our way down. Assume the dist type is NORMAL. First we calculate the proportion of houses whose price lies in [a, b]. For a normal distribution, the probability that the price lies between a and b is p = Φ((b − m)/s) − Φ((a − m)/s), where m and s are the mean and standard deviation of all prices and Φ is the standard normal CDF.

Page 131: Clique and sting

Example

Now, as we assume prices to be independent of each other, the number of houses with price in the range [a, b] has a binomial distribution with parameters n and p, where n is the number of houses. We consider the following cases according to n, np and n(1 − p):

a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the number of houses whose prices fall into [a, b]; divide it by n to get the confidence interval for the proportion.
b) n > 30, np ≥ 5, and n(1 − p) ≥ 5: the proportion of prices falling in [a, b] has a normal distribution, and the 100(1 − α)% confidence interval of the proportion is computed from the normal approximation (see the sketch below).
c) n > 30 but np < 5: the Poisson distribution with parameter np is used for approximation.
d) n > 30 but n(1 − p) < 5: we can calculate the proportion of houses (X) whose price is not in [a, b] using the Poisson distribution with parameter n(1 − p); then 1 − X is the proportion of houses whose price is in [a, b].
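A small sketch of case b), the normal-approximation confidence interval for a proportion (the hard-coded z-values are a simplification; a stats library would normally supply them):

```python
import math

def proportion_ci(n: int, p: float, alpha: float = 0.05):
    """100(1 - alpha)% CI for a proportion when n > 30, np >= 5, n(1-p) >= 5."""
    z = {0.05: 1.96, 0.01: 2.576}[alpha]        # z-value for the confidence level
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# e.g. 60 of 200 houses in a cell have price inside [a, b]
print(proportion_ci(200, 60 / 200))
```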

Page 132: Clique and sting

Example

Once we have the confidence interval, or the estimated range [p1, p2], we can label this cell as relevant or not relevant.

Let S be the area of a cell at the bottom layer. If p1 · n < S · c · β%, we can label the cell as not relevant; otherwise, as relevant.

Page 133: Clique and sting

Analysis of STING

• Step one takes constant time. The total time for steps 2 and 3 is proportional to the total number of cells in the hierarchy.
• The total number of cells is 1.33K, where K is the number of cells at the bottom layer.
• In all cases it is found, or claimed, to be O(K).

Discussion Question: what is the complexity if we need to go to step 7 of the algorithm?

Page 134: Clique and sting

Quality

Under the following sufficient condition, STING guarantees that if a region satisfies the specification of the query, then it is returned.

Let F be a region. The width of F is defined as the side length of the maximum square that can fit in F.

Page 135: Clique and sting

Limiting Behavior of STING

The regions returned by STING are an approximation of the result of DBSCAN. As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN.

So the worst-case complexity is O(n log n)!

Page 136: Clique and sting

Performance measure

Case A: Normal distribution. The example query was answered in 0.2 s; structure generation took 9.8 s.

Case B: None (no distribution). The example query was answered in 0.22 s; structure generation took 9.7 s.

Page 137: Clique and sting

Performance measure

A benchmark called SEQUOIA 2000 was used to compare STING, DBSCAN and CLARANS.

All the previous algorithms have three phases in query answering:
1. Find the query response
2. Build an auxiliary structure
3. Do the clustering

STING does all of this in one step, so it is inherently better.

Page 138: Clique and sting

Discussion Question

“STING is trivially parallelizable.” Comment why and what is the importance of this statement?

Page 139: Clique and sting

References

• STING: A Statistical Information Grid Approach to Spatial Data Mining. Wei Wang et al.
• Optimization with Randomized Search Heuristics – The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. Stefan Droste et al.
• Efficient and Effective Clustering Methods for Spatial Data Mining. R. Ng et al.
• BIRCH: An Efficient Data Clustering Method for Very Large Databases. T. Zhang et al.
• Tutorial on spatial data types: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
• An Efficient Approach to Clustering in Large Multimedia Databases with Noise. A. Hinneburg et al.
• Comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf

Page 140: Clique and sting

Motivation

All the previous clustering algorithms are query-dependent: they are built for one query and are generally of no use for another query. They need a separate scan for each query, so the computation is complex, at least O(n).

So we need a structure built out of the database so that various queries can be answered without rescanning.

Page 141: Clique and sting

Basics

Grid-based method: quantizes the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

Develop a hierarchical structure out of the given data and answer various queries efficiently. Every level of the hierarchy consists of cells.

Answering a query is then not O(n), where n is the number of elements in the database.

Page 142: Clique and sting

A hierarchical structure for STING clustering

Page 143: Clique and sting

continue…..

The root of the hierarchy is at level 1. A cell at level i corresponds to the union of the areas of its children at level i + 1, and a cell at a higher level is partitioned to form a number of cells of the next lower level.

Statistical information for each cell is calculated and stored beforehand and is used to answer queries.

Page 144: Clique and sting

Cell parameters

Attribute-independent parameter:
n – number of objects (points) in this cell

Attribute-dependent parameters:
m – mean of all values in this cell
s – standard deviation of all values of the attribute in this cell
min – the minimum value of the attribute in this cell
max – the maximum value of the attribute in this cell
distribution – the type of distribution that the attribute values in this cell follow

Page 145: Clique and sting

Parameter Generation

n, m, s, min, and max of the bottom-level cells are calculated directly from the data.

The distribution can either be assigned by the user or obtained by hypothesis tests, e.g., the χ² test.

The parameters of higher-level cells are calculated from the parameters of the lower-level cells.

Page 146: Clique and sting

continue…..

Let n, m, s, min, max, dist be the parameters of the current cell, and let ni, mi, si, mini, maxi and disti be the parameters of the corresponding lower-level cells.

Page 147: Clique and sting

dist for the Parent Cell

Set dist to the distribution type followed by most points in this cell. Now check for conflicting points in the child cells; call this count confl:

1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an amount ni.

2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then set confl to n.

3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0.

4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.

Page 148: Clique and sting

continue…..

If confl/n is greater than a threshold t, set dist to NONE; otherwise keep the original type.

Example:

Page 149: Clique and sting

continue…..

The parameters for the parent cell would be: n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL.

There are 210 points whose distribution type is NORMAL, so the parent's dist is set to NORMAL; confl = 10, and 10/220 = 0.045 < 0.05, so we keep the original type.

Page 150: Clique and sting

Query types

The STING structure is capable of answering various queries, but if it cannot, we always have the underlying database.

Even if the statistical information is not sufficient to answer a query, we can still generate a possible set of answers.

Page 151: Clique and sting

Common queries

Select regions that satisfy certain conditions. For example: select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence:

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9

Page 152: Clique and sting

continue….

Select regions and return some function of the region. For example: select the range of age of the houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California:

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California

Page 153: Clique and sting

Algorithm

With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries.

For any query, we begin by examining cells on a high-level layer and calculate the likelihood that each cell is relevant to the query at some confidence level, using the parameters of the cell.

If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.

Page 154: Clique and sting

continue….

After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.

We proceed to the next layer, but only consider the children of the relevant cells of the upper layer, and repeat this until we reach the final layer.

The relevant cells of the final layer have enough statistical information to give a satisfactory result to the query. However, for accurate mining we may refer to the data corresponding to the relevant cells and process it further.

Page 155: Clique and sting

Finding regions

After we have all the relevant cells at the final level, we need to output the regions that satisfy the query. We can do this using breadth-first search.

Page 156: Clique and sting

Breadth-First Search

We examine the cells within a certain distance from the center of the current cell. If the average density within this small area is greater than the specified density, we mark this area and put the relevant cells just examined into a queue.

We take an element from the queue and repeat the same procedure, except that only those relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region.

Page 157: Clique and sting

Statistical Information Grid-based Algorithm

1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy structure by one level. Go to Step 2 for those cells that are children of the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
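A compact sketch of steps 1–5 of this algorithm, the top-down labeling of relevant cells (the Cell class and the relevance test are placeholders for the statistics-based test described above, not STING's actual data structures):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Cell:
    stats: dict                       # n, m, s, min, max, dist for this cell
    children: List["Cell"] = field(default_factory=list)

def relevant_cells(root: Cell, is_relevant: Callable[[Cell], bool]) -> List[Cell]:
    """Descend layer by layer, expanding only the children of relevant cells,
    and return the relevant cells of the bottom layer."""
    layer = [root]
    while layer and layer[0].children:          # stop at the bottom layer
        layer = [kid for c in layer if is_relevant(c) for kid in c.children]
    return [c for c in layer if is_relevant(c)]
```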

Page 158: Clique and sting

Time Analysis:

Step 1 takes constant time. Steps 2 and 3 require constant time per cell, so the total time is less than or equal to the total number of cells in our hierarchical structure.

Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer. So the overall computational complexity on the grid hierarchy structure is O(K).

Page 159: Clique and sting

Time Analysis:

STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n), where n is the total number of objects.

After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.

Page 160: Clique and sting

Comparison

Page 161: Clique and sting

Definitions That Need to Be Known

Spatial Data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.

Spatial Area: the area that encompasses the locations of all the spatial data is called the spatial area.

Page 162: Clique and sting

STING (Introduction)

STING is used for performing clustering on spatial data. It uses a hierarchical multi-resolution grid data structure to partition the spatial area.

STING's big benefit is that it processes many common "region-oriented" queries on a set of points efficiently.

We want to cluster the records that are in a spatial table in terms of location. The placement of a record in a grid cell is completely determined by its physical location.

Page 163: Clique and sting

Hierarchical Structure of Each Grid Cell

The spatial area is divided into rectangular cells (using latitude and longitude), and the cells form a hierarchical structure: each cell at a higher level is further partitioned into 4 smaller cells at the lower level.

In other words, each cell at the i-th level (except the leaves) has 4 children at the (i+1)-th level, and the union of the 4 child cells gives back the parent cell in the level above them.

Page 164: Clique and sting

Hierarchical Structure of Cells (Cont.)

The size of the leaf level cells and the number of layers depends upon how much granularity the user wants.

So, Why do we have a hierarchical structure for cells?

We have them in order to provide a better granularity, or higher resolution.

Page 165: Clique and sting

A Hierarchical Structure for Sting Clustering

Page 166: Clique and sting

Statistical Parameters Stored in Each Cell

For each cell in each layer we have attribute-dependent and attribute-independent parameters.

Attribute-independent parameter:
Count – number of records in this cell.

Attribute-dependent parameters (assuming the attribute values are real numbers):

Page 167: Clique and sting

Statistical Parameters (Cont.)

For each attribute of each cell we store the following parameters:

M – mean of all values of the attribute in this cell
S – standard deviation of all values of the attribute in this cell
Min – the minimum value of the attribute in this cell
Max – the maximum value of the attribute in this cell
Distribution – the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.); None is assigned if the distribution is unknown

Page 168: Clique and sting

Storing of Statistical Parameters

Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.

The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table. The statistical parameters for the cells in all the other levels are computed from their respective child cells in the level below.

Page 169: Clique and sting

How are Queries Processed?

STING can answer many queries (especially region queries) efficiently, because we do not have to access the full database.

How are spatial data queries processed?
• We use a top-down approach to answer spatial data queries.
• Start from a pre-selected layer, typically with a small number of cells. The pre-selected layer does not have to be the topmost layer.
• For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.

Page 170: Clique and sting

Query Processing (Cont.)

The confidence interval is calculated by using the statistical parameters of each cell.

Remove irrelevant cells from further consideration.

When finished with the current layer, proceed to the next lower level.

Processing of the next lower level examines only the remaining relevant cells.

Repeat this process until the bottom layer is reached.

Page 171: Clique and sting

Sample Query Examples

Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens, and that our records represent apartments present throughout this region.

Query: "Find all the apartments that are for rent near Stony Brook University that have a rent range of $800 to $1000."

The above query depends upon the parameter "near." For our example, near means within 15 miles of Stony Brook University.