TRANSCRIPT
CLIQUE and STING
Dr S. Natarajan, Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology, Bengaluru
High-dimensional integration
• High-dimensional integrals in statistics, ML, physics
• Expectations / model averaging
• Marginalization
• Partition function / rank models / parameter learning
• Curse of dimensionality:
– Quadrature involves a weighted sum over an exponential number of items (e.g., units of volume)
[Figure: an n-dimensional hypercube with side lengths L1, L2, L3, …, Ln.]
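To make the cost contrast concrete, here is a minimal Python sketch (the integrand and node counts are illustrative, not from the slides): midpoint quadrature needs nodes_per_dim**d evaluations, which is exponential in d, while Monte Carlo uses a fixed sample budget regardless of d.

```python
import itertools
import random

def f(x):
    # Illustrative integrand over the unit hypercube: product of coordinates.
    # Its true integral over [0,1]^d is 0.5**d.
    p = 1.0
    for xi in x:
        p *= xi
    return p

def grid_quadrature(d, nodes_per_dim=4):
    # Midpoint rule: a weighted sum over nodes_per_dim**d volume units.
    h = 1.0 / nodes_per_dim
    pts = [h * (i + 0.5) for i in range(nodes_per_dim)]
    return sum(f(x) for x in itertools.product(pts, repeat=d)) * h**d

def monte_carlo(d, n_samples=10_000):
    # n_samples evaluations no matter how large d gets.
    return sum(f([random.random() for _ in range(d)])
               for _ in range(n_samples)) / n_samples

d = 8
print("true value  :", 0.5**d)
print("quadrature  :", grid_quadrature(d), f"({4**d} evaluations)")
print("monte carlo :", monte_carlo(d), "(10000 evaluations)")
```

At d = 8 the grid already needs 65,536 evaluations; at 64 dimensions (the datasets used later in this talk) it would need 4^64, which is why sampling-based estimates are the only practical option.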
High Dimensional Indexing Techniques
• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)
– Sequential scan performs better at high dimensionality (the dimensionality curse)
• Dimensionality reduction (e.g., Principal Component Analysis (PCA)), then build the index on the reduced space
Preliminaries – Nearest Neighbor Search
• Given a collection of data points and a query point in m-dimensional metric space, find the data point that is closest to the query point
• Variation: k-nearest neighbor
• Relevant to clustering and similarity search
• Applications: Geographical Information Systems, similarity search in multimedia databases
NN Search Con’t
Source: [2]
Problems with High Dimensional Data
• A point’s nearest neighbor (NN) loses meaning
Source: [2]
Problems Con’t
• NN query cost degrades: there are more strong candidates to compare against
• In as few as 10 dimensions, linear scan outperforms some multidimensional indexing structures (e.g., SS-tree, R*-tree, SR-tree)
• Biology and genomic data can have dimensionality in the thousands
Problems Con’t
• The presence of irrelevant attributes decreases the tendency of clusters to form
• Points in high-dimensional space have a high degree of freedom; they can be so scattered that they appear uniformly distributed
Problems Con’t• In which cluster does the query point fall?
The Curse
• Refers to the degradation of query-processing performance as dimensionality increases
• The focus of this talk is on quality issues of NN search, not on performance issues
• In particular, under certain conditions, the distance between the nearest point and the query point approaches the distance between the farthest point and the query point as dimensionality approaches infinity
Curse Con’t
Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE Conference, 2001.
Unstable NN-Query
A nearest neighbor query is unstable for a given ε > 0 if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor
Source: [2]
Theorem Con’t
Source: [2]
Theorem Con’t
Source: [1]
Rate of Convergence
• At what dimensionality do NN-queries become unstable? This is not easy to answer analytically, so experiments were performed on real and synthetic data.
• If the conditions of the theorem are met, DMAX_m/DMIN_m should decrease with increasing dimensionality
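The shrinking DMAX_m/DMIN_m ratio is easy to reproduce numerically. A minimal numpy sketch, assuming i.i.d. uniform data (the point counts and dimensionalities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimensionality m, measure DMAX_m / DMIN_m from a random query
# point to an i.i.d. uniform dataset. Under the theorem's conditions the
# ratio should approach 1 as m grows, i.e. the NN query becomes unstable.
for m in (2, 10, 100, 1000):
    data = rng.uniform(size=(2000, m))
    query = rng.uniform(size=m)
    dists = np.linalg.norm(data - query, axis=1)
    print(f"m = {m:4d}   DMAX/DMIN = {dists.max() / dists.min():.2f}")
```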
Conclusions
• Make sure there is enough contrast between the query and the data points. If the distance to the NN is not much different from the average distance, the NN may not be meaningful
• When evaluating high-dimensional indexing techniques, one should use data that do not satisfy Theorem 1 and should compare against linear scan
• Meaningfulness also depends on how you describe the object that is represented by the data point (i.e., the feature vector)
Other Issues
• After selecting relevant attributes, the dimensionality could still be high
• Reporting cases when data does not yield any meaningful nearest neighbor, i.e. indistinctive nearest neighbors
Sudoku
• How many ways are there to fill a valid Sudoku square?
• Sum over 9^81 ≈ 10^77 possible squares (items); w(x) = 1 if x is a valid square, w(x) = 0 otherwise
• An accurate solution within seconds: 1.634×10^21 vs. 6.671×10^21
MDL
Minimum Description Length Principle
Occam's razor: prefer the simplest hypothesis.
Simplest hypothesis = the hypothesis with the shortest description length.
Minimum description length = prefer the shortest hypothesis.
L_C(x) is the description length for message x under coding scheme C.

h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D | h) ]

where L_C1(h) is the number of bits to encode hypothesis h (the complexity of the model), and L_C2(D | h) is the number of bits to encode data D given h (the number of mistakes).
MDL: Interpretation of −log P(D|H) + K(H)
K(H) is the minimum description length of H. −log P(D|H) is the minimum description length of D (the experimental data) given H. That is, if H perfectly explains D, then P(D|H) = 1 and this term is 0. If H is not perfect, this term is interpreted as the number of bits needed to encode the errors.
MDL: Minimum Description Length principle (J. Rissanen): given data D, the best theory for D is the theory H which minimizes the sum of (1) the length of the encoding of H, and (2) the length of the encoding of D based on H (encoding the errors).
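As a toy illustration of the trade-off (the two hypotheses, their bit costs, and the data are invented for the example): a more complex hypothesis only wins if the bits it saves on encoding the errors exceed its extra description length.

```python
import math

def mdl_cost(model_bits, p_data_given_h):
    # MDL cost = K(H) + (-log2 P(D|H)): model bits plus error-encoding bits.
    return model_bits + (-math.log2(p_data_given_h))

# Data: 100 coin flips, 60 heads.
flips, heads = 100, 60

def likelihood(p):
    return p**heads * (1 - p)**(flips - heads)

# Hypothetical description lengths: the fair coin is nearly free to
# describe, the biased coin needs a few bits for its parameter.
for name, k_bits, p in (("H1: fair coin", 1, 0.5), ("H2: p = 0.6", 8, 0.6)):
    print(name, "-> MDL cost =", round(mdl_cost(k_bits, likelihood(p)), 1), "bits")
```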
CLIQUE: A Dimension-Growth Subspace Clustering Method
The first dimension-growth subspace clustering algorithm. Clustering starts in single-dimension subspaces and moves upward toward higher-dimensional subspaces.
This algorithm can be viewed as an integration of density-based and grid-based algorithms.
CLIQUE (CLustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based:
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
Definitions That Need to Be Known
Unit : After forming a grid structure on the space, each rectangular cell is called a Unit.
Dense: A unit is dense, if the fraction of total data points contained in the unit exceeds the input model parameter.
Cluster: A cluster is defined as a maximal set of connected dense units.
Informal problem statement
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the data set.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
Formal Problem Statement
Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We will refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = ⟨vi1, vi2, . . . , vid⟩. The jth component of vi is drawn from domain Aj.
The CLIQUE Algorithm (cont.)
3. Minimal description of clusters
The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions.
For example:
• A ∪ B is a minimal cluster description of the shaded region.
• C ∪ D ∪ E is a non-minimal cluster description of the same region.
CLIQUE Working: a 2-Step Process
1st step – Partition the d-dimensional data space.
2nd step – Generate the minimal description of each cluster.

1st step – Partitioning
Partitioning is done for each dimension.
Example (continued)
The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach to selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
The expectation is that if something is dense in a higher-dimensional space, it cannot be sparse in its lower-dimensional projections.
More formally
If a k-dimensional unit is dense, then so are its projections in (k−1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k−1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-dimensional space from the dense units found in (k−1)-dimensional space, as sketched below.
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
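A minimal Python sketch of this Apriori-style step (the representation of a unit as a frozenset of (dimension, interval) pairs is a simplification for illustration, not the paper's data structure): join dense (k−1)-dimensional units that overlap in k−2 dimensions, then prune any candidate with a non-dense projection.

```python
from itertools import combinations

def generate_candidates(dense_prev):
    """Generate candidate k-dimensional dense units from the
    (k-1)-dimensional dense units, with Apriori-style pruning."""
    dense = set(dense_prev)
    candidates = set()
    for u1, u2 in combinations(dense, 2):
        merged = u1 | u2
        # The two units must differ in exactly one dimension.
        if len(merged) == len(u1) + 1:
            # Prune: every (k-1)-dimensional projection must itself be dense.
            if all(frozenset(merged - {item}) in dense for item in merged):
                candidates.add(frozenset(merged))
    return candidates

# Toy example: dense 2-d units over dimensions 0, 1, 2 (interval index per dim).
dense_2d = {frozenset({(0, 3), (1, 5)}),
            frozenset({(0, 3), (2, 7)}),
            frozenset({(1, 5), (2, 7)})}
print(generate_candidates(dense_2d))  # one 3-d candidate survives
```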
Intersection
Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.
2nd stage – Minimal Description
For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.
It then determines a minimal cover (logic description) for each cluster.
Effectiveness of CLIQUE
• CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
• It is insensitive to the order of input objects.
• It scales linearly with the size of the input.
• It scales easily with the number of dimensions in the data.
GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
So, for example assume that we have a set of records and we want to cluster with respect to two attributes, then, we divide the related space (plane), into a grid structure and then we find the clusters.
[Figure: a 2-D grid over the age–salary plane. Age (20–60) on the x-axis, salary (×10,000; 0–8) on the y-axis. Our "space" is this plane.]
Techniques for Grid-Based Clustering
The following are some techniques that are used to perform grid-based clustering:
• CLIQUE (CLustering In QUEst)
• STING (STatistical Information Grid)
• WaveCluster
Looking at CLIQUE as an Example
CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.
CLIQUE identifies the dense units in the subspaces of high dimensional data space, and uses these subspaces to provide more efficient clustering.
How Does CLIQUE Work?
Let us say that we have a set of records that we would like to cluster in terms of n attributes.
So, we are dealing with an n-dimensional space.
MAJOR STEPS:
CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.
Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
CLIQUE: Major Steps (Cont.)
Now CLIQUE's goal is to identify the dense n-dimensional units. It does this in the following way:
CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces.
So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.
CLIQUE: Major Steps. (Cont.)
Each maximal set of connected dense units is considered a cluster.
Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
The information of the subspaces is then used to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.
Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes, namely, salary, vacation and age.
The data space for this data would be 3-dimensional.
[Figure: a 3-D data space with axes age, salary, and vacation.]
Example (Cont.)
After plotting the data objects, each dimension, (i.e., salary, vacation and age) is split into intervals of equal length.
Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.
Now, our goal is to find the dense 3-D rectangular units.
Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-d space.
So, we first find the dense units with respect to salary and age. This means that we look at the salary–age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
Example 1
[Figure: dense units found in the salary–age plane (salary ×10,000 vs. age 20–60) and in the vacation–age plane (vacation in weeks vs. age 20–60).]
Example (Cont.)
Now let us try to visualize the dense units of the two planes on the following 3-d figure :
[Figure: the dense units of the two planes visualized on a 3-D cube with axes age, vacation, and salary.]
Example (Cont.)
We can extend the dense areas in the vacation-age plane inwards.
We can extend the dense areas in the salary-age plane upwards.
The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist.
We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.
Example (Cont.)
Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-d dense units.
So, what was the main idea?
We used the dense units in the subspaces in order to find the dense units in the 3-dimensional space.
After finding the dense units, it is very easy to find clusters.
Reflecting upon CLIQUE
Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned.
The property for CLIQUE says that if a k-dimensional unit is dense then so are its projections in the (k-1) dimensional space.
Strength and Weakness of CLIQUE
Strength:
• It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
• It is quite efficient.
• It is insensitive to the order of records in the input and does not presume some canonical data distribution.
• It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Weakness:
• The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori principle.
• Identify clusters (a sketch of the connectivity step follows below):
– Determine dense units in all subspaces of interest.
– Determine connected dense units in all subspaces of interest.
• Generate a minimal description for the clusters:
– Determine maximal regions that cover each cluster of connected dense units.
– Determine a minimal cover for each cluster.
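The "connected dense units" step amounts to finding connected components among the dense units. This is an illustrative sketch (units represented as dimension-to-interval dicts), not the paper's implementation:

```python
from collections import deque

def find_clusters(dense_units):
    """Group dense units into clusters: two units are connected if they
    span the same dimensions and share a common face, i.e. they agree on
    every dimension except one, where their intervals are adjacent."""
    def adjacent(u, v):
        if set(u) != set(v):
            return False
        diff = [d for d in u if u[d] != v[d]]
        return len(diff) == 1 and abs(u[diff[0]] - v[diff[0]]) == 1

    units, clusters, seen = list(dense_units), [], set()
    for i in range(len(units)):
        if i in seen:
            continue
        seen.add(i)
        queue, component = deque([i]), []
        while queue:  # breadth-first search over the connectivity graph
            j = queue.popleft()
            component.append(units[j])
            for k in range(len(units)):
                if k not in seen and adjacent(units[j], units[k]):
                    seen.add(k)
                    queue.append(k)
        clusters.append(component)
    return clusters

dense = [{"age": 2, "salary": 4}, {"age": 3, "salary": 4}, {"age": 7, "salary": 1}]
print(len(find_clusters(dense)))  # -> 2 clusters
```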
Global Dimensionality Reduction (GDR)
[Figure: data reduced onto its first principal component (PC).]
• Works well only when the data is globally correlated
• Otherwise too many false positives result in high query cost
• Solution: find local correlations instead of a global correlation
Local Dimensionality Reduction (LDR)
[Figure: GDR fits a single first PC to all points, while LDR fits a separate first PC to each of Cluster1 and Cluster2.]
Correlated Cluster
[Figure: a correlated cluster with its first PC (the retained dimension), its second PC (the eliminated dimension), the mean of all points in the cluster, and the centroid of the cluster (the projection of the mean on the eliminated dimension).]
A set of locally correlated points = <PCs, subspace dimensionality, centroid, points>
Reconstruction Distance
[Figure: point Q is projected on the eliminated dimension (second PC); ReconstructionDistance(Q, S) is the distance between Q and its projection, with the first PC as the retained dimension.]
Reconstruction Distance Bound
[Figure: every point of the cluster lies within MaxReconDist of the retained dimension (first PC) axis.]
ReconDist(P, S) ≤ MaxReconDist, ∀ P in S
Other constraints
• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and subspace dimensionality ≤ MaxDim
• Size bound: number of points in the cluster ≥ MinSize
Clustering Algorithm Step 1: Construct Spatial Clusters
• Choose a set of well-scattered points as centroids (piercing set) from a random sample
• Group each point P in the dataset with its closest centroid C if Dist(P, C) is within a given distance threshold
Clustering Algorithm Step 2: Choose PCs for each cluster
• Compute the principal components (PCs) of each cluster, as sketched below
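A minimal numpy sketch of this step (names are illustrative): the PCs of a cluster are the right singular vectors of its centered point matrix, ordered by decreasing variance.

```python
import numpy as np

def principal_components(cluster_points):
    """Return the PCs of one cluster as rows, plus the singular values."""
    centered = cluster_points - cluster_points.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return vt, singular_values

rng = np.random.default_rng(2)
cluster = rng.normal(size=(200, 4)) * np.array([10.0, 3.0, 0.5, 0.1])
pcs, sv = principal_components(cluster)
print(pcs[0])  # the first PC aligns with the highest-variance axis
```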
Clustering Algorithm Step 3: Compute Subspace Dimensionality
[Plot: fraction of points obeying the reconstruction bound (0–1) vs. number of dimensions retained (0–16).]
• Assign each point to the cluster that needs the minimum number of dimensions to accommodate it
• The subspace dimensionality for each cluster is the minimum number of dimensions to retain so that most points obey the reconstruction bound
Clustering Algorithm Step 4: Recluster points
• Assign each point P to the cluster S such that ReconDist(P, S) ≤ MaxReconDist
• If multiple such clusters exist, assign P to the first one (this overcomes the "splitting" problem, which otherwise leaves empty clusters)
Clustering Algorithm Step 5: Map points
• Eliminate small clusters
• Map each point to its cluster's subspace (also store the reconstruction distance)
Clustering Algorithm Step 6: Iterate
• Iterate for more clusters as long as new clusters are being found among the outliers
• Overall complexity: 3 passes, O(N·D²·K)
Experiments (Part 1)
• Precision Experiments:
– Compare information loss in GDR and LDR for the same reduced dimensionality
– Precision = |Original Space Result| / |Reduced Space Result| (for range queries)
– Note: precision measures efficiency, not answer quality
Datasets
• Synthetic dataset:
– 64-d data, 100,000 points; the generator creates clusters in different subspaces (cluster sizes and subspace dimensionalities follow a Zipf distribution); contains noise
• Real dataset:
– 64-d data (8×8 color histograms extracted from 70,000 images in the Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures
Precision Experiments (1)
[Charts: precision (0–1) of LDR vs. GDR as a function of skew in cluster size (0–2) and of the number of clusters (1–10).]
Precision Experiments (2)
[Charts: precision (0–1) of LDR vs. GDR as a function of the degree of correlation (0–0.2) and of the reduced dimensionality (7–42).]
Index structure
Root containing pointers to the root of each cluster index (the root also stores the PCs and subspace dimensionality of each cluster).
[Figure: the root points to an index on Cluster 1 through an index on Cluster K, plus a set of outliers with no index, handled by sequential scan.]
Properties: (1) disk-based, (2) height ≤ 1 + height(original space index), (3) almost balanced
Cluster Indices
• For each cluster S, build a multidimensional index on a (d+1)-dimensional space instead of the d-dimensional space:
– NewImage(P, S)[j] = projection of P along the jth PC, for 1 ≤ j ≤ d
– NewImage(P, S)[d+1] = ReconDist(P, S)
• Better estimate: D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))
• Correctness (Lower Bounding Lemma): D(NewImage(P,S), NewImage(Q,S)) ≤ D(P, Q)
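A minimal numpy sketch of this mapping under the definitions above (function and variable names are hypothetical): the first d coordinates of the new image are the projections on the retained PCs, and the (d+1)th is the reconstruction distance.

```python
import numpy as np

def new_image(P, pcs, centroid, d):
    """Map point P into the (d+1)-dimensional cluster index space."""
    diff = P - centroid
    proj = pcs[:d] @ diff                 # projections along the d retained PCs
    residual = diff - pcs[:d].T @ proj    # component lying in the eliminated dims
    return np.append(proj, np.linalg.norm(residual))  # last coord = ReconDist

# Toy cluster in 3-d, retaining d = 2 dimensions.
rng = np.random.default_rng(1)
points = rng.normal(size=(50, 3)) * np.array([5.0, 2.0, 0.1])
centroid = points.mean(axis=0)
_, _, vt = np.linalg.svd(points - centroid)  # rows of vt are the PCs
print(new_image(points[0], vt, centroid, d=2))
```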
Effect of the Extra Dimension
[Chart: I/O cost (number of random disk accesses, 0–1000) vs. reduced dimensionality (12–34) for the d-dimensional and (d+1)-dimensional indices.]
Outlier Index
• Retain all dimensions
• May build an index; otherwise use sequential scan (we use sequential scan for our experiments)
Query Support
• Correctness:
– Query result is the same as with the original space index
• Point query, range query, k-NN query:
– similar to the algorithms in multidimensional index structures; see paper for details
• Dynamic insertions and deletions:
– see paper for details
Experiments (Part 2)
• Cost Experiments:
– Compare linear scan, Original Space Index (OSI), GDR, and LDR in terms of I/O and CPU costs. We used the hybrid tree index structure for OSI, GDR, and LDR.
• Cost formulae:
– Linear scan: I/O cost (#random accesses) = file_size/10, plus CPU cost
– OSI: I/O cost = number of index nodes visited, plus CPU cost
– GDR: I/O cost = index cost + post-processing cost (to eliminate false positives), plus CPU cost
– LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10, plus CPU cost
I/O Cost (#random disk accesses)
[Chart: number of random disk accesses (0–3000) vs. reduced dimensionality (7–60) for LDR, GDR, OSI, and linear scan.]
CPU Cost (computation time only)
[Chart: CPU cost in seconds (0–80) vs. reduced dimensionality (7–42) for LDR, GDR, OSI, and linear scan.]
Conclusion
• LDR is a powerful dimensionality reduction technique for high-dimensional data
– it reduces dimensionality with lower loss in distance information compared to GDR
– it achieves significantly lower query cost compared to linear scan, the original space index, and GDR
• LDR has applications beyond high-dimensional indexing
Motivation
An object typically has dozens of attributes, and the domain of each attribute can be large.
Many algorithms require the user to specify the subspace for cluster analysis.
User identification of subspaces is quite error-prone.
The Contribution of CLIQUE
CLIQUE automatically finds subspaces with high-density clusters in a high-dimensional attribute space.
Background
A = {A1, A2, …, Ad} is the set of dimensions of S: S = A1 × A2 × … × Ad
Units: partition every dimension into ξ intervals of equal length.
A unit u is {u1, u2, …, ud}, where ui = [li, hi).
Background (Cont.)
Selectivity: the fraction of total data points contained in the unit.
Dense unit: selectivity(u) > τ (the density threshold).
Cluster: a maximal set of connected dense units.
Example
Background (Cont.)
Region: an axis-parallel rectangular set.
R is contained in C if R ∩ C = R.
Maximal region: no proper superset of R is contained in C.
Minimal description: a non-redundant covering of the cluster with maximal regions.
Example
((30 ≤ age < 50) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
CLIQUE Algorithm
1. Identification of dense units
2. Identification of clusters.
3. Generation of minimal description
Identification of dense units
Bottom-up algorithm, like the Apriori algorithm.
Monotonicity: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k−1)-dimensional projection of this space.
Algorithm
1. Determine the 1-dimensional dense units
2. k = 2
3. Generate candidate k-dimensional units from the (k−1)-dimensional dense units
4. If the candidate set is not empty: find the dense units among the candidates, set k = k + 1, and go to step 3
Algorithm – Candidate Generation
Self-joining Dk-1:
insert into Ck
select u1.[l1, h1), u1.[l2, h2), …, u1.[lk-1, hk-1), u2.[lk-1, hk-1)
from Dk-1 u1, Dk-1 u2
where u1.a1 = u2.a1, u1.l1 = u2.l1, u1.h1 = u2.h1,
      u1.a2 = u2.a2, u1.l2 = u2.l2, u1.h2 = u2.h2, …,
      u1.ak-2 = u2.ak-2, u1.lk-2 = u2.lk-2, u1.hk-2 = u2.hk-2,
      u1.ak-1 < u2.ak-1
Pruning: discard those candidate units that have a (k−1)-dimensional projection that is not dense.
Prune subspaces
Objective: use only the dense units that lie in "interesting" subspaces.
MDL principle: encode the input data under a given model and select the encoding that minimizes the code length.
Prune subspaces (Cont.)
Group together the dense units that lie in the same subspace, and compute the coverage of each subspace Sj (the number of points covered by its dense units):
x_Sj = Σ_{u_i ∈ Sj} count(u_i)
Sort the subspaces in descending order of their coverage, then minimize the total length of the encoding over the cut point i (selected set I = the i highest-coverage subspaces, pruned set P = the rest):
CL(i) = log2(μ_I(i)) + Σ_{1 ≤ j ≤ i} log2(|x_Sj − μ_I(i)|) + log2(μ_P(i)) + Σ_{i < j ≤ n} log2(|x_Sj − μ_P(i)|)
where μ_I(i) and μ_P(i) are the mean coverages of the selected and pruned sets. A sketch of this cut-point selection follows below.
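A toy sketch of selecting the cut point i that minimizes CL(i) (the coverage values are invented, and the guards against log2 of zero are an implementation convenience, not part of the paper):

```python
import math

def code_length(coverages, i):
    """CL(i) for cutting the coverage-sorted subspace list after position i:
    selected set I = first i subspaces, pruned set P = the rest."""
    def bits(group):
        if not group:
            return 0.0
        mu = sum(group) / len(group)
        return (math.log2(max(mu, 1.0))
                + sum(math.log2(max(abs(x - mu), 1.0)) for x in group))
    return bits(coverages[:i]) + bits(coverages[i:])

coverages = sorted([9500, 9100, 8800, 1200, 900, 300], reverse=True)
best = min(range(1, len(coverages)), key=lambda i: code_length(coverages, i))
print("keep the", best, "highest-coverage subspaces")
```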
Prune subspaces (Cont.)
Partitioning of the subspaces into selected and pruned sets
Finding Clusters
Generating minimal cluster descriptions
R is a cover of C.
Finding an optimal cover is NP-hard.
Solution to the problem: greedily cover the cluster with a number of maximal regions, then discard the redundant regions.
Greedy growth
1) Begin with an arbitrary dense unit u ∈ C
2) Greedily grow a maximal region covering u, and add it to R
3) Repeat 2) until all u ∈ C are covered by some maximal region in R
Minimal Cover
Remove from the cover the smallest maximal region which is redundant.
Repeat the procedure until no maximal region can be removed.
Performance Experiments
Comparison with BIRCH, DBSCAN
The authors conclude that CLIQUE performs better than BIRCH and DBSCAN.
Real data experimental results
Datasets: insurance industry (Insur1, Insur2), department store (Store), bank (Bank).
In all cases, meaningful clusters embedded in lower-dimensional subspaces were discovered.
Strength
• Automatically finds clusters in subspaces
• Insensitive to the order of records
• Does not presume some canonical data distribution
• Scales linearly with the size of the input
• Tolerant of missing values
Weakness
• Depends on parameters that are hard to pre-select: ξ (the number of partitions per dimension) and τ (the density threshold)
• Some potential clusters may be lost in the dense-unit pruning procedure, so the correctness of the algorithm degrades
What or who is STING?
• A singer who was the lead singer of the band The Police and then took up a solo career and won many Grammys.
• The bite of a scorpion.
• A Statistical Information Grid Approach to Spatial Data Mining.
• All of the above.
What is Spatial Data?
There are many definitions, specific to particular areas.
According to GIS, spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, pollution discharge sites – in short, anything that can be mapped.
Geographic features are stored as a series of coordinate values. Each point along a road or other feature is defined by a positional coordinate value, such as longitude and latitude.
The GIS stores and manages the data not as a map but as a series of layers or, as they are sometimes called, themes.
When viewed in a GIS, these layers visually appear as one graphic, but they are actually still independent of each other. This allows changes to specific themes without affecting the others.
Discussion Question 1: So, can you define spatial data generically?
• Spatial database systems aim at storing, retrieving, manipulating, querying, and analyzing geometric data.
• Special data types are necessary to model geometry and to suitably represent geometric data in database systems. These data types are usually called spatial data types, such as point, line, and region, but they also include more complex types like partitions and graphs (networks).
• Understanding data types is a prerequisite for an effective construction of the important components of a spatial database system (like spatial index structures, optimizers for spatial data, spatial query languages, storage management, and graphical user interfaces) and for cooperation with extensible DBMSs providing spatial type extension packages (like spatial data blades and cartridges).
• An excellent tutorial on spatial data and data types is available at: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
What are Spatial Databases?
[Figure: different grid levels during query processing.]
Spatial Data Resources
• Pennsylvania Spatial Data Access: http://www.pasda.psu.edu/
• The Missouri Spatial Data Information Service: http://msdis.missouri.edu/
• National Spatial Data Infrastructure: http://www.fgdc.gov/nsdi/nsdi.html
• Michigan Department of Natural Resources Online: www.dnr.state.mi.us/spatialdatalibrary/
• Georgia Spatial Data Infrastructure Home Page: www.gis.state.ga.us/
• Free GIS Data – GIS Data Depot: www.gisdatadepot.com
Spatial Data Mining
• Discovery of interesting characteristics and patterns that may implicitly exist in spatial databases.
• Huge amounts of data, specialized in nature.
• Clustering and region-oriented queries are common problems in this domain.
• We generally deal with high-dimensional data.
• Applications: GIS, medical imaging, etc.
Problems?
• Huge amount of data, specialized in nature
• Complexity
• Defining geometric patterns and region-oriented queries
• The conceptual nature of the problem
• Spatial data access
STING – An Introduction
• STING is a grid-based method to efficiently process many common region-oriented queries on a set of points.
• What defines a region? You tell me! Essentially it is a set of points satisfying some criterion.
• It is a hierarchical method. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries can be answered without referring to the individual objects.
• The complexity is hence even less than O(n) – in fact, what do you think it will be?
• Link to the paper: http://citeseer.nj.nec.com/wang97sting.html
Related Work
[Diagram: spatial data mining splits into generalization-based knowledge discovery (spatial-data-dominant and non-spatial-data-dominant) and clustering-based methods (CLARANS, BIRCH, DBSCAN).]
A great comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
Generalization-Based Approaches
• Two types: spatial-data-dominant and non-spatial-data-dominant.
• Both of these require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically.
• The quality of the mined data depends on the structure of the hierarchy.
• Computational complexity: O(n log n).
• So the onus shifted to developing algorithms which discover characteristics directly from the data. This was the motivation to move to clustering algorithms.
Clustering-Based Approaches
• BIRCH: already covered. Remember it? Complexity?
• The problem with BIRCH is that it does not work well with clusters that are not spherical.
• DBSCAN: already covered. Remember it? Complexity?
• Determining the global parameter Eps in DBSCAN requires human participation.
• When the point set to be clustered is the response set of objects with some qualifications, then Eps must be determined each time, and the cost is hence higher.
Clustering-Based Approaches
• CLARANS: Clustering Large Applications based upon RANdomized Search.
• Although claims have been made that it is linear, it is essentially quadratic.
• The computational complexity is at least Ω(KN²), where N is the number of data points and K is the number of clusters.
• The quality of the results cannot be guaranteed when N is large, as randomized search is used.
• See: Optimization with Randomized Search Heuristics – The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions.
Related Work
All the approaches described in the previous slides are query-dependent approaches.
The structure of the queries influences the structure of the algorithm and cannot be generalized to all queries.
As they scan all the data points, the complexity will be at least O(N).
STING: THE OVERVIEW
• The spatial area is divided into rectangular cells.
• There are different levels of cells corresponding to different resolutions, and these cells form a hierarchical structure.
• Each cell at a higher level is partitioned into a number of cells of the next lower level.
• Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
GRID CELL HIERARCHY
• Each cell at the (i−1)th level has 4 children at the ith level (this can be changed).
• The size of a leaf cell depends on the density of objects; generally it should contain from several dozen to several thousand objects.
• For each cell we have attribute-dependent and attribute-independent parameters.
• The attribute-independent parameter is the number of objects in the cell, n.
• For the attribute-dependent parameters, it is assumed that each object's attributes have numerical values.
• For each numerical attribute we have the following five parameters:
• m – mean of all values in this cell
• s – standard deviation of all values in this cell
• min – the minimum value of the attribute in this cell
• max – the maximum value of the attribute in this cell
• distribution – the type of distribution this cell follows (an enumeration type)
Parameter Generation
The determination of the dist parameter is as follows. First, dist is set to the distribution type followed by most points. An estimate is then made of the number of conflicting points, confl, according to the following rules:
1) If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by the amount ni.
2) If disti ≠ dist, and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
3) If disti = dist, mi ≈ m and si ≈ s, then confl is not changed.
4) If disti = dist, and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
Finally, if confl/n is greater than a threshold (say 0.05), then dist is set to NONE; otherwise the original dist is retained.
Parameter Generation (example)

i        1       2       3       4
ni       100     50      60      10
mi       20.1    19.7    21      20.5
si       2.3     2.2     2.4     2.1
mini     4.5     5.5     3.8     7
maxi     36      34      37      40
disti    Normal  Normal  Normal  None

The parameters of the current (parent) cell are:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
This is so because there are 220 data points, of which only 10 do not follow NORMAL, so confl/n = 10/220 = 0.045 < 0.05 and dist remains NORMAL. A sketch of this aggregation follows below.
The parameters are calculated only once, so the overall compilation time is O(N). Querying requires much less time, since we only scan the K grid cells, i.e., O(K).
Query Types
• If the hierarchical structure cannot answer a query, we can go to the underlying database.
• An SQL-like language is used to describe queries.
• Two types of common queries are found: one finds regions satisfying certain constraints, and the other takes in a region and returns some attribute of the region.
Query Type: Examples
Algorithm
• Top-down querying: examine the cells at a higher level and determine whether each cell is relevant to the query at some confidence level. This likelihood can be defined as the proportion of objects in the cell that satisfy the query conditions. After obtaining the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
• After doing so for the present layer, the process is repeated for the children cells of the RELEVANT cells of the present layer only.
• The procedure continues down to the bottom-most layer.
• Find the region formed by the relevant cells and return it. If the result is not satisfactory, retrieve the data that fall into the relevant cells from the database and do some further processing.
• After all cells are labeled as relevant or not relevant, we can easily find all regions that satisfy the specified density by breadth-first search.
• For a relevant cell, we examine the cells within a certain distance d from the center of the current cell to see whether the average density within this small area is greater than the specified density.
• If yes, the cells are put into a queue. Steps 2 and 3 are repeated for all the cells in the queue, except that previously examined cells are omitted.
• When the queue is empty, we have obtained one region.
Algorithm
The distance is d = max(l, √(f/(cπ))), where l, c, f are the side length of a bottom-layer cell, the specified density, and a small constant number set by STING (it does not vary from query to query).
l is usually the dominant term, so we generally only have to examine the neighborhood. Only if the granularity is very small do we need to examine every cell at that distance rather than just the neighborhood.
Algorithm
Example
Given data: houses, where one of the attributes is price.
Query: "Find those regions with area at least A where the number of houses per unit area is at least c and at least β% of the houses have price between a and b, with (1 − α) confidence", where a < b. Here, a could be −∞ and b could be +∞.
We begin from the top level, working our way down, and assume the distribution type is NORMAL. First we calculate the proportion of houses whose price lies in [a, b]; the probability that the price lies between a and b is computed from the normal distribution, where m and s are the mean and standard deviation of all prices.
Example
Since we take the prices to be i.i.d. with mean m and standard deviation s, the number of houses with price in [a, b] has a binomial distribution with parameters n and p, where n is the number of houses. We consider the following cases according to n, np and n(1 − p):
a) n ≤ 30: the binomial distribution is used to determine the confidence interval of the number of houses whose prices fall into [a, b]; dividing by n gives the confidence interval for the proportion.
b) n > 30, np ≥ 5, and n(1 − p) ≥ 5: the proportion of prices falling in [a, b] has (approximately) a normal distribution, and the 100(1 − α)% confidence interval of the proportion is computed from it.
c) n > 30 but np < 5: the Poisson distribution with parameter np is used for the approximation.
d) n > 30 but n(1 − p) < 5: we can calculate the proportion of houses X whose price is not in [a, b] using the Poisson distribution with parameter n(1 − p); then 1 − X is the proportion of houses whose price is in [a, b].
Example
Once we have the confidence interval or the estimated range [p1, p2], we can label this cell as relevant or not relevant.
Let S be the area of the cells at the bottom layer. If p1 × n < S × c × β%, we can label the cell as not relevant; otherwise, as relevant.
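A minimal sketch of the normal-approximation case (case b above); the cell statistics and price range are invented for illustration:

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def proportion_in_range(m, s, a, b):
    """P(a <= price <= b) for a cell whose prices follow Normal(m, s)."""
    return normal_cdf((b - m) / s) - normal_cdf((a - m) / s)

def confidence_interval(n, p, z=1.645):
    """Normal-approximation CI for the proportion (z = 1.645 for 90%);
    valid when n > 30, n*p >= 5 and n*(1-p) >= 5."""
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(p - half, 0.0), min(p + half, 1.0)

p = proportion_in_range(m=200_000, s=50_000, a=150_000, b=300_000)
p1, p2 = confidence_interval(n=220, p=p)
print(f"p = {p:.3f}, 90% CI = [{p1:.3f}, {p2:.3f}]")
```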
Analysis of STING
• Step one takes constant time. For steps 2 and 3, the total time is proportional to the total number of cells in the hierarchy.
• The total number of cells is 1.33K, where K is the number of cells at the bottom layer.
• In all cases the complexity is found (or claimed) to be O(K).
• Discussion question: what is the complexity if we need to go to step 7 of the algorithm?
Quality
STING, under the following sufficient condition, guarantees that if a region satisfies the specification of the query then it is returned:
Let F be a region. The width of F is defined as the side length of the maximum square that can fit in F.
Limiting Behavior of STING
The regions returned by STING are an approximation of the result of DBSCAN. As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN.
So the worst-case complexity is O(n log n)!
Performance measure
Case A (Normal distribution): the example query is answered in 0.2 s; structure generation takes 9.8 s.
Case B (None distribution): the example query is answered in 0.22 s; structure generation takes 9.7 s.
Performance measure
• A benchmark called SEQUOIA 2000 was used to compare STING, DBSCAN, and CLARANS.
• All the previous algorithms have three phases in query answering:
1. Find the query response
2. Build an auxiliary structure
3. Do the clustering
• STING does all of this in one step, so it is inherently better.
Discussion Question
“STING is trivially parallelizable.” Comment why and what is the importance of this statement?
References
• STING: A Statistical Information Grid Approach to Spatial Data Mining. Wei Wang et al.
• Optimization with Randomized Search Heuristics – The (A)NFL Theorem, Realistic Scenarios, and Difficult Functions. Stefan Droste et al.
• Efficient and Effective Clustering Methods for Spatial Data Mining. R. Ng et al.
• BIRCH: An Efficient Data Clustering Method for Very Large Databases. T. Zhang et al.
• Tutorial on spatial data types: http://www.informatik.fernuni-hagen.de/import/pi4/schneider/abstracts/TutorialSDT.html
• An Efficient Approach to Clustering in Large Multimedia Databases with Noise. A. Hinneburg et al.
• Comparison of clustering algorithms: http://www.cs.ualberta.ca/~joerg/papers/KI-Journal.pdf
Motivation
• All previous clustering algorithms are query-dependent.
• They are built for one query and are generally of no use for other queries.
• They need a separate scan for each query, so the computation is complex: at least O(n).
• So we need a structure built from the database so that various queries can be answered without rescanning.
Basics
• Grid-based method: quantizes the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
• Develops a hierarchical structure from the given data and answers various queries efficiently.
• Every level of the hierarchy consists of cells.
• Answering a query is not O(n), where n is the number of elements in the database.
A hierarchical structure for STING clustering
(continued)
• The root of the hierarchy is at level 1.
• A cell at level i corresponds to the union of the areas of its children at level i + 1.
• A cell at a higher level is partitioned to form a number of cells of the next lower level.
• Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
Cell parameters
Attribute-independent parameter:
• n – number of objects (points) in this cell
Attribute-dependent parameters:
• m – mean of all values in this cell
• s – standard deviation of all values of the attribute in this cell
• min – the minimum value of the attribute in this cell
• max – the maximum value of the attribute in this cell
• distribution – the type of distribution that the attribute values in this cell follow
Parameter Generation
• n, m, s, min, and max of bottom-level cells are calculated directly from the data.
• The distribution can either be assigned by the user or be obtained by hypothesis tests, e.g., the χ² test.
• The parameters of higher-level cells are calculated from the parameters of the lower-level cells.
(continued)
• Let n, m, s, min, max, dist be the parameters of the current cell.
• Let ni, mi, si, mini, maxi and disti be the parameters of the corresponding lower-level cells.
dist for the Parent Cell
Set dist to the distribution type followed by most points in this cell, then check for conflicting points in the child cells; call their number confl:
1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is increased by an amount of ni.
2. If disti ≠ dist, and either mi ≈ m or si ≈ s is not satisfied, then set confl to n.
3. If disti = dist, mi ≈ m and si ≈ s, then confl is increased by 0.
4. If disti = dist, and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
(continued)
If confl/n is greater than a threshold t, set dist to NONE; otherwise keep the original type.
Example: the parameters for the parent cell would be
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
There are 210 points whose distribution type is NORMAL, so we set the dist of the parent to NORMAL; then confl = 10, confl/n = 10/220 = 0.045 < 0.05, so we keep the original type.
Query types
• The STING structure is capable of answering various queries.
• If it cannot, we always have the underlying database.
• Even if the statistical information is not sufficient to answer a query, we can still generate a possible set of answers.
Common queries
Select regions that satisfy certain conditions, e.g.: select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence.

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
(continued)
Select regions and return some function of the region, e.g.: select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California.

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
Algorithm
• With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries.
• For any query, we begin by examining the cells of a high-level layer.
• We calculate the likelihood that each cell is relevant to the query at some confidence level, using the parameters of the cell.
• If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.
(continued)
• After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
• We proceed to the next layer, but only consider the children of the relevant cells of the upper layer.
• We repeat this until we reach the final layer.
• The relevant cells of the final layer have enough statistical information to give a satisfactory result to the query.
• However, for more accurate mining we may retrieve the data corresponding to the relevant cells and process them further.
Finding regions
After we have obtained all the relevant cells at the final level, we need to output the regions that satisfy the query.
We can do this using breadth-first search.
Breadth-First Search
• We examine the cells within a certain distance from the center of the current cell.
• If the average density within this small area is greater than the density specified, we mark this area.
• We put the relevant cells just examined into a queue.
• We take an element from the queue and repeat the same procedure, except that only those relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region. A minimal sketch follows below.
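A minimal sketch of this region-growing BFS over bottom-layer cells (grid coordinates and the neighborhood function are illustrative; the density test is folded into membership in the relevant set):

```python
from collections import deque

def neighbors(cell):
    r, c = cell
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]

def find_regions(relevant_cells):
    """Group relevant bottom-layer cells (row, col) into connected regions."""
    regions, seen = [], set()
    for start in relevant_cells:
        if start in seen:
            continue
        seen.add(start)
        region, queue = [], deque([start])
        while queue:
            cell = queue.popleft()
            region.append(cell)
            for nb in neighbors(cell):          # cells within distance 1
                if nb in relevant_cells and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(region)                  # queue empty: one region done
    return regions

cells = {(0, 0), (0, 1), (1, 1), (5, 5)}
print(len(find_regions(cells)))  # -> 2 regions
```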
Statistical Information Grid-based Algorithm
1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy structure by one level. Go to Step 2 for those cells that are children of the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirement of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query. Go to Step 9.
9. Stop.
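A minimal sketch of the top-down traversal in steps 2–5 (the cell representation is illustrative, and relevance is reduced to a single predicate standing in for the confidence-interval test):

```python
def sting_query(cell, is_relevant, results):
    """Label a cell; descend only into children of relevant cells;
    collect the relevant bottom-layer cells."""
    if not is_relevant(cell):      # steps 2-3: labelled not relevant -> pruned
        return
    if not cell["children"]:       # step 4: bottom layer reached
        results.append(cell)
        return
    for child in cell["children"]: # step 5: go down one level
        sting_query(child, is_relevant, results)

leaf = lambda n: {"n": n, "children": []}
root = {"n": 140, "children": [
    {"n": 120, "children": [leaf(70), leaf(50)]},
    {"n": 20,  "children": [leaf(10), leaf(10)]},  # pruned subtree: never visited
]}
results = []
sting_query(root, lambda c: c["n"] >= 40, results)
print([c["n"] for c in results])  # -> [70, 50]
```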
Time Analysis
• Step 1 takes constant time. Steps 2 and 3 require constant time per cell.
• The total time is less than or equal to the total number of cells in our hierarchical structure.
• Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer.
• So the overall computational complexity on the grid hierarchy structure is O(K).
Time Analysis
• STING goes through the database once to compute the statistical parameters of the cells.
• The time complexity of generating clusters is O(n), where n is the total number of objects.
• After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
Comparison
Definitions That Need to Be Known
Spatial Data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.
Spatial Area: the area that encompasses the locations of all the spatial data is called the spatial area.
STING (Introduction)
• STING is used for performing clustering on spatial data.
• STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
• STING's big benefit is that it processes many common "region-oriented" queries on a set of points efficiently.
• We want to cluster the records in a spatial table in terms of location.
• The placement of a record in a grid cell is completely determined by its physical location.
Hierarchical Structure of Each Grid Cell
• The spatial area is divided into rectangular cells (using latitude and longitude).
• Each cell forms a hierarchical structure: each cell at a higher level is further partitioned into 4 smaller cells at the lower level.
• In other words, each cell at the ith level (except the leaves) has 4 children at the (i+1)th level.
• The union of the 4 children cells gives back the parent cell in the level above them.
Hierarchical Structure of Cells (Cont.)
The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.
So, Why do we have a hierarchical structure for cells?
We have them in order to provide a better granularity, or higher resolution.
A Hierarchical Structure for STING Clustering
Statistical Parameters Stored in Each Cell
For each cell in each layer we have attribute-dependent and attribute-independent parameters.
Attribute-independent parameter:
• Count: the number of records in this cell.
Attribute-dependent parameters (we assume the attribute values are real numbers):
Statistical Parameters (Cont.)
For each attribute of each cell we store the following parameters:
• M – mean of all values of the attribute in this cell.
• S – standard deviation of all values of the attribute in this cell.
• Min – the minimum value of the attribute in this cell.
• Max – the maximum value of the attribute in this cell.
• Distribution – the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.). "None" is assigned if the distribution is unknown.
Storing of Statistical Parameters
• Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
• The statistical parameters for the cells in the lowest layer are computed directly from the values present in the table.
• The statistical parameters for the cells in all other levels are computed from their respective children cells in the level below.
How are Queries Processed?
• STING can answer many queries (especially region queries) efficiently, because we do not have to access the full database.
• How are spatial data queries processed?
• We use a top-down approach to answer spatial data queries.
• Start from a pre-selected layer, typically with a small number of cells. The pre-selected layer does not have to be the topmost layer.
• For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
Query Processing (Cont.)
The confidence interval is calculated by using the statistical parameters of each cell.
Remove irrelevant cells from further consideration.
When finished with the current layer, proceed to the next lower level.
Processing of the next lower level examines only the remaining relevant cells.
Repeat this process until the bottom layer is reached.
Sample Query Example
• Assume that the spatial area is the map of the regions of Long Island, Brooklyn, and Queens.
• Our records represent apartments that are present throughout the above region.
• Query: "Find all the apartments for rent near Stony Brook University that have a rent in the range $800 to $1000."
• The query depends on the parameter "near." For our example, near means within 15 miles of Stony Brook University.