
Discovering Interesting Regions in Spatial Data Sets

Christoph F. Eick for Data Mining Class

1. Motivation: Examples of Region Discovery

2. Region Discovery Framework

3. A Fitness Function for Hot Spot Discovery

4. Other Fitness Functions

5. A Family of Clustering Algorithms for Region Discovery

6. Summary

Next 2-3 Classes

1. Region Discovery Framework

2. DBSCAN

3. Hierarchical Clustering

4. Clustering Algorithms for Region Discovery: Clever,…

5. Critical Issues with Respect to Clustering

6. Programming Project-specific Discussion

7. Similarity Assessment

Ch. Eick: Introduction Region Discovery

1. Motivation: Examples of Region Discovery

RD-Algorithm

Application 1: Hot-spot Discovery [this presentation, EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with respect to a Continuous Variable
Application 4: Regional Co-location Mining [EPWSN07]
Application 5: Find “representative” regions (Sampling)

Figure: Wells in Texas. Green: safe well with respect to arsenic; red: unsafe well. Panels: β=1.01 and β=1.04.


2. Region Discovery Framework

• We assume we have spatial or spatio-temporal datasets that have the following structure: (x,y,[z],[t];<non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable)

• Clustering occurs in the (x,y,[z],[t])-space; regions are found in this space.

• The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself.

• For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.


Region Discovery Framework Continued

The algorithms we currently investigate solve the following problem:

Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates clustering X={c1,…,ck} as follows:

$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \times \mathrm{size}(c)^{\beta}$ with $\beta > 1$

Objective: Find c1,…,ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X={c1,…,ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous (each pair of objects belonging to ci has to be Delaunay-connected with respect to ci and to d)
4. c1 ∪ … ∪ ck ⊆ O
5. c1,…,ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
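The additive fitness function defined on this slide can be sketched in a few lines of Python. This is only an illustration: the names `fitness`, `interestingness` and `beta` are chosen here, and a cluster is assumed to be any sequence of objects whose length is its size.

```python
from typing import Callable, Sequence

Cluster = Sequence[object]


def fitness(clusters: Sequence[Cluster],
            interestingness: Callable[[Cluster], float],
            beta: float = 1.1) -> float:
    """q(X): sum over all clusters of interestingness(c) * size(c)**beta."""
    return sum(interestingness(c) * len(c) ** beta for c in clusters)


# With a constant interestingness, an exponent above 1 favors one large
# cluster over two smaller clusters that cover the same objects.
one_big = fitness([[1, 2, 3, 4]], lambda c: 1.0)
two_small = fitness([[1, 2], [3, 4]], lambda c: 1.0)
print(one_big > two_small)
```

Because the exponent is larger than 1, merging two equally interesting clusters increases q(X); the size of the exponent controls how strongly larger regions are preferred.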


Challenges for Region Discovery

1. Recall and precision with respect to the discovered regions should be high

2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”

3. Detection of regions at different levels of granularities (from very local to almost global patterns)

4. Detection of regions of arbitrary shapes

5. Necessity to cope with very large datasets

6. Regions should be properly ranked by relevance (reward); in many applications only the top-k regions are of interest

7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.


3. Fitness Function for Hot Spot Discovery

Class of Interest: Unsafe_Well
Prior Probability: 20%
γ1 = 0.5, γ2 = 1.5; R+ = 1, R- = 1; β = 1.1, η = 1
Thresholds: 10% (prior × γ1) and 30% (prior × γ2)

              Cluster 1      Cluster 2      Cluster 3       Cluster 4         Cluster 5
|c|           50             200            200             350               200
P(c, Unsafe)  20/50 = 40%    40/200 = 20%   10/200 = 5%     30/350 = 8.6%     100/200 = 50%
Reward        1/7 × 50^1.1   0              1/2 × 200^1.1   0.143 × 350^1.1   2/7 × 200^1.1
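The reward entries on this slide appear consistent with a piecewise-linear interestingness function that is 0 between the two thresholds prior × γ1 and prior × γ2 and grows toward 1 outside of them. The slide does not state that formula, so the sketch below is an assumption reconstructed from the numbers; the function and parameter names (including `eta` for the exponent) are chosen here.

```python
def hotspot_interestingness(p, prior, gamma1=0.5, gamma2=1.5,
                            r_plus=1.0, r_minus=1.0, eta=1.0):
    """Assumed form: 0 inside [prior*gamma1, prior*gamma2], scaled outside."""
    low = prior * gamma1    # below this, the class is unusually rare
    high = prior * gamma2   # above this, the class is unusually frequent
    if p < low:
        return ((low - p) / low) ** eta * r_minus
    if p > high:
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0


def hotspot_reward(size, p, prior, beta=1.1, **params):
    """reward(c) = interestingness(c) * size(c)**beta."""
    return hotspot_interestingness(p, prior, **params) * size ** beta


# (cluster size, number of unsafe wells) for clusters 1 to 5 of the slide
clusters = [(50, 20), (200, 40), (200, 10), (350, 30), (200, 100)]
for number, (size, unsafe) in enumerate(clusters, start=1):
    p = unsafe / size
    print(number, round(hotspot_interestingness(p, prior=0.2), 3),
          round(hotspot_reward(size, p, prior=0.2), 2))
```

Under this assumption, cluster 2 sits exactly at the prior and receives no reward, while clusters with either very few or very many unsafe wells are both rewarded.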


4. Fitness Functions for Other Region Discovery Tasks

4.1 Creating Contour Maps for Water Temperature (Temp)

1. Examples in the data set WT have the form: (x,y,temp); var(c,temp) denotes the variance of variable temp in region c

2. interestingness(c) =

IF var(c,temp) > var(WT,temp)

THEN 0

ELSE min(1, log20(var(WT,temp)/var(c,temp)))**η

with η being a parameter (with default 1)

3. Basically, regions receive rewards if their variance is lower than the variance of the variable temperature for the whole data set, and regions whose variance is at least 20 times less receive the maximum reward of 1.

Fig. 1: Sea Surface Temperature on July 7, 2002. A single region and its summary: Mean=11.2, Var=2.2, Reward: 48,5, Rank: 3


4.2 Finding Regions with High Water Temperature Differences

1. Examples in the data set WT have the form: (x,y,Temp)

2. Fitness function: Let c be a cluster to be evaluated

interestingness(c) =

IF var(c,temp) < var(WT,temp)

THEN 0

ELSE min(1, log20(var(c,temp)/var(WT,temp)))**η

with η being a parameter (with default 1)


4.3 Programming Project Fitness Functions: Purity

Example: three regions, each annotated with its class counts (class 1, class 2, class 3):

r1: (6, 2, 2)
r2: (0, 0, 5)
r3: (2, 2, 1)

We assume we have 3 classes; in r1 we have 6 objects of class 1, 2 objects of class 2, and 2 objects of class 3.

We assume th=0.5 and η=2

i(r1) = (0.6-0.5)**2 = 0.01
i(r2) = (1-0.5)**2 = 0.25
i(r3) = 0

q(X) = q({r1,r2,r3}) = 0.01*10 + 0.25*5 = 1.35
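The purity example can be checked with a short Python sketch. It assumes that the interestingness of a region is (purity - th) raised to an exponent when the purity exceeds the threshold th, and 0 otherwise, which matches the three values on the slide; the function name and argument names are chosen here.

```python
def purity_interestingness(class_counts, th=0.5, eta=2.0):
    """(purity - th)**eta if the majority-class share exceeds th, else 0."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty region is treated as uninteresting
    purity = max(class_counts) / total
    return (purity - th) ** eta if purity > th else 0.0


regions = {"r1": (6, 2, 2), "r2": (0, 0, 5), "r3": (2, 2, 1)}

# The slide multiplies each interestingness by the plain region size,
# so the size exponent is effectively 1 in this example.
q = sum(purity_interestingness(counts) * sum(counts)
        for counts in regions.values())
print(round(q, 4))
```

Region r3 has a majority share of only 0.4, which is below the threshold, so it contributes nothing to q(X).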


Programming Project Fitness Functions: Variance

Dataset O: Var(O)=100
r1: Var(r1)=80
r2: Var(r2)=200
r3: Var(r3)=1100
r4: Var(r4)=20

We assume η=1 and b=10

i(r1) = 0
i(r2) = log10(2) = 0.3010
i(r3) = 1
i(r4) = 0
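The four values above follow the pattern of the high-variance fitness function of Section 4.2 with logarithm base b=10. The sketch below assumes that form, with the exponent applied after the cap at 1; the function and parameter names are chosen here, and the dataset variance is assumed to be positive.

```python
import math


def variance_interestingness(var_region, var_dataset, b=10.0, eta=1.0):
    """0 if the region varies less than the dataset, else a capped log ratio."""
    if var_region < var_dataset:
        return 0.0
    return min(1.0, math.log(var_region / var_dataset, b)) ** eta


var_o = 100.0
for name, var_r in [("r1", 80.0), ("r2", 200.0), ("r3", 1100.0), ("r4", 20.0)]:
    print(name, round(variance_interestingness(var_r, var_o), 4))
```

Region r3 has eleven times the dataset variance, which exceeds the factor b=10, so its interestingness is capped at 1.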


Programming Project Function MSE

r1: (2,2) (4,4)

r2: (-1,-1) (-7,-7) (-4,-4)

MSE(r1) = (1**2+1**2+1**2+1**2)/2 = 2

MSE(r2) = (3**2+3**2+3**2+3**2+0+0)/3 = 12
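The two MSE values can be reproduced by averaging the squared distances of the points of a region to the region's centroid, which is the reading the numbers on this slide suggest. A minimal sketch, with `mse` as a name chosen here:

```python
def mse(points):
    """Mean squared distance of the points to their centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    squared = sum((p[d] - centroid[d]) ** 2
                  for p in points for d in range(dims))
    return squared / n


r1 = [(2, 2), (4, 4)]                # centroid (3, 3)
r2 = [(-1, -1), (-7, -7), (-4, -4)]  # centroid (-4, -4)
print(mse(r1), mse(r2))
```

For r1 each of the two points contributes 1**2 + 1**2, and for r2 the point (-4,-4) coincides with the centroid and contributes 0.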


4.4 Regional Co-location Mining

Task: Find co-location patterns for the following data set.

Figure: data set with regions R1, R2, R3, R4. Labels: Global Co-location (two classes are co-located in the whole dataset); Regional Co-location.

A Reward Function for Binary Co-location

Task: Find regions in which the density of 2 or more classes is elevated. In general, multipliers $\lambda_C(r)$ are computed for every region r, indicating how much the density of instances of class C is elevated in region r compared to C’s density in the whole space, and the interestingness of a region with respect to two classes C1 and C2 is assessed proportional to the product $\lambda_{C_1}(r) \cdot \lambda_{C_2}(r)$.

Example: Binary Co-Location Reward Framework

$\lambda_C(r) = p(C,r)/prior(C)$

$\delta_{C_1,C_2} = 1/(prior(C_1)+prior(C_2))$ (“maximum multiplier”)

$\kappa_{C_1,C_2}(r)$ = IF $\lambda_{C_1}(r) < 1$ or $\lambda_{C_2}(r) < 1$ THEN 0

ELSE $\sqrt{(\lambda_{C_1}(r)-1)(\lambda_{C_2}(r)-1)} \,/\, (\delta_{C_1,C_2}-1)$

interestingness(r) = $\max_{C_1,C_2;\ C_1 \neq C_2} \kappa_{C_1,C_2}(r)$
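The binary co-location reward can be sketched as follows. The code assumes that p(C,r) is the fraction of objects in region r that belong to class C and that prior(C) is the same fraction in the whole dataset; all function names are chosen here, and the guard for priors that sum to 1 is an addition of this sketch, not part of the slide.

```python
import math
from itertools import combinations


def pair_interestingness(p1, prior1, p2, prior2):
    """Co-location score of two classes in one region, between 0 and 1."""
    mult1 = p1 / prior1  # how strongly class 1 is elevated in the region
    mult2 = p2 / prior2  # how strongly class 2 is elevated in the region
    if mult1 < 1 or mult2 < 1:
        return 0.0
    if prior1 + prior2 >= 1:
        return 0.0  # guard added here: no room for joint elevation
    max_multiplier = 1.0 / (prior1 + prior2)
    return math.sqrt((mult1 - 1) * (mult2 - 1)) / (max_multiplier - 1)


def region_interestingness(p, prior):
    """Best score over all pairs of distinct classes."""
    return max(pair_interestingness(p[a], prior[a], p[b], prior[b])
               for a, b in combinations(sorted(prior), 2))


# Hypothetical example: classes A and B are both twice as dense as usual.
prior = {"A": 0.2, "B": 0.2, "C": 0.6}
p_region = {"A": 0.4, "B": 0.4, "C": 0.2}
print(round(region_interestingness(p_region, prior), 4))
```

Dividing by the maximum multiplier minus 1 normalizes the score, so a region in which both classes reach their largest possible joint elevation receives an interestingness of 1.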


The Ultimate Vision of the Presented Research

Figure: Architecture of the Region Discovery Engine. Components shown: Spatial Databases, Database Integration Tool, Data Set, Domain Expert, Measure of Interestingness Acquisition Tool, Fitness Function, Family of Clustering Algorithms, Region Discovery Display, Visualization Tools, Ranked Set of Interesting Regions and their Properties.


How to Apply the Suggested Methodology

1. With the assistance of domain experts determine structure of dataset to be used.

2. Acquire a measure of interestingness for the problem at hand (this was purity, variance, MSE, and probability elevation of two or more classes in the examples discussed before).

3. Convert the measure of interestingness into a reward-based fitness function. The designed fitness function should assign a reward of 0 to “boring” regions. It is also a good idea to normalize rewards by limiting the maximum reward to 1.

4. After the region discovery algorithm has been run, rank and visualize the top k regions with respect to the rewards obtained ($\mathrm{interestingness}(c) \times \mathrm{size}(c)^{\beta}$), together with their properties, which are usually task specific.
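Step 4 can be illustrated with a small ranking helper. This is a sketch only: regions are assumed to be given as a dictionary from a region name to its member objects, the interestingness function is passed in, and regions with a reward of 0 are dropped before ranking. The names `top_k_regions`, `k` and `beta` are chosen here.

```python
def top_k_regions(regions, interestingness, k=3, beta=1.1):
    """Return the k (name, reward) pairs with the highest positive reward."""
    scored = []
    for name, members in regions.items():
        reward = interestingness(members) * len(members) ** beta
        if reward > 0:  # "boring" regions receive no reward and are skipped
            scored.append((name, reward))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical data: members are 0/1 flags, interestingness is the share of 1s.
regions = {"a": [1] * 10, "b": [1] * 5, "c": [0] * 8}
share_of_ones = lambda members: sum(members) / len(members)
print([name for name, _ in top_k_regions(regions, share_of_ones, k=2)])
```

Reporting only the positive, highest-reward regions keeps the output focused on what the fitness function marks as interesting.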


5. A Family of Clustering Algorithms for Region Discovery

1. Supervised Partitioning Around Medoids (SPAM)
2. Representative-based Clustering Using Randomized Hill Climbing (CLEVER)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Agglomerative Hierarchical Supervised Clustering (SCAH)
5. Hierarchical Grid-based Supervised Clustering (SCHG)
6. Supervised Clustering using Multi-Resolution Grids (SCMRG)
7. Representative-based Clustering with Gabriel Graph Based Post-processing (MOSAIC)
8. Supervised Clustering using Density Estimation Techniques (SCDE)

Remark: For more details about SCEC, SPAM, SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCAH, SCHG, SCMRG


SCAH (Agglomerative Hierarchical)

Inputs:
A dataset O={o1,...,on}
A distance matrix D = {d(oi,oj) | oi,oj ∈ O}
Output:
Clustering X={c1,…,ck}

Algorithm:
1) Initialize:
   Create single object clusters: ci = {oi}, 1 ≤ i ≤ n
   Compute merge candidates based on “nearest clusters”
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X={c1,…,ck}
   c) Delete the two clusters ci and cj from X and add the cluster ci ∪ cj to X
   d) Update inter-cluster distances incrementally
   e) Update merge candidates based on inter-cluster distances
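The SCAH loop can be sketched in Python as shown below. Several simplifications are assumptions of this sketch rather than part of the slide: the distance between clusters is taken to be the single-link (minimum) distance, each cluster's only merge candidate is its nearest cluster, and distances are recomputed in every iteration instead of being updated incrementally. The fitness function `q` is passed in and receives clusters as sets of object indices.

```python
def scah(objects, dist, q):
    """Greedy agglomerative merging that stops when q(X) cannot improve."""
    clusters = [frozenset([i]) for i in range(len(objects))]

    def cluster_dist(a, b):
        # Assumption: single-link distance between two clusters.
        return min(dist(objects[i], objects[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge candidates: every cluster paired with its nearest cluster.
        candidates = set()
        for a in clusters:
            nearest = min((b for b in clusters if b != a),
                          key=lambda b: cluster_dist(a, b))
            candidates.add(frozenset([a, nearest]))
        current = q(clusters)
        best_gain, best_pair = 0.0, None
        for pair in candidates:
            a, b = tuple(pair)
            merged = [c for c in clusters if c != a and c != b] + [a | b]
            gain = q(merged) - current
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break  # no candidate merge improves q(X)
        a, b = best_pair
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
    return clusters


# Hypothetical 1-D example with a purity-based fitness (th=0.5, exponent 2).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]


def purity_q(clusters, th=0.5, eta=2.0, beta=1.1):
    total = 0.0
    for c in clusters:
        counts = {}
        for i in c:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        purity = max(counts.values()) / len(c)
        if purity > th:
            total += (purity - th) ** eta * len(c) ** beta
    return total


result = scah(points, lambda x, y: abs(x - y), purity_q)
print(sorted(sorted(c) for c in result))
```

In this example the loop merges objects of the same class and then stops, because joining the two pure clusters would reduce the purity to 0.5 and with it the fitness.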


SCHG (Hierarchical Grid-based)

Remark: Same as SCAH, but uses grid cells as initial clusters

Inputs:
A dataset O={o1,...,on}
A grid structure G
Output:
Clustering X={c1,…,ck}

Algorithm:
1) Initialize:
   Create clusters making each single non-empty grid cell a cluster
   Compute merge candidates (all pairs of neighboring grid cells)
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X={c1,…,ck}
   c) Delete the two clusters ci and cj from X and add the cluster c’ = ci ∪ cj to X
   d) Update merge candidates: ∀c ∈ X (MC(c’,c) ⇔ MC(c,ci) ∨ MC(c,cj))

Figure: example grid with cells numbered 1 to 7.


Ideas SCMRG (Divisive, Multi-Resolution Grids)

Cell Processing Strategy

1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors: return that cell.

2. If a cell and its ancestor do not receive any reward: prune

3. Otherwise, process the children of the cell (drill down)


Code SCMRG


Problems with SCAH

No look ahead:

Non-contiguous clusters:

XXX OOO OOO XXX

Too restrictive definition of merge candidates:


6. Summary

1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.

2. Evidence concerning the usefulness of the framework for hot spot discovery problems has been presented.

3. As a by-product some known and not so well known flaws of hierarchical clustering algorithms have been identified.

4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.


Why should people use Region Discovery Engines (RDE)?

RDE: finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for:

• Focused summarization
• Find interesting subsets in spatial datasets for further studies
• Identify regions with unexpected patterns; because they are unexpected they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts
• Without powerful region discovery algorithms, finding regional patterns tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary
• Exploratory data analysis for a mostly unknown dataset
• Co-location statistics are frequently blurred when arbitrary region definitions are used, hiding the true relationship of two co-occurring phenomena that become invisible by taking averages over regions in which a strong relationship is watered down by including objects that do not contribute to the relationship (example: high crime rates along the major rivers in Texas)
• Data set reduction; focused sampling
