Lecture 6: Data Mining, DT786, Semester 2, 2011-12, Pat Browne


Page 1: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Lecture 6: Data Mining
DT786, Semester 2, 2011-12
Pat Browne

Page 2

Data Mining: Outline

Spatial DM compared to spatial statistics
Background to spatial data (SD) and spatial data mining (SDM)
The DM process
Spatial autocorrelation, i.e. the non-independence of phenomena in a contiguous geographic area
Spatial independence
Classical data mining concepts: classification, clustering, association rules
Spatial data mining using co-location rules
Summary

Page 3

Statistics versus Data Mining

Do we know the statistical properties of the data? Is the data spatially clustered, dispersed, or random?

Data mining is strongly related to statistical analysis. Data mining can be seen as a filter (exploratory data analysis) applied before a rigorous statistical tool. Data mining generates hypotheses that are then verified (sometimes too many!). The filtering process does not guarantee completeness (wrong elimination or missing data).

Page 4

Data Mining

Data mining is the process of discovering interesting and potentially useful patterns of information embedded in large databases.

Spatial data mining has the same goals as conventional data mining but requires additional techniques that are tailored to the spatial domain.

A key goal of spatial data mining is to partially automate knowledge discovery, i.e., search for "nuggets" of information embedded in very large quantities of spatial data.

Page 5

Data Mining

Data mining lies at the intersection of database management, statistics, machine learning and artificial intelligence. DM provides semi-automatic techniques for discovering unexpected patterns in very large data sets.

We must distinguish between operational systems (e.g. bank account transactions) and decision support systems (e.g. data mining). DM can support decision making.

Page 6

Spatial Data Mining

SDM can be characterised by Tobler's first law of geography: near things tend to be more related than far things. This means that the standard DM assumptions that values are independently and identically distributed do not hold for spatially dependent data (SDD). The term spatial autocorrelation captures this property and motivates augmenting standard DM techniques for SDM.
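Spatial autocorrelation can be quantified. As a hedged illustration (the slide names the property but no particular statistic), the sketch below computes Moran's I, a standard measure of spatial autocorrelation, on invented attribute values over a small neighbourhood structure:

```python
# Moran's I: a common measure of spatial autocorrelation (an assumed
# choice; the slides do not prescribe a statistic). Values near +1 mean
# like values cluster in space; values near -1 mean they alternate.

def morans_i(values, weights):
    """values: one attribute value per location.
    weights: matrix w[i][j] of spatial weights (1 for neighbours)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)

# Four locations on a line; neighbours share an edge.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i([1, 1, 5, 5], w))   # like values adjacent -> positive
print(morans_i([1, 5, 1, 5], w))   # alternating values   -> negative
```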

Page 7

Spatial Data Mining

The important techniques in conventional DM are association rules, clustering, classification, and regression. These techniques need to be modified for spatial DM. Two approaches are used when adapting DM techniques to the spatial domain:
1) Adjust the underlying (iid) statistical model.
2) Include an objective function (some f(x) that we wish to maximize or minimize, which drives the search) that is modified to include a spatial term.

Page 8

Spatial Data Mining

Size of spatial data sets:
NASA's Earth Orbiting Satellites capture about a terabyte (10^12 bytes) a day; YouTube in 2008 held about 6 terabytes.
Environmental agencies, utilities (e.g. ESB), the Central Statistics Office, government departments such as health/agriculture, and local authorities all have large spatial data sets.

It is very difficult to analyse such large data sets manually or using only SQL.

For examples see Chapter 7 of SDT.

Page 9

Data Mining: Sub-processes

Data mining involves many sub-processes:

Data collection: usually the data was collected as part of the operational activities of an organization, rather than specifically for the data mining task. It is unlikely that the data mining requirements were considered during data collection.

Data extraction/cleaning: data must be extracted and cleaned for the specific data mining task.

Page 10

Data Mining: Sub-processes

Feature selection.
Algorithm design.
Analysis of output.
The level of aggregation at which the data is being analysed must be decided. Identical experiments at different levels of scale can sometimes lead to contradictory results (e.g. the choice of basic spatial unit can influence the results of a social survey).

Page 11

Geographic Data mining process

Close interaction between the domain expert and the data-mining analyst.

The output consists of hypotheses (data patterns) which can be verified with statistical tools and visualised using a GIS.

The analyst can interpret the patterns and recommend appropriate actions.

Page 12

Unique features of spatial data mining

The difference between classical and spatial data mining parallels the difference between classical and spatial statistics.

Statistics assumes the samples are independently generated, which is generally not the case with SDD.

Like things tend to cluster together. Change tends to be gradual over space.

Page 13

Non-Spatial Descriptive Data Mining

Descriptive analysis is an analysis that results in some description or summarization of data. It characterizes the properties of the data by discovering patterns in the data which would be difficult for the human analyst to identify by eye or by using standard statistical techniques. Description involves identifying rules or models that describe data. Both clustering and association rules are descriptive techniques employed by supermarket chains.

Page 14

Non-Spatial Data Mining

Page 15

Non-Spatial Descriptive Data Mining

Clustering (unsupervised learning) is a descriptive data mining technique. Clustering is the task of assigning cases into groups (clusters) so that the cases within a group are similar to each other and as different as possible from the cases in other groups. Clustering can identify groups of customers with similar buying patterns, and this knowledge can be used to help promote certain products. Clustering can help locate crime 'hot spots' in a city.

Page 16

Clustering using Similarity graphs

Problem: grouping objects into similarity classes based on various properties of the objects. For example, consider computer programs that implement the same algorithm; each has k = 3 properties:
1. Number of lines in the program
2. Number of "GOTO" statements
3. Number of function calls

Page 17

Clustering using Similarity graphs

Suppose five programs are compared using three attributes:

Program   # lines   # GOTOs   # functions
   1         66        20          1
   2         41        10          2
   3         68         5          8
   4         90        34          5
   5         75        12         14

Page 18

Clustering using Similarity graphs

A graph G is constructed as follows. V(G) is the set of programs {v1, v2, v3, v4, v5}. Each vertex vi is assigned a triple (p1, p2, p3), where pk is the value of property k = 1, 2, or 3:

v1 = (66, 20, 1)
v2 = (41, 10, 2)
v3 = (68, 5, 8)
v4 = (90, 34, 5)
v5 = (75, 12, 14)

(Figure: the vertices, not accurately positioned.)

Page 19

Clustering using Similarity graphs

Define a dissimilarity function as follows. For each pair of vertices v = (p1, p2, p3) and w = (q1, q2, q3):

s(v, w) = Σ_{k=1}^{3} |p_k − q_k|

s(v, w) is a measure of dissimilarity between any two programs v and w.

Fix a number N. Insert an edge between v and w if s(v, w) < N. Then we say that v and w are in the same class if v = w or if there is a path between v and w.

Page 20

Clustering using Similarity graphs

If we let vi correspond to program i:

s(v1,v2) = 36

s(v1,v3) = 24

s(v1,v4) = 42

s(v1,v5) = 30

s(v2,v3) = 38

s(v2,v4) = 76

s(v2,v5) = 48

s(v3,v4) = 54

s(v3,v5) = 20

s(v4,v5) = 46

s(v1,v2) = |66-41| + |20-10| + |1-2| = 36
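The full table of dissimilarities above can be reproduced mechanically. A small Python sketch (Python chosen for illustration; the slides themselves use R for this computation later):

```python
# Manhattan dissimilarity s(v, w) = sum over k of |p_k - q_k|,
# applied to the five program profiles from the slides.

from itertools import combinations

programs = {
    1: (66, 20, 1),
    2: (41, 10, 2),
    3: (68, 5, 8),
    4: (90, 34, 5),
    5: (75, 12, 14),
}

def s(v, w):
    return sum(abs(p - q) for p, q in zip(v, w))

for i, j in combinations(sorted(programs), 2):
    print(f"s(v{i},v{j}) = {s(programs[i], programs[j])}")
```

Running this reproduces the ten values on the slide, e.g. s(v1,v2) = 36 and s(v3,v5) = 20.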

Page 21

Clustering using Similarity graphs

Let N = 25. Then s(v1,v3) = 24, s(v3,v5) = 20, and all other s(vi,vj) > 25. There are three classes:

{v1, v3, v5}, {v2} and {v4}

(Figure: the similarity graph, with edges v1–v3 and v3–v5.)
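The class construction (an edge when s(v, w) < N; classes are the connected components of the resulting graph) can be sketched as follows, in Python for illustration:

```python
# Build the similarity graph (edge when s(v, w) < N) and read off the
# classes as connected components, reproducing {v1, v3, v5}, {v2}, {v4}.

programs = {1: (66, 20, 1), 2: (41, 10, 2), 3: (68, 5, 8),
            4: (90, 34, 5), 5: (75, 12, 14)}
N = 25

def s(v, w):
    return sum(abs(p - q) for p, q in zip(v, w))

# adjacency: i -- j iff s(v_i, v_j) < N
adj = {i: [j for j in programs
           if j != i and s(programs[i], programs[j]) < N]
       for i in programs}

def classes(adj):
    seen, out = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]      # depth-first search
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node])
        seen |= comp
        out.append(comp)
    return out

print(classes(adj))   # [{1, 3, 5}, {2}, {4}]
```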

Page 22

Dissimilarity matrix in R

library('cluster')
data <- matrix(c(66,20,1, 41,10,2, 68,5,8, 90,34,5, 75,12,14), ncol=3, byrow=TRUE)
diss <- daisy(data, metric = "manhattan")

Dissimilarities:
   1  2  3  4
2 36
3 24 38
4 42 76 54
5 30 48 20 46
Metric : manhattan
Number of objects : 5

Page 23

Non-Spatial Descriptive Data Mining

Association Rules. Association rule discovery (ARD) identifies relationships within data. A rule can be expressed as a predicate of the form (IF x THEN y). ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used to help decide on the layout of the product lines. We will look at ARD in detail later.

Page 24

Non-Spatial Predictive Data Mining

Predictive DM results in some description or summarization of a sample of data which predicts the form of unobserved data. Prediction involves building a set of rules or a model that will enable unknown or future values of a variable to be predicted from known values of another variable.

Page 25

Classification: Non-Spatial Predictive Data Mining

Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes. The goal of classification is to estimate the value of an attribute of a relation based on the values of the relation's other attributes.

Uses:
Classification is used in risk assessment in the insurance industry.
Determining the location of nests based on the values of vegetation durability and water depth is a location prediction problem (classification: nest or no nest).
Classifying the pixels of a satellite image into various thematic classes such as water, forest, or agriculture is a thematic classification problem.

Page 26

Classification: Non-Spatial Predictive Data Mining

Page 27

Classification: Non-Spatial Predictive Data Mining

A classifier can choose a hyperplane that best classifies the data.

Page 28

Classification techniques

A classification function, f : D -> L, maps a domain D consisting of one or more variables (e.g. vegetation durability, water depth, distance to open water) to a set of labels L (e.g. nest or not-nest).

The goal of classification is to determine the appropriate f from a finite subset Train ⊆ D × L.

The accuracy of f is determined on Test, which is disjoint from Train.

The classification problem is known as predictive modelling because it is used to predict the labels L from D.
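The Train/Test scheme above can be illustrated with a minimal classifier. The sketch below uses 1-nearest-neighbour as an assumed choice of f (the slides do not fix a technique), with invented feature values for the two variables:

```python
# A minimal classifier f : D -> L learned from Train and scored on a
# disjoint Test set. 1-nearest-neighbour is an assumed choice here; the
# (vegetation durability, water depth) values are invented.

train = [  # ((vegetation durability, water depth), label)
    ((0.9, 0.2), "nest"), ((0.8, 0.3), "nest"),
    ((0.2, 0.9), "no-nest"), ((0.1, 0.8), "no-nest"),
]
test = [((0.85, 0.25), "nest"), ((0.15, 0.85), "no-nest")]

def f(x):
    """Classify x with the label of its nearest training case."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda case: dist2(case[0], x))[1]

# Accuracy of f is measured on Test, which is disjoint from Train.
accuracy = sum(f(x) == label for x, label in test) / len(test)
print(accuracy)   # 1.0 on this toy Test set
```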

Page 29

Non-Spatial Predictive Data Mining

Regression analysis is a predictive data mining technique that uses a model to predict a value. Regression can be used to predict sales of new product lines based on advertising expenditure.
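As a sketch of that idea, a least-squares fit of y = a + b*x predicting sales from advertising spend (the data points are invented for illustration):

```python
# Simple least-squares regression: fit y = a + b*x, then use the model
# to predict sales for a new advertising spend. Data are invented.

xs = [1.0, 2.0, 3.0, 4.0]     # advertising spend
ys = [3.1, 5.0, 6.9, 9.0]     # observed sales

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))   # slope
a = my - b * mx                           # intercept

def predict(x):
    return a + b * x

print(round(predict(5.0), 2))   # extrapolated sales for a new spend
```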

Page 30

Case Study

Data from 1995 and 1996 concerning two wetlands on the shores of Lake Erie, USA.

Using this information we want to predict the spatial distribution of a marsh-breeding bird called the red-winged blackbird. Where will they build nests? What conditions do they favour?

A uniform grid (pixel = 5 square metres) was superimposed on the wetland. Seven attributes were recorded. See the link to Spatial Databases: A Tour for details.

Page 31

Case Study

Page 32

Case Study

The significance of three key variables was established with statistical analysis:

Vegetation durability
Distance to open water
Water depth

Page 33

Case Study

(Figure panels: nest locations, distance to open water, vegetation durability, water depth.)

Example showing different predictions: (a) the actual locations of nests; (b) pixels with actual nests; (c) locations predicted by one model; and (d) locations predicted by another model. Prediction (d) is spatially more accurate than (c).

Page 34

Classical statistical assumptions do not hold for spatially dependent data

Page 35

Case Study

The previous maps illustrate two important features of spatial data:

Spatial autocorrelation (values are not independent).
Spatial data is not identically distributed. (Two random variables are identically distributed if and only if they have the same probability distribution.)

Page 36

Spatial DBs need to augment classical DM techniques because:

Rich data types (e.g., extended spatial objects)
Implicit spatial relationships among the variables
Observations that are not independent
Spatial autocorrelation exists among the values of the attributes of physical locations or features

Page 37

Classical Data Mining

Association rules: determination of interaction between attributes, for example X -> Y.

Classification: estimation of an attribute of an entity in terms of attribute values of another entity. Some applications are: predicting locations (shopping centers, habitat, crime zones); thematic classification (satellite images).

Clustering: unsupervised learning, where the classes and the number of classes are unknown. Uses a similarity criterion. Applications: clustering pixels from a satellite image on the basis of their spectral signature; identifying hot spots in crime analysis and disease tracking.

Regression: takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behavior. Works well with continuous quantitative data like weight, speed or age. Not good for categorical data where order is not significant, like colour, name, gender, nest/no nest.

Page 38

Determining the Interaction among Attributes

We wish to discover relationships between attributes of a relation. Examples:

is_close(house, beach) -> is_expensive(house)
low(vegetationDurability) -> high(stemDensity)

Associations and association rules are often used to select subsets of features for more rigorous statistical correlation analysis.

In probabilistic terms an association rule X -> Y is an expression of the conditional probability P(Y|X), where P(Y|X) = P(X ∧ Y) / P(X) (the probability of Y, given X).

Page 39

Spatial Association rules

is_a(x, big_town) /\ intersect(x, highway) -> adjacent_to(x, river) [support = 7%, confidence = 85%]

The relative frequency with which an antecedent appears in a database is called its support (other definitions are possible).

The confidence of a rule A -> B is the conditional probability of B given A. Using probability notation: confidence(A implies B) = P(B | A).

In a rule A -> B, '->' is read 'implies'; A is the antecedent (also known as the hypothesis, assumption, or premise) and B is the conclusion or consequence.
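The support and confidence definitions can be computed directly over a transaction database. The baskets below are invented for illustration:

```python
# Support and confidence over a toy transaction database.
# support(itemset)     = fraction of transactions containing the itemset
# confidence(A -> B)   = P(B | A) = support(A and B) / support(A)

transactions = [
    {"bread", "butter"}, {"bread", "butter", "jam"},
    {"bread"}, {"milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))                  # bread appears in 3 of 4
print(confidence({"bread"}, {"butter"}))   # butter in 2 of 3 bread baskets
```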

Page 40

How does data mining differ from conventional methods of data analysis?

Using conventional data analysis the analyst formulates and refines the hypothesis. This is known as hypothesis verification: an approach to identifying patterns in data where a human analyst formulates and refines the hypothesis. For example, "Did the sales of cream increase when strawberries were available?"

Using data mining the hypothesis is formulated and refined without human input. This approach is known as hypothesis generation: identifying patterns in data where the hypotheses are automatically formulated and refined. Knowledge discovery is where the data mining tool formulates and refines the hypothesis by identifying patterns in the data. For example, "What are the factors that determine the sales of cream?"

Page 41

Association rules

An association rule is a pattern that can be expressed as a predicate in the form (IF x THEN y), where x and y are conditions (about cases), which states that if x (the antecedent) occurs then, in most cases, so will y (the consequence). The antecedent may contain several conditions but the consequence usually contains only one term.

Page 42

Association rules

Association rules need to be discovered. Rule discovery is a data mining technique that identifies relationships within data. In the non-spatial case rule discovery is usually employed to discover relationships within or between transactions in operational data. The relative frequency with which an antecedent appears in a database is called its support (alternatively, the fraction of transactions satisfying the rule). The frequency at which the relative frequency is considered significant is called the support threshold (say 70%).

Page 43

Association rules

Example: Market basket analysis is a form of association rule discovery that discovers relationships in the purchases made by a customer during a single shopping trip. An itemset in the context of market basket analysis is the set of items found in a customer's shopping basket.


Page 46: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Item Set

An itemset in the context of market basket analysis is the set of items found in a customer's shopping basket (or order). A general form of association rule is (IF x1 and x2 and .. xn THEN y1 and y2 and .. ym). In market basket analysis the set of items (x1 and x2 and .. xn and y1 and y2 and .. ym) is called the itemset. We are only interested in itemsets with high support (i.e. they appear together in many baskets).

Page 47: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Frequent Item Set

We then find association rules involving itemsets that appear in at least a certain percentage of the shopping baskets, called the support threshold (i.e. the frequency at which the appearance of an itemset in a shopping basket is considered significant). An itemset that appears in a percentage of baskets at or above the support threshold is called a frequent itemset.

A candidate itemset is potentially a frequent itemset.

Page 48: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm

A-Priori uses an iterative level-wise search where k-itemsets are used to explore (k+1)-itemsets.

First the set of frequent 1-itemsets is found. This is used to find the set of frequent 2-itemsets, and so on, until no more frequent k-itemsets can be found. An itemset of k items is called a k-itemset.

Page 49: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm

The algorithm follows a two-stage process.

1) Find the k-itemsets that are at or above the support threshold, giving the frequent k-itemsets. If none is found, stop; otherwise:

2) Generate the (k+1)-itemsets from the k-itemsets. Go to 1.
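The two-stage loop above might be sketched roughly as follows (a minimal illustration, ignoring the pruning optimisations of the full A-Priori algorithm; all names are invented):

```python
def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets seed the candidate (k+1)-itemsets."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    # Stage 1 (k = 1): frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    all_frequent, k = list(frequent), 1

    while frequent:
        # Stage 2: candidate (k+1)-itemsets joined from frequent k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = [c for c in candidates if is_frequent(c)]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

baskets = [{"bread", "milk"}, {"bread", "butter", "milk"},
           {"beer", "crisps"}, {"bread", "milk", "beer"}]
# With a 50% support threshold the frequent itemsets include {bread, milk}.
print(apriori(baskets, 0.5))
```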

Page 50: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm

A) The first iteration generates candidate 1-itemsets.

B) The frequent 1-itemsets are selected from the candidate 1-itemsets that satisfy the minimum support.

C) The second iteration generates candidate 2-itemsets from the frequent 1-itemsets. All possible pairs are checked to determine the frequency of each pair.

Page 51: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm

D) The frequent 2-itemsets are determined by selecting those candidate 2-itemsets that satisfy the minimum support.

E) The third iteration generates candidate 3-itemsets from the frequent 2-itemsets. All possible triples are checked to determine the frequency of each triple.

F) The frequent 3-itemsets are determined by selecting those candidate 3-itemsets that satisfy the minimum support. In this example there are none, so the algorithm terminates.

Page 52: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm: Example

A retail chain wishes to determine whether the five product lines, identified by the product codes I1, I2, I3, I4 and I5, are often purchased together by a customer on the same shopping trip. The next slide shows a summary of the transactions. The support threshold is the frequency at which the appearance of an itemset in a shopping basket is considered significant; in this case it is 2000.

Find the frequent itemsets and generate the association rules using the A-Priori algorithm.

Page 53: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

A-Priori algorithm: Example

Page 54: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne
Page 55: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

R: itemFrequencyPlot(trans, type = "absolute")

Page 56: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association Rules: A priori

Principle: If an itemset has high support, then so do all its subsets.

The steps of the algorithm are as follows:
- first, discover all 1-itemsets that are frequent
- combine them to form 2-itemsets and analyse for frequent sets
- go on until no more itemsets exceed the threshold
- search for rules
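The final step, searching for rules, can be sketched as follows (an illustrative sketch; the function and basket data are invented, not from the lecture): split each frequent itemset into antecedent/consequent pairs and keep those whose confidence clears a threshold.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf):
    """Generate antecedent -> consequent rules from one frequent itemset,
    keeping rules with confidence = support(whole set) / support(antecedent)
    at or above min_conf."""
    transactions = [set(t) for t in transactions]

    def count(s):
        return sum(s <= t for t in transactions)

    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            conf = count(items) / count(antecedent)
            if conf >= min_conf:
                rules.append((set(antecedent), set(items - antecedent), conf))
    return rules

baskets = [{"bread", "milk"}, {"bread", "butter", "milk"},
           {"beer", "crisps"}, {"bread", "milk", "beer"}]
print(rules_from_itemset({"bread", "milk"}, baskets, 0.7))
```

Here both (bread -> milk) and (milk -> bread) have confidence 1.0, since bread and milk never appear apart in these baskets.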

Page 57: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association rules

Page 58: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association rules & Spatial Domain

Differences with respect to the spatial domain:

1. The notion of transaction or case does not exist, since data are immersed in a continuous space. The partition of the space may introduce errors with respect to over- or under-estimation of confidences. The notion of transaction is replaced by neighbourhood.

2. The size of itemsets is smaller in the spatial domain. Thus, the cost of generating candidates is not a dominant factor; the enumeration of neighbours dominates the final computational cost.

3. In most cases, the spatial items are discrete versions of continuous variables.

Page 59: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Spatial Association Rules

Table 7.5 shows examples of association rules, support, and confidence that were discovered in the Darr 1995 wetland data.

Page 60: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-Location rulesCo-Location rules

Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space. The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations, see Figure 7.12 in SDAT.

Page 61: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location Examples

(a) Illustration of Point Spatial Co-location Patterns. Shapes represent different spatial feature types. Spatial features in sets {+, x} and {o, *} tend to be located together.

(b) Illustration of Line String Co-location Patterns. Highways and frontage roads are co-located, e.g., Hwy100 is near frontage road Normandale Road.

Page 62: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Answers: {+, x} and {o, *}

Two co-location patterns

Page 63: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Spatial Association Rules

A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates.

Spatial association rules (SPAR) are defined in terms of spatial predicates rather than items.

P1 ∧ P2 ∧ .. ∧ Pn → Q1 ∧ .. ∧ Qm

Where at least one of the terms (P or Q) is a spatial predicate.

is(x, country) ∧ touches(x, Mediterranean) → is(x, wine-exporter)

Page 64: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Page 65: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Transactions are disjoint, while spatial co-locations are not. Something must be done. Three main options:

1. Divide the space into areas and treat them as transactions.

2. Choose a reference point pattern and treat the neighbourhood of each of its points as a transaction.

3. Treat all point patterns as equal.
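Option 2, the reference-feature model, can be sketched as follows (an illustrative sketch; all names and data are invented): each reference point contributes one "transaction" holding the event types found within a given radius of it.

```python
from math import hypot

def neighbourhood_transactions(reference_pts, other_pts, radius):
    """Reference-feature model: for each reference point, collect the event
    types found within `radius`, forming one 'transaction' per point."""
    transactions = []
    for rx, ry in reference_pts:
        t = {label for label, (x, y) in other_pts
             if hypot(rx - x, ry - y) <= radius}
        transactions.append(t)
    return transactions

refs = [(0, 0), (10, 10)]
others = [("school", (1, 0)), ("river", (0, 1)), ("school", (11, 10))]
print(neighbourhood_transactions(refs, others, 2))
```

The resulting list of item sets can then be fed to an ordinary association rule miner, at the cost of the over- or under-estimation effects noted earlier.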

Page 66: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Page 67: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location

Page 68: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Page 69: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Page 70: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location

The participation ratio (analogous to support) of a feature Fi in a co-location C is the number of instances of Fi that participate in row instances of C, divided by the total number of instances of Fi. The participation index (analogous to confidence) measures the implication strength of a pattern from the spatial features in the pattern.
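Using these definitions (with the participation index taken as the minimum participation ratio over the features in the pattern), the two measures can be sketched as follows; the instance representation here, a (feature, id) pair, is invented for illustration.

```python
def participation_ratio(row_instances, instances_of, feature):
    """pr(C, f): fraction of instances of feature f that appear in at least
    one row instance of co-location C."""
    in_colocation = {inst for row in row_instances for inst in row
                     if inst[0] == feature}
    return len(in_colocation) / len(instances_of[feature])

def participation_index(row_instances, instances_of, features):
    """pi(C): minimum participation ratio over the features in the pattern."""
    return min(participation_ratio(row_instances, instances_of, f)
               for f in features)

instances_of = {"A": [("A", 1), ("A", 2)], "B": [("B", 1), ("B", 2), ("B", 3)]}
rows_AB = [(("A", 1), ("B", 1)), (("A", 1), ("B", 2))]
print(participation_index(rows_AB, instances_of, ["A", "B"]))  # min(1/2, 2/3) = 0.5
```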

Page 71: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location

Page 72: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

Spatial Association Rules Mining (SARM) is similar to the raster view in the sense that it tessellates a study region S into discrete groups based on spatial or aspatial predicates derived from concept hierarchies. For instance, a spatial predicate close_to(α, β) divides S into two groups: locations close to β and those not.

Page 73: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Co-location V Association Rules

So, close_to(α, β) can be either true or false depending on α's closeness to β. A spatial association rule is a rule that consists of a set of predicates in which at least one spatial predicate is involved. For instance,

is_a(α, house) and close_to(α, beach) -> expensive(α).

This approach efficiently mines large datasets using a progressive deepening approach.

Page 74: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne
Page 75: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne
Page 76: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

DM Summary

Data mining is the process of finding significant, previously unknown, and potentially valuable knowledge hidden in data. DM seeks to reveal useful and often novel patterns and relationships in the raw and summarized data in the warehouse in order to solve business problems. The answers are not pre-determined but are often discovered through exploratory methods. DM is not usually part of operational (day-to-day) systems, but rather a decision support system (sometimes once-off). The variety of data mining methods includes intelligent agents, expert systems, fuzzy logic, neural networks, exploratory data analysis, descriptive DM, predictive DM and data visualization. DM is closely related to Spatial Statistics (e.g. Moran's I).

Page 77: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Summary

The methods are able to intensively explore large amounts of data for patterns and relationships, and to identify potential answers to complex business problems. Some of the areas of application are risk analysis, quality control, and fraud detection. There are several ways GIS and spatial techniques can be incorporated in data mining. Pre-DM, a data warehouse can be spatially partitioned, so that data mining is selectively applied to certain geographies (e.g. location or theme). During the data mining process, algorithms can be modified to incorporate spatial methods. For instance, correlations can be adjusted for spatial autocorrelation (or correlation across space and time), cluster analysis can add spatial indices, and association rules can be adapted to generate co-location inferences. After data mining, patterns and relationships identified in the data can be mapped with GIS software.

Page 78: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Summary

DM Examples: co-location, location prediction

Application of SDM: the generation of co-location rules. Determining the location of nests based on the values of vegetation durability & water depth is a location prediction problem.

Page 79: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

AR-Summary

Association Rules. An association rule can be expressed as a predicate of the form (IF x1, x2 .. THEN y1, y2 ..) where the {xi, yi} are called itemsets (e.g. items in a shopping basket). The AR algorithm takes a list of itemsets as input and produces a set of rules, each with a confidence measure. Association rule discovery (ARD) identifies the relationships within data. ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.

Page 80: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

AR-Summary

Association rules are characterized by confidence and support.

Page 81: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

AR and co-location

DM examples: co-location, location prediction. Application of SDM: the generation of co-location rules. Determining the location of nests based on the values of vegetation durability & water depth is a location prediction problem.

Co-location is the presence of two or more spatial objects at the same location or at significantly close distances from each other. Co-location patterns can indicate interesting associations among spatial data objects with respect to their non-spatial attributes. For example, a data mining application could discover that sales at franchises of a specific pizza restaurant chain were higher at restaurants co-located with video stores than at restaurants not co-located with video stores.

In probabilistic terms, an association rule X -> Y is an expression of the conditional probability P(Y|X).
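This conditional-probability reading can be made concrete (an illustrative sketch; the pizza/video data echoes the example above but is invented): confidence(X -> Y) is the fraction of transactions containing X that also contain Y.

```python
def confidence(x, y, transactions):
    """confidence(X -> Y) = P(Y | X) = count(X and Y) / count(X)."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))
    return n_xy / n_x

baskets = [{"pizza", "video"}, {"pizza"}, {"pizza", "video"}, {"video"}]
print(confidence({"pizza"}, {"video"}, baskets))  # 2 of the 3 pizza baskets -> 2/3
```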

Page 82: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association rules for spatial data

Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space. The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations.

Examples of co-location patterns: predator-prey species, symbiosis, dental health and fluoride.

Page 83: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association rules for spatial data

Co-location extends traditional ARM to the case where the set of transactions is a continuum in a space, but we need additional definitions of both neighbour (say, a radius) and the statistical weight of a neighbour. We use a spatial statistic, the K function, to measure the correlation within one point pattern (same variable) or between two point patterns (different variables). K can measure whether there is no spatial correlation, attraction, or repulsion between variables (e.g. predator-prey).
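A naive cross-K estimate between two point patterns can be sketched as follows (an illustrative simplification with no edge correction; the function and data names are invented):

```python
from math import hypot

def cross_k(points_a, points_b, d, area):
    """Naive cross-K estimate: the number of A-B pairs within distance d,
    normalised by the number of A points and the intensity of pattern B.
    Under spatial independence, K(d) is roughly pi * d**2."""
    pairs = sum(1 for (x1, y1) in points_a for (x2, y2) in points_b
                if hypot(x1 - x2, y1 - y2) <= d)
    lam_b = len(points_b) / area          # intensity of pattern B
    return pairs / (len(points_a) * lam_b)

a = [(0.0, 0.0)]
b = [(0.5, 0.0), (5.0, 5.0)]
print(cross_k(a, b, 1.0, 100.0))  # well above pi * 1**2 ≈ 3.14, suggesting attraction
```

Values of K(d) well above pi*d**2 suggest attraction (co-location); values well below suggest repulsion.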

Page 84: Lecture 6 Data Mining DT786 Semester 2 2011-12 Pat Browne

Association rules for spatial data

Either the antecedent or the consequent of the rule will generally contain a spatial predicate (e.g. within X). These could be arranged as follows:

Non-spatial antecedent and spatial consequent. All primary schools are located close to new suburban housing estates.

Spatial antecedent and non-spatial consequent. Houses located close to the bay are expensive.

Spatial antecedent and spatial consequent. Residential properties located in the city are south of the river. Here the antecedent also has a non-spatial filter 'residential'.

The participation ratio and participation index are two measures which replace support and confidence here. The participation ratio of a feature Fi in a co-location C is the number of instances of Fi that participate in row instances of C, divided by the total number of instances of Fi.

Example of a spatial association rule:

is_a(x, big_town) /\ intersect(x, highway) -> adjacent_to(x, river)

[support = 7%, confidence = 85%]
[participation ratio = 7%, participation index = 85%]