Classification & Clustering


Page 1: Classification & Clustering

Computer Science
Universiteit Maastricht
Institute for Knowledge and Agent Technology

Classification & Clustering

Pieter Spronck
http://www.cs.unimaas.nl/p.spronck

Page 2: Classification & Clustering


Binary Division of Marbles

Page 3: Classification & Clustering


Big vs. Small

Page 4: Classification & Clustering


Transparent vs. Opaque

Page 5: Classification & Clustering


Marble Attributes
- Size (big vs. small)
- Transparency (transparent vs. opaque)
- Shininess (shiny vs. dull)
- Colouring (monochrome vs. polychrome)
- Colour (blue, green, yellow, …)
- …

Page 6: Classification & Clustering


Grouping of Marbles

Page 7: Classification & Clustering


“Marbles”

Page 8: Classification & Clustering


“Honouring All Distinctions”

Page 9: Classification & Clustering


“Colour Coding”

Page 10: Classification & Clustering


if transparent then
    if coloured glass then group 1
    else group 3
else group 2

(the slide shows the resulting groups 1, 2 and 3 as pictures of marbles)

“Natural Grouping”
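The rule above is essentially executable as-is. A minimal Python sketch; the attribute names "transparent" and "coloured_glass" are illustrative assumptions about how a marble record might be encoded:

def natural_group(marble):
    # Decision rule from the slide: transparent coloured glass -> group 1,
    # transparent clear glass -> group 3, opaque -> group 2.
    if marble["transparent"]:
        return 1 if marble["coloured_glass"] else 3
    return 2

# Example usage with three hypothetical marbles:
marbles = [
    {"transparent": True,  "coloured_glass": True},   # -> group 1
    {"transparent": False, "coloured_glass": False},  # -> group 2
    {"transparent": True,  "coloured_glass": False},  # -> group 3
]
print([natural_group(m) for m in marbles])  # [1, 2, 3]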

Page 11: Classification & Clustering


Types of Clusters
- Uniquely classifying clusters
- Overlapping clusters
- Probabilistic clusters
- Dendrograms

Page 12: Classification & Clustering


Uniquely Classifying Clusters

Page 13: Classification & Clustering


Overlapping Clusters

Page 14: Classification & Clustering


Probabilistic Clustering

                  Cluster
Typical sample    Green    Blue
1                 1.0      0.0
2                 0.0      1.0
3                 0.1      0.9
4                 0.5      0.5

Page 15: Classification & Clustering


Dendrogram

(figure: a tree that splits the marbles into opaque vs. transparent, with a further split into clear vs. not clear glass)

Page 16: Classification & Clustering


Classification
- Ordering of entities into groups based on their similarity
- Minimisation of within-group variance
- Maximisation of between-group variance (both variances are sketched below)
- Exhaustive and exclusive
- Principal technique: clustering
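To make the two variance criteria concrete, here is a small sketch with invented 1-D numbers; the grouping and the values are purely illustrative:

import numpy as np

# Two made-up groups of 1-D measurements.
groups = [np.array([1.0, 1.2, 0.9]), np.array([5.0, 5.3, 4.8])]
all_points = np.concatenate(groups)
grand_mean = all_points.mean()
n = len(all_points)

# Within-group variance: spread of the points around their own group mean.
within = sum(((g - g.mean()) ** 2).sum() for g in groups) / n

# Between-group variance: spread of the group means around the grand mean,
# weighted by group size.
between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / n

print(within, between)  # a good grouping: small `within`, large `between`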

Page 17: Classification & Clustering


Reasons for Classification
- Descriptive power
- Parsimony
- Maintainability
- Versatility
- Identification of distinctive attributes

Page 18: Classification & Clustering


Typology vs. Taxonomy
- Typology – conceptual
- Taxonomy – empirical

Page 19: Classification & Clustering


Typology
- Define conceptual attributes
- Select appropriate attributes
- Create typology matrix (substruction)
- Insert empirical entities in matrix
- Extend matrix if necessary
- Reduce matrix if necessary

Page 20: Classification & Clustering


Defining Conceptual Attributes
- Meaningful
- Focus on ideal types
- Order of importance
- Exhaustive domains

Page 21: Classification & Clustering


Conceptual Marble Attributes

Page 22: Classification & Clustering


Typology Matrix (columns: Transparency; rows: Colouring)

              Opaque      Transparent
Monochrome
Polychrome
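Substruction, i.e. crossing the domains of the chosen conceptual attributes to obtain the cells of the matrix, is a plain Cartesian product. A small sketch; the attribute dictionary simply restates the matrix above:

from itertools import product

# Conceptual attributes and their (exhaustive) domains.
attributes = {
    "transparency": ["opaque", "transparent"],
    "colouring": ["monochrome", "polychrome"],
}

# Every combination of attribute values is one cell of the typology matrix.
cells = [dict(zip(attributes, combo)) for combo in product(*attributes.values())]
for cell in cells:
    print(cell)
# 2 x 2 = 4 cells; each added attribute multiplies the number of cells,
# which is why the matrix later needs to be reduced.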

Page 23: Classification & Clustering


Matrix Extension (columns: Transparency × Glass; rows: Colouring × Size)

                      Transparent              Opaque
                      Clear      Not clear     Clear      Not clear
Monochrome   Big
             Small
Polychrome   Big
             Small

Page 24: Classification & Clustering


Reduction
- Functional reduction
- Pragmatic reduction
- Numerical reduction
- Reduction by using criterion types

Page 25: Classification & Clustering


Functional Reduction (columns: Transparency × Glass; rows: Colouring × Size)

                      Transparent              Opaque
                      Clear      Not clear     Clear      Not clear
Monochrome   Big
             Small
Polychrome   Big
             Small

Page 26: Classification & Clustering


Functionally Reduced Matrix (columns: Transparency × Glass; rows: Colouring × Size)

                      Transparent              Opaque
                      Clear      Not clear
Monochrome   Big
             Small
Polychrome   Big
             Small

Page 27: Classification & Clustering


Pragmatic Reduction (columns: Transparency × Glass; rows: Colouring × Size)

                      Transparent              Opaque
                      Clear      Not clear     Clear      Not clear
Monochrome   Big
             Small
Polychrome   Big
             Small

Page 28: Classification & Clustering


Pragmatically Reduced Matrix (columns: Transparency × Glass; rows: Size × Colouring)

                      Transparent              Opaque
                      Clear      Not clear
Small   Monochrome
        Polychrome
Big

Page 29: Classification & Clustering


Criticising Typological Classification
- Reification
- Resilience
- Problematic attribute selection
- Unmanageability

Page 30: Classification & Clustering


Taxonomy
- Define empirical attributes
- Select appropriate attributes
- Create entity matrix
- Apply clustering technique
- Analyse clusters

Page 31: Classification & Clustering


Empirical Attributes

(figure: marbles annotated with observed attributes such as big, single colour, lots of colours, green glass, transparent, blue, yellow, white, dull, and shiny)

Page 32: Classification & Clustering


Selecting Attributes
- Size (big/small)
- Colour (yellow, green, blue, red, white, …)
- Colouring (monochrome/polychrome)
- Shininess (shiny/dull)
- Transparency (transparent/opaque)
- Glass colour (clear, green, …)

Page 33: Classification & Clustering


Entity Matrix

Marble   Big   Monochrome   Shiny   Transparent
1        N     Y            Y       Y
2        N     Y            Y       Y
3        N     Y            Y       Y
4        N     Y            Y       Y
5        N     Y            Y       Y
6        N     N            N       N
7        Y     N            N       N
8        Y     Y            Y       N
9        N     Y            Y       N
10       N     Y            Y       N
11       N     Y            Y       N
12       N     Y            Y       N
13       N     Y            Y       N
14       N     N            Y       Y
15       Y     N            Y       Y

(marble numbering added for reference; the original slide lays the matrix out in two side-by-side blocks)

Page 34: Classification & Clustering


Automatic Clustering Parameters
- Agglomerative vs. divisive
- Monothetic vs. polythetic
- Outliers permitted
- Limits to number of clusters
- Form of linkage (single, complete, average; see the sketch below)
- …
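These parameters map directly onto off-the-shelf hierarchical clustering. Below is a hedged sketch using SciPy on the Big/Monochrome/Shiny/Transparent rows of the entity matrix; the choice of Hamming distance and the linkage form are illustrative, not necessarily what was used in the lecture:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Entity matrix rows (Big, Monochrome, Shiny, Transparent), Y=1 / N=0.
X = np.array([
    [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1],
    [0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
    [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0],
    [0, 0, 1, 1], [1, 0, 1, 1],
])

# Agglomerative (bottom-up), polythetic clustering of the marbles.
distances = pdist(X, metric="hamming")              # pairwise dissimilarities
tree = linkage(distances, method="average")         # "single", "complete" or "average"
labels = fcluster(tree, t=4, criterion="maxclust")  # limit the number of clusters to 4
print(labels)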

Page 35: Classification & Clustering


Automatic Clustering

(patterns over Big, Monochrome, Shiny, Transparent; * = either value)

N Y Y Y   small, monochrome, shiny, transparent
* N N N   polychrome, dull, opaque
* N Y Y   polychrome, shiny, transparent
Y Y Y N   big, monochrome, shiny, opaque
N Y Y N   small, monochrome, shiny, opaque

Page 36: Classification & Clustering


Polythetic to Monothetic

N Y Y Y   small, monochrome, shiny, transparent
* N N N   polychrome, dull, opaque
* N Y Y   polychrome, shiny, transparent
N Y Y N   small, monochrome, shiny, opaque
* Y Y N   monochrome, shiny, opaque

Page 37: Classification & Clustering


Analysing Clusters

“Vanilla”: small, monochrome, shiny, transparent
“Stone”: polychrome, dull, opaque
“Tiger”: polychrome, shiny, transparent
“Classic”: small, monochrome, shiny, opaque

Page 38: Classification & Clustering


Criticising Taxonomical Classification
- Dependent on specimens
- Difficult to generalise
- Difficult to label
- Biased towards academic discipline
- Not the “last word”

Page 39: Classification & Clustering


Typology vs. Taxonomy

Typology                                          Taxonomy
Conceptual                                        Empirical
Subjective                                        Objective
Manual                                            (Mostly) automatic
Less discriminative                               More discriminative
Goes awry when there are insufficient insights    Goes awry when there are insufficient specimens

Page 40: Classification & Clustering


Operational Classification

Typology (conceptual) + Taxonomy (empirical) → Operational typology (conceptual + empirical)

Page 41: Classification & Clustering


Automated Clustering Methods
- Iterative distance-based clustering: the k-means method
- Incremental clustering: the Cobweb method
- Probability-based clustering: the EM algorithm

Page 42: Classification & Clustering


k-Means Method
- Iterative distance-based clustering (sketched below)
- Divisive
- Polythetic
- Predefined number of clusters (k)
- Outliers permitted
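A minimal Python sketch of the method; random initialisation and Euclidean distance on 0/1 attribute vectors are my assumptions, and the passes illustrated on the next slides correspond to iterations of the loop:

import numpy as np

def k_means(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen instances as cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each instance joins the nearest cluster centre.
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centre becomes the average of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Marbles encoded as (Big, Monochrome, Shiny, Transparent), Y=1 / N=0.
X = np.array([[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 0],
              [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]])
labels, centres = k_means(X, k=2)
print(labels, centres)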

Page 43: Classification & Clustering


k-Means (pass 1)

k = 2; attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)


Page 44: Classification & Clustering


k-Means (pass 2)

Cluster average: small, monochrome, shiny, transparent

Cluster average: small, polychrome, dull, opaque

k = 2; attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 45: Classification & Clustering


k-Means (pass 3)

Cluster average: small, monochrome, shiny, transparent

Cluster average: big, polychrome, dull, opaque

k = 2; attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)


Page 46: Classification & Clustering


Cobweb Algorithm
- Incremental clustering
- Agglomerative
- Polythetic
- Dynamic number of clusters
- Outliers permitted

Page 47: Classification & Clustering


Cobweb Procedure
- Builds a tree by adding instances to it
- Uses a Category Utility function to determine the quality of the clustering
- Changes the tree structure if this positively influences the Category Utility (by merging or splitting nodes)
- A “cutoff” value may be used to group sufficiently similar instances together

Page 48: Classification & Clustering


Category Utility

- A measure for the quality of a clustering
- The better the average attribute values of the instances in the clusters predict the individual attribute values, the higher the CU will be

CU(C_1, …, C_k) = (1/k) Σ_l Pr[C_l] Σ_i Σ_j ( Pr[a_i = v_ij | C_l]² − Pr[a_i = v_ij]² )

where a_i is the i-th attribute and v_ij its j-th value.
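A direct transcription of this formula into Python (a sketch; encoding instances as attribute-to-value dictionaries is my assumption). It reproduces the three “Size” examples worked out on the following slides:

from collections import Counter

def category_utility(clusters):
    # clusters: list of clusters, each a list of instances (dicts attribute -> value).
    instances = [x for cluster in clusters for x in cluster]
    n, k = len(instances), len(clusters)
    attributes = instances[0].keys()
    # Overall value counts, for Pr[a_i = v_ij].
    overall = {a: Counter(x[a] for x in instances) for a in attributes}

    cu = 0.0
    for cluster in clusters:
        p_cluster = len(cluster) / n                      # Pr[C_l]
        inner = 0.0
        for a in attributes:
            within = Counter(x[a] for x in cluster)       # counts inside this cluster
            for value, count in overall[a].items():
                p_within = within[value] / len(cluster)   # Pr[a_i = v_ij | C_l]
                p_overall = count / n                     # Pr[a_i = v_ij]
                inner += p_within ** 2 - p_overall ** 2
        cu += p_cluster * inner
    return cu / k

# Six marbles (two big, four small) split as in the three "Size" examples:
big, small = {"size": "big"}, {"size": "small"}
print(category_utility([[big, small, small], [big, small, small]]))   # 0.0
print(category_utility([[big, big, small], [small, small, small]]))   # 0.111... = 1/9
print(category_utility([[big, big], [small, small, small, small]]))   # 0.222... = 2/9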

Page 49: Classification & Clustering


Category Utility for “Size” (1)

(figure: the marbles divided into clusters C1 and C2)

a) Pr[size=big|C1] = 1/3
b) Pr[size=big|C2] = 1/3
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 2/3
f) Pr[size=small|C2] = 2/3
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2 = 0

Page 50: Classification & Clustering


Category Utility for “Size” (2)

(figure: the marbles divided into clusters C1 and C2)

a) Pr[size=big|C1] = 2/3
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/2
e) Pr[size=small|C1] = 1/3
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 1/2

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2
   = ((1/2)((1/3)+(–1/3)) + (1/2)((–1/9)+(5/9)))/2 = 1/9

Page 51: Classification & Clustering


Category Utility for “Size” (3)

(figure: the marbles divided into clusters C1 and C2)

a) Pr[size=big|C1] = 1
b) Pr[size=big|C2] = 0
c) Pr[size=big] = 1/3
d) Pr[C1] = 1/3
e) Pr[size=small|C1] = 0
f) Pr[size=small|C2] = 1
g) Pr[size=small] = 2/3
h) Pr[C2] = 2/3

CU = (d((a²–c²)+(e²–g²)) + h((b²–c²)+(f²–g²)))/2
   = ((1/3)((8/9)+(–4/9)) + (2/3)((–1/9)+(5/9)))/2 = 2/9

Page 52: Classification & Clustering


Cobweb Example


attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 53: Classification & Clustering


Cobweb Result Example

attributes: size (big/small), colouring (monochrome/polychrome), shininess (shiny/dull), transparency (transparent/opaque)

Page 54: Classification & Clustering


Cobweb Numerical
- The probability of attribute values for the instances in a cluster is based on the standard deviation around the estimated mean value
- Acuity is the presumed minimum variance in attribute values

Page 55: Classification & Clustering


Disadvantages of Previous Methods
- Fast and hard to judge
- Dependent on initial setup
- Ad-hoc limitations
- Hard to escape from local minima

Page 56: Classification & Clustering


Probability-based Clustering
- Finite mixture models
- Each cluster is defined by a vector of probabilities for instances to have certain values for their attributes, and a probability for instances to reside in the cluster
- Clustering equals searching for optimal sets of probabilities for a sample set

Page 57: Classification & Clustering


Expectation-Maximisation (EM)
- Probability-based clustering
- Divisive
- Polythetic
- Predefined number of clusters (k)
- Outliers permitted

Page 58: Classification & Clustering


EM Procedure
- Select k cluster vectors randomly
- Calculate cluster probabilities for each instance (under the assumption that the instance attributes are independent)
- Use the calculations to re-estimate the values
- Repeat until the increase in quality becomes negligible (see the sketch below)
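A hedged sketch of this procedure for k clusters over independent 0/1 attributes (Big, Monochrome, Shiny, Transparent), matching the finite mixture model described on the probability-based clustering slide above; the initialisation, the fixed iteration count, and the clipping are illustrative choices:

import numpy as np

def em_binary(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # Pr[C_l], the cluster probabilities
    theta = rng.uniform(0.25, 0.75, (k, d))    # Pr[attribute = Y | C_l]

    for _ in range(n_iter):
        # E-step: probability of each cluster for each instance, assuming
        # the attributes are independent within a cluster.
        lik = pi * np.prod(theta ** X[:, None, :] * (1 - theta) ** (1 - X[:, None, :]), axis=2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted instances.
        pi = resp.mean(axis=0)
        theta = np.clip((resp.T @ X) / resp.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, theta

X = np.array([[0, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0],
              [1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 1, 0]])
pi, theta = em_binary(X, k=2)
print(pi)     # estimated cluster probabilities
print(theta)  # per-cluster probability of each attribute being "Y"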

Page 59: Classification & Clustering


EM Result Example

Cluster C1: p(C1) = 0.2, p(big) = 0.6, p(monochrome) = 0.3, p(shiny) = 0.4, p(transparent) = 0.4
Cluster C2: p(C2) = 0.8, p(big) = 0.2, p(monochrome) = 0.8, p(shiny) = 0.9, p(transparent) = 0.5

Per-cluster products p(C) * p(attribute values | C) for three example marbles (left: C1, right: C2):

small, monochrome, shiny, opaque:       0.2*0.4*0.3*0.4*0.6 = 0.0058    0.8*0.8*0.8*0.9*0.5 = 0.2304
small, polychrome, dull, opaque:        0.2*0.4*0.7*0.6*0.6 = 0.0202    0.8*0.8*0.2*0.1*0.5 = 0.0064
big, polychrome, shiny, transparent:    0.2*0.6*0.7*0.4*0.4 = 0.0134    0.8*0.2*0.2*0.9*0.5 = 0.0144
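The products above are not yet cluster memberships; dividing each marble's pair of products by their sum gives its membership probabilities. This small snippet just restates the numbers above:

# Normalise each marble's per-cluster products into membership probabilities.
rows = [(0.0058, 0.2304), (0.0202, 0.0064), (0.0134, 0.0144)]
for p1, p2 in rows:
    total = p1 + p2
    print(round(p1 / total, 3), round(p2 / total, 3))
# -> 0.025 0.975, 0.759 0.241, 0.482 0.518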

Page 60: Classification & Clustering


The Essence of Classification
- A successful classification defines fundamental characteristics
- A classification can never be better than the attributes it is based upon
- There is no magic formula