
Data Mining: Concepts and Techniques (3rd ed.)

Chapter 10

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

What is Cluster Analysis?

- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications

- Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City-planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding earth climate, find patterns of atmosphere and ocean
- Economic science: market research

Clustering as a Preprocessing Tool (Utility)

- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing: vector quantization
- Finding K-nearest neighbors: localizing search to one or a small number of clusters
- Outlier detection: outliers are often viewed as those "far away" from any cluster

Quality: What Is Good Clustering?

- A good clustering method will produce high quality clusters
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method
  - its implementation, and
  - its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically metric: d(i, j)
- The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables

Considerations for Cluster Analysis

- Partitioning criteria: single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges

- Scalability: clustering all the data instead of only on samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixture of these
- Constraint-based clustering
  - User may give inputs on constraints
  - Use domain knowledge to determine input parameters
- Interpretability and usability
- Others
  - Discovery of clusters with arbitrary shape
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - High dimensionality

Major Clustering Approaches (I)

- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: Diana, Agnes, BIRCH, CAMELEON
- Density-based approach: based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach: based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)

- Model-based: a model is hypothesized for each of the clusters and tries to find the best fit of that model to each other
  - Typical methods: EM, SOM, COBWEB

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Partitioning Algorithms: Basic Concepts

- Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized, where c_i is the centroid or medoid of cluster C_i:

  $E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2$

- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment does not change
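A minimal NumPy sketch of these four steps (not the book's code; the random seeding, empty-cluster handling, and stopping rule here are simplifying assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd-style k-means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrary initial seeds, here k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the centroids (hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```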

An Example of K-Means Clustering

- K = 2
- Arbitrarily partition objects into k groups
- Repeat
  - Compute the centroid (i.e., mean point) for each partition
  - Assign each object to the cluster of its nearest centroid
- Until no change

Comments on the K-Means Method

- Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
  - Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
- Comment: often terminates at a local optimum

Variations of the K-Means Method

- Most of the variants of the k-means differ in
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers
  - Since an object with an extremely large value may substantially distort the distribution of the data
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster

[Figure: two 10x10 scatter plots contrasting the mean and the medoid as cluster representatives]

PAM: A Typical K-Medoids Algorithm

[Figure: K = 2 example on a 10x10 grid of points, Total Cost = 20]

- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Randomly select a non-medoid object, O_random
- Compute the total cost of swapping (Total Cost = 26)
- Swap O and O_random if the quality is improved
- Do loop until no change

The K-Medoid Clustering Method

- K-Medoids clustering: find representative objects (medoids) in clusters
- PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
- Efficiency improvement on PAM
  - CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
  - CLARANS (Ng & Han, 1994): randomized re-sampling

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Hierarchical Clustering

- Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: objects a, b, c, d, e merged step by step (Step 0 to Step 4) in the agglomerative direction (AGNES), and split in the reverse direction (Step 4 to Step 0, divisive, DIANA)]

AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., Splus
- Use the single-link method and the dissimilarity matrix
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster

[Figure: three stages of single-link merging on a 2-D point set]
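AGNES itself ships with S-plus; as a rough stand-in, SciPy's hierarchical clustering performs the same single-link merging (the toy data and the cut level below are arbitrary choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                 # toy 2-D data
Z = linkage(X, method='single')           # AGNES-style single-link merging
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
```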

Dendrogram: Shows How Clusters are Merged

- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster

DIANA (Divisive Analysis)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

[Figure: three stages of divisive splitting on a 2-D point set]

Distance between Clusters

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq)
- Average: avg distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j)
- Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j)
  - Medoid: a chosen, centrally located object in the cluster
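The first four measures translate directly into code; a small NumPy sketch, assuming Euclidean distance and clusters given as (m, d) arrays:

```python
import numpy as np

def pairwise(Ki, Kj):
    """All |Ki| x |Kj| Euclidean distances between two clusters."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()

def centroid_dist(Ki, Kj):
    # distance between the two cluster centroids (mean points)
    return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
```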

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

- Centroid: the "middle" of a cluster:

  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

- Radius: square root of average distance from any point of the cluster to its centroid:

  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

- Diameter: square root of average mean squared distance between all pairs of points in the cluster:

  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
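A direct NumPy translation of the three formulas (a sketch; the cluster is assumed to be an (N, d) array):

```python
import numpy as np

def cluster_stats(X):
    """Centroid, radius, and diameter of a cluster X of shape (N, d)."""
    N = len(X)
    centroid = X.sum(axis=0) / N
    # radius: sqrt of the average squared distance from points to the centroid
    radius = np.sqrt(((X - centroid) ** 2).sum() / N)
    # diameter: sqrt of the average squared distance over all ordered pairs
    # (the i = j terms contribute zero, matching the double sum above)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum()
    diameter = np.sqrt(sq / (N * (N - 1)))
    return centroid, radius, diameter
```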

Extensions to Hierarchical Clustering

- Major weakness of agglomerative clustering methods
  - Can never undo what was done previously
  - Do not scale well: time complexity of at least O(n²), where n is the number of total objects
- Integration of hierarchical & distance-based clustering
  - BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

- Zhang, Ramakrishnan & Livny, SIGMOD'96
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and sensitive to the order of the data records

Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

- N: number of data points
- LS: linear sum of N points: $LS = \sum_{i=1}^{N} X_i$
- SS: square sum of N points: $SS = \sum_{i=1}^{N} X_i^2$

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
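A small sketch of the CF tuple, its additivity (the CF of two merged subclusters is just the component-wise sum), and the radius recovered from the CF alone:

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of points: (N, LS, SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CF additivity: merging two subclusters adds the components."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)        # 5 [16. 30.] [ 54. 190.]

# the radius follows from the CF alone: R^2 = SS/N - (LS/N)^2, summed over dims
R = np.sqrt(SS.sum() / N - ((LS / N) ** 2).sum())
```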

CF-Tree in BIRCH

- Clustering feature
  - Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in a tree has descendants or "children"
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: max # of children
  - Threshold: max diameter of sub-clusters stored at the leaf nodes

The CF Tree Structure

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and nonleaf nodes hold entries CF_1, CF_2, ... with child pointers, and the leaf nodes are chained by prev/next pointers]

The Birch Algorithm

- Cluster diameter:

  $D = \sqrt{\frac{1}{n(n-1)} \sum_i \sum_j (x_i - x_j)^2}$

- For each point in the input
  - Find the closest leaf entry
  - Add the point to the leaf entry and update the CF
  - If entry diameter > max_diameter, then split the leaf, and possibly the parents
- Algorithm is O(n)
- Concerns
  - Sensitive to insertion order of data points
  - Since we fix the size of leaf nodes, clusters may not be so natural
  - Clusters tend to be spherical given the radius and diameter measures

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

- CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
- Measures the similarity based on a dynamic model
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters
- Graph-based, and a two-phase algorithm
  1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON

[Figure: pipeline — Data Set → Construct (K-NN) Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]

- K-NN graph: p and q are connected if q is among the top-k closest neighbors of p

[…]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

DBSCAN: The Algorithm

- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
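For experimentation, scikit-learn ships a DBSCAN driven by the same two parameters (eps plays the role of Eps, min_samples of MinPts); the data and parameter values here are arbitrary stand-ins:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
# label -1 marks noise points that belong to no density-connected cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```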

DBSCAN: Sensitive to Parameters

[Figure lost in transcription]

OPTICS: A Cluster-Ordering Method (1999)

- OPTICS: Ordering Points To Identify the Clustering Structure
  - Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  - Produces a special order of the database wrt its density-based clustering structure
  - This cluster-ordering contains info equivalent to the density-based clusterings corresponding to a broad range of parameter settings
  - Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
  - Can be represented graphically or using visualization techniques

    @"TC: o*e &8tension ro*

  • 5/17/2018 10ClusBasic

    48/101

    @"TC: o*e &8tension ro*D!CAB

    Inde.-based*

    1 L n)mber of dimensions

    7 L $%

    p L BZ

    M L 7&-p L Comple.ity* 5%log%

    Core :istance* min eps s't' point is core

    >eachability :istance

    p2

    *i+t = 5

    = 3 c.

    *a! (c%re'ditace (%), d (%, p))

    r(p1, %) = 28c. r(p2,%) = 4c.

    %

    %

    p1

    -4

[Figure: OPTICS reachability plot — reachability-distance (undefined at cluster starts) vs. the cluster order of the objects]

Density-Based Clustering: OPTICS & Its Applications

[Figure lost in transcription]

DENCLUE: Using Statistical Density Functions

- DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
- Using statistical density functions:
  - influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
  - total influence on x: $f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
  - gradient of x in the direction of x_i: $\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
- Major features
  - Solid mathematical foundation
  - Good for data sets with large amounts of noise
  - Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - Significantly faster than existing algorithms (e.g., DBSCAN)
  - But needs a large number of parameters
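The two Gaussian functions translate directly into code; a sketch of the density and its gradient (hill-climbing along the gradient is what leads points to their density attractors):

```python
import numpy as np

def gaussian_density(x, X, sigma=1.0):
    """Total Gaussian influence of all data points X (shape (N, d)) on x."""
    d2 = ((X - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gaussian_gradient(x, X, sigma=1.0):
    """Gradient of the overall density at x."""
    d2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((X - x) * w[:, None]).sum(axis=0)
```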

Denclue: Technical Essence

- Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- Overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors
  - Density attractors are local maxima of the overall density function
- Center-defined clusters: assign to each density attractor the points density-attracted to it
- Arbitrary shaped clusters: merge density attractors that are connected through paths of high density (> threshold)

Density Attractor

[Figure lost in transcription]

Center-Defined and Arbitrary Clusters

[Figure lost in transcription]

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Grid-Based Clustering Method

- Using multi-resolution grid data structure
- Several interesting methods
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)

STING: A Statistical Information Grid Approach

    58/101

    The TB% Clustering Method

    3ach cell at a high level is partitioned into a n)mber ofsmaller cells in the ne.t lower level

    !tatistical info of each cell is calc)lated and storedbeforehand and is )sed to answer 4)eries

    Parameters of higher level cells can be easily calc)latedfrom parameters of lower level cell count, mean, s, min, max type of distrib)tion\normal, uniform, etc'

    Use a top-down approach to answer spatial data 4)eries !tart from a pre-selected layer\typically with a small

    n)mber of cells "or each cell in the c)rrent level comp)te the confidence

    interval/4

STING Algorithm and Its Analysis

- Remove the irrelevant cells from further consideration
- Repeat this process until the bottom layer is reached
- Advantages
  - Query-independent, easy to parallelize, incremental update
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages
  - All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
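A sketch of the bottom-up parameter propagation: the cell record below is hypothetical, but count, mean, min, and max aggregate from children exactly as stated above (standard deviation would combine similarly from per-cell sums of squares):

```python
from dataclasses import dataclass

@dataclass
class Cell:            # hypothetical per-cell statistics record
    count: int
    mean: float
    min: float
    max: float

def parent_cell(children):
    """Statistics of a higher-level cell derived from its lower-level cells."""
    n = sum(c.count for c in children)
    mean = sum(c.mean * c.count for c in children) / n if n else 0.0
    return Cell(count=n,
                mean=mean,
                min=min(c.min for c in children),
                max=max(c.max for c in children))
```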

CLIQUE (Clustering In QUEst)

- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
- Automatically identifying subspaces of a high dimensional data space that allow better clustering than the original space
- CLIQUE can be considered as both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  - A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate minimal description for the clusters
  - Determine maximal regions that cover a cluster of connected dense units for each cluster
  - Determination of minimal cover for each cluster
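A sketch of the first step only (finding dense 1-D intervals; the grid resolution xi and density threshold tau are assumed parameters); higher-dimensional candidate units would then be joined Apriori-style from these:

```python
import numpy as np
from collections import Counter

def dense_units_1d(X, xi=10, tau=0.05):
    """1-D pass of CLIQUE: dense intervals per dimension."""
    n, d = X.shape
    # partition every dimension into xi equal-length intervals
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    bins = ((X - X.min(axis=0)) / span * xi).astype(int).clip(0, xi - 1)
    counts = Counter()
    for dim in range(d):
        for b in bins[:, dim]:
            counts[(dim, b)] += 1
    # a unit is dense if its fraction of points exceeds the threshold tau
    return {u for u, c in counts.items() if c / n > tau}
```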

[Figure: CLIQUE example — salary (10,000) and vacation (week) plotted against age (20–60); dense units in the (age, salary) and (age, vacation) planes, found with density threshold τ = 3, are intersected to locate a candidate cluster in (age, salary, vacation) space]

Strength and Weakness of CLIQUE

- Strength
  - Automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces
  - Insensitive to the order of records in input and does not presume some canonical data distribution
  - Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
- Weakness
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Assessing Clustering Tendency

- Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
- Test spatial randomness by statistical test: Hopkins Statistic
  - Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
  - Sample n points, p_1, ..., p_n, uniformly from D. For each p_i, find its nearest neighbor in D: x_i = min{dist(p_i, v)} where v in D
  - Sample n points, q_1, ..., q_n, uniformly from D. For each q_i, find its nearest neighbor in D − {q_i}: y_i = min{dist(q_i, v)} where v in D and v ≠ q_i
  - Calculate the Hopkins Statistic:

    $H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$

  - If D is uniformly distributed, Σx_i and Σy_i will be close to each other and H is close to 0.5. If D is highly skewed, H is close to 0
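A sketch of the statistic with SciPy's KD-tree (the sample size n and the use of the bounding box as the sampling space are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """Hopkins statistic of dataset D (shape (m, d)); H near 0.5 => uniform."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)
    # x_i: distance from a uniform random location in the data space to D
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    x, _ = tree.query(p, k=1)
    # y_i: distance from a sampled data point to its nearest OTHER point in D
    q = D[rng.choice(len(D), n, replace=False)]
    y, _ = tree.query(q, k=2)     # k=2 so the point itself (distance 0) is skipped
    y = y[:, 1]
    return y.sum() / (x.sum() + y.sum())
```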

Determine the Number of Clusters

- Empirical method
  - # of clusters ≈ √(n/2) for a dataset of n points
- Elbow method
  - Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
- Cross-validation method
  - Divide a given data set into m parts
  - Use m − 1 parts to obtain a clustering model
  - Use the remaining part to test the quality of the clustering
    - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and the closest centroids to measure how well the model fits the test set
  - For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that fits the data the best
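A sketch of the elbow curve with scikit-learn (KMeans.inertia_ is exactly the within-cluster sum of squared distances; the data is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)
# compute the SSE for k = 1..10 and look for the "elbow" (turning point)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]
for k, e in enumerate(sse, start=1):
    print(k, round(e, 2))
```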

Measuring Clustering Quality

- Two methods: extrinsic vs. intrinsic
- Extrinsic: supervised, i.e., the ground truth is available
  - Compare a clustering against the ground truth using certain clustering quality measures
  - Ex. BCubed precision and recall metrics
- Intrinsic: unsupervised, i.e., the ground truth is unavailable
  - Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are
  - Ex. Silhouette coefficient
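A sketch of the intrinsic route using scikit-learn's silhouette coefficient (the data and k are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# mean silhouette over all points: near 1 = compact and well separated,
# near 0 = overlapping clusters, negative = many likely misassigned points
print(silhouette_score(X, labels))
```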

Measuring Clustering Quality: Extrinsic Methods

- Clustering quality measure: Q(C, C_g), for a clustering C given the ground truth C_g
- Q is good if it satisfies the following four essential criteria
  - Cluster homogeneity: the purer, the better
  - Cluster completeness: should assign objects belonging to the same category in the ground truth to the same cluster
  - Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
  - Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces

Chapter 10. Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Evaluation of Clustering
- Summary

Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Measure of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
- Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
- DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
- STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
- Quality of clustering results can be evaluated in various ways

Introduction

- Coverage
  - Cluster Analysis: Chapter 11
  - Outlier Detection: Chapter 12
  - Mining Sequence Data: BK2: Chapter 8
  - Mining Graphs Data: BK2: Chapter …
  - Social and Information Network Analysis: BK2: Chapter …
    - Partial coverage: Mark Newman, "Networks: An Introduction", Oxford U., 2010
    - Scattered coverage: Easley and Kleinberg, "Networks, Crowds, and Markets: Reasoning About a Highly Connected World"

References (1)

- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99
- Beil F., Ester M., Xu X.: Frequent Term-Based Text Clustering. KDD'02
- M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98
- V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99

References (2)

- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98
- S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98
- S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March 1999
- A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
- G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98

References (3)

- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988
- A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases. ICDT'01
- A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. ICDE'01
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96
- X. Yin, J. Han, and P. S. Yu. "LinkClus: Efficient Clustering via Heterogeneous Semantic Links". VLDB'06

Slides unused in class

A Typical K-Medoids Algorithm (PAM)

[Figure: K = 2 example on a 10x10 grid of points, Total Cost = 20]

- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Randomly select a non-medoid object, O_random
- Compute the total cost of swapping (Total Cost = 26)
- Swap O and O_random if the quality is improved
- Do loop until no change

    "AM ("artitioning Around Medoids)(164)

  • 5/17/2018 10ClusBasic

    77/101

    (164)

    P(M Ka)fman and >o)ssee)w, &DB, b)ilt in !pl)s Use real ob+ect to represent the cl)ster

    !elect krepresentative ob+ects arbitrarily

    "or each pair of non-selected ob+ect hand selectedob+ect i, calc)late the total swapping cost TCih "or each pair of iand h,

    If :"ihF %, iis replaced by h

    /hen assign each non-selected ob+ect to the most

    similar representative ob+ect

    repeat steps $- )ntil there is no change
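A naive sketch of the swap loop (exhaustive over all (i, h) pairs, so roughly O(k(n−k)²) per pass; the initialization and tie-breaking are arbitrary choices, not PAM's exact bookkeeping):

```python
import numpy as np

def pam(X, k, seed=0):
    """Greedy PAM-style k-medoids: swap while any swap lowers the total cost."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise
    medoids = list(rng.choice(len(X), k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                c = D[:, trial].min(axis=1).sum()
                if c < cost:               # i.e., TC_ih < 0: replace i by h
                    medoids, cost, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)  # assign to most similar medoid
    return medoids, labels
```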

    "AM Clustering: inding the !est ClusterCenter

  • 5/17/2018 10ClusBasic

    78/101

    4

    Center

    Case &* p c)rrently belongs to o+' If o+is replaced by orandomas arepresentative ob+ect and p is the closest to one of the otherrepresentative ob+ect oi, then p is reassigned to oi

What Is the Problem with PAM?

- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well for large data sets
  - O(k(n-k)²) for each iteration, where n is # of data, k is # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications) (1990)

- CLARA (Kaufmann and Rousseeuw in 1990)
- Built in statistical analysis packages, such as SPlus
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM

CLARANS ("Randomized" CLARA) (1994)

- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- Draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If the local optimum is found, it starts with a new randomly selected node in search for a new local optimum
- Advantages: more efficient and scalable than both PAM and CLARA
- Further improvement: focusing techniques and spatial access structures (Ester et al.'95)

ROCK: Clustering Categorical Data

- ROCK: RObust Clustering using linKs
  - S. Guha, R. Rastogi & K. Shim, ICDE'99
- Major ideas
  - Use links to measure similarity/proximity
  - Not distance-based
- Algorithm: sampling-based clustering
  - Draw random sample
  - Cluster with links
  - Label data in disk
- Experiments
  - Congressional voting, mushroom data

Similarity Measure in ROCK

- Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
- Example: two groups (clusters) of transactions
  - C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Jaccard coefficient may lead to a wrong clustering result
  - C1: 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
  - C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
- Jaccard coefficient-based similarity function:

  $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

- Ex. let T_1 = {a, b, c}, T_2 = {c, d, e}: Sim(T_1, T_2) = 1/5 = 0.2
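The similarity function is a one-liner in Python; this reproduces the worked example:

```python
def jaccard(t1, t2):
    """Jaccard coefficient of two transactions given as iterables of items."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({'a', 'b', 'c'}, {'c', 'd', 'e'}))   # 1/5 = 0.2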

[…]

- Repeatedly find groups of tightly related nodes, which are merged into a higher-level node
- Tightness of a group of nodes
- For a group of nodes …

Initializing a Tree with Frequent Pattern Mining

- Finding tight groups: frequent pattern mining
- Procedure of initializing a tree
  - Start from leaf nodes (level-0)
  - At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

[Figure: nodes reduced to higher-level groups g1 and g2]

Adjusting SimTree Structures

- After similarity changes, the tree structure also needs to be changed
  - If a node is more similar to its parent's sibling, then move it to be a child of that sibling
  - Try to move each node to its parent's sibling that it is most similar to, under the constraint that each parent node can have at most c children

[Figure: a node n7 with similarity 0.8 to its current parent and 0.9 to that parent's sibling is moved under the more similar parent]

Complexity

                              Time            Space
Updating similarities         O(M(logN)²)     O(M + N)
Adjusting tree structures     O(N)            O(N)
LinkClus                      O(M(logN)²)     O(M + N)
SimRank                       O(M²)           O(N²)

For two types of objects, N in each, and M linkages between them.

Experiment: Email Dataset

- F. Nielsen. Email dataset. www.imm.dtu.dk/~rem/data/Email-1431.zip
- 370 emails on conferences, 272 on jobs, and 789 spam emails
- Accuracy: measured by manually labeled data
- Accuracy of clustering: % of pairs of objects in the same cluster that share a common label
- Approaches compared
  - SimRank (Jeh & Widom)
  - SimRank with FingerPrints (F-SimRank): Fogaras & Rácz
  - ReCom (Wang et al.)

Approach      Accuracy    time (s)
LinkClus      0.8026      1579.6
SimRank       0.7965      39160
ReCom         0.5711      74.6
F-SimRank     0.3688      479.7
CLARANS       0.4768      8.55

WaveCluster: Clustering by Wavelet Analysis (1998)

- Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
- A multi-resolution clustering approach which applies wavelet transform to the feature space; both grid-based and density-based

The WaveCluster Algorithm

- How to apply the wavelet transform to find clusters
  - Summarizes the data by imposing a multidimensional grid structure onto the data space
  - These multidimensional spatial data objects are represented in an n-dimensional feature space
  - Apply wavelet transform on the feature space to find the dense regions in the feature space
  - Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
- Major features
  - Complexity O(N)
  - Detect arbitrary shaped clusters at different scales
  - Not sensitive to noise, not sensitive to input order
  - Only applicable to low dimensional data

Quantization & Transformation

- Quantize data into m-D grid structure, then wavelet transform
  - a) scale 1: high resolution
  - b) scale 2: medium resolution
  - c) scale 3: low resolution