Understanding Data Clustering and the K-Means Family of Algorithms

Uploaded by: hosyky

Posted on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Understanding Data Clustering and the K-Means Family of Algorithms

    1/23

    UNDERSTANDING DATA CLUSTERING

    AND THE K-MEANS FAMILY OF ALGORITHMS


    DATA CLUSTERING

    Data clustering is a task in data mining.

    Clustering lets us organize data systematically so that it is no longer scattered.

    For a large, scattered database, clustering is essential and all but indispensable.


    PURPOSE OF CLUSTERING

    The purpose of data clustering is to discover the structure of the data and so form data sets out of large groups of data.


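    REQUIREMENTS OF DATA CLUSTERING

    Data clustering arranges the data so that objects within a cluster are similar to one another, while objects in different clusters are dissimilar.

    The degree of similarity is defined by the user. It is determined from the attributes that describe each object, and is usually measured as the distance between objects.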


    REQUIREMENTS OF DATA CLUSTERING

    + Ability to cluster incrementally, independent of the order of the input data

    + Ability to handle high-dimensional data

    + Ability to cluster under constraints

    + Interpretability and usability


    CLASSIFICATION OF CLUSTERING METHODS

    + Partitioning: partitions are constructed and evaluated according to some criterion.

    + Hierarchical: the set of data/objects is decomposed hierarchically according to some criterion.

    + Density-based: based on connectivity and density functions.

    + Grid-based: based on a multiple-level granularity structure.

    + Model-based: a model is hypothesized for each cluster, and its parameters are then adjusted so that the model best fits the data/objects of that cluster.


    METHODS FOR EVALUATING DATA CLUSTERING

    + External validation

      - Evaluates the clustering result against a structure specified in advance for the data set.

      - Measures: Rand statistic, Jaccard coefficient, Fowlkes and Mallows index.

    + Internal validation

      - Evaluates the clustering result using quantities computed from the vectors of the data set itself (the proximity matrix).

      - Measures: Hubert's Γ statistic, Silhouette index, Dunn's index.

    + Relative validation

      - Evaluates the clustering result by comparing it with other clustering results obtained under different parameter settings.

    Criteria for evaluating and choosing the best clustering result:

    - Compactness: objects within a cluster should be close to one another.

    - Separation: clusters should be far from one another.
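    The external measures named above are all built from counts of object pairs. The sketch below shows one common way to compute them in Python. The function names and the two label lists are made up for illustration and do not come from the slides.

```python
from itertools import combinations
from math import sqrt


def pair_counts(reference, clustering):
    """Classify every pair of objects by whether the two labelings agree.

    a: same reference group and same cluster
    b: same reference group, different clusters
    c: different reference groups, same cluster
    d: different reference groups and different clusters
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_cluster = clustering[i] == clustering[j]
        if same_ref and same_cluster:
            a += 1
        elif same_ref:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_statistic(reference, clustering):
    a, b, c, d = pair_counts(reference, clustering)
    return (a + d) / (a + b + c + d)


def jaccard_coefficient(reference, clustering):
    a, b, c, _ = pair_counts(reference, clustering)
    return a / (a + b + c)


def fowlkes_mallows_index(reference, clustering):
    a, b, c, _ = pair_counts(reference, clustering)
    return a / sqrt((a + b) * (a + c))


reference = [0, 0, 0, 1, 1, 1]    # made-up pre-specified structure
clustering = [0, 0, 1, 1, 1, 1]   # made-up clustering result
print(pair_counts(reference, clustering))
print(rand_statistic(reference, clustering))
print(jaccard_coefficient(reference, clustering))
print(fowlkes_mallows_index(reference, clustering))
```

    All three measures equal 1 when the clustering reproduces the reference structure exactly. The sketch does not guard against degenerate inputs where a denominator is zero.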


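    METHODS FOR EVALUATING DATA CLUSTERING

    Evaluation by entropy (the value is small when the clustering quality is good):

    $$
    \mathrm{Entropy}(I) = -\sum_{i} p_i \sum_{j} \frac{p_{ij}}{p_i} \log\!\left(\frac{p_{ij}}{p_i}\right) = -\sum_{i} \frac{n_i}{n} \sum_{j} \frac{n_{ij}}{n_i} \log\!\left(\frac{n_{ij}}{n_i}\right)
    $$

    Here n is the total number of objects, n_i is the number of objects in cluster i, and n_ij is the number of objects in cluster i that belong to class j, so that p_i = n_i/n and p_ij = n_ij/n.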
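    As a rough illustration of the entropy measure, the sketch below computes the size-weighted entropy -Σ_i (n_i/n) Σ_j (n_ij/n_i) log(n_ij/n_i) from a table of counts. The table layout (one row per cluster, one column per class), the natural logarithm, and the function name are assumptions made for this example.

```python
import math


def clustering_entropy(counts):
    """Weighted entropy of a clustering.

    counts[i][j] is the number of objects in cluster i that belong to
    class j (an assumed layout). Terms with a zero count contribute nothing.
    """
    n = sum(sum(row) for row in counts)
    total = 0.0
    for row in counts:
        n_i = sum(row)
        for n_ij in row:
            if n_ij > 0:
                ratio = n_ij / n_i
                total -= (n_i / n) * ratio * math.log(ratio)
    return total


pure = [[5, 0], [0, 5]]    # every cluster holds a single class
mixed = [[2, 2], [2, 2]]   # every cluster is an even mix of both classes
print(clustering_entropy(pure))
print(clustering_entropy(mixed))
```

    The pure table gives an entropy of zero, while the evenly mixed table gives log 2, which matches the remark that small values indicate good clustering quality.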


    ISSUES TO BE ADDRESSED

    Representing data types

    + We are only concerned with the data types that are needed for clustering.

    + We define d(i,j) as the distance between two objects i and j, with the properties:

      d(i,j) ≥ 0

      d(i,i) = 0

      d(i,j) = d(j,i)

      d(i,j) ≤ d(i,k) + d(k,j), where k is any point other than i and j.
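    One way to get a feel for these four properties is to test a candidate distance function on a handful of sample points. The checker below is only a sketch: passing it on a sample does not prove the properties in general, and the function name and tolerance are assumptions.

```python
from itertools import product


def satisfies_distance_properties(points, d, tol=1e-12):
    """Check non-negativity, d(i,i) = 0, symmetry and the triangle
    inequality for every combination of the given sample points."""
    for i in points:
        if abs(d(i, i)) > tol:
            return False
    for i, j in product(points, repeat=2):
        if d(i, j) < 0 or abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True


def absolute_difference(a, b):
    return abs(a - b)


def squared_difference(a, b):
    return (a - b) ** 2


sample = [0.0, 1.0, 2.0, 4.5]
print(satisfies_distance_properties(sample, absolute_difference))
print(satisfies_distance_properties(sample, squared_difference))
```

    The squared difference fails because it violates the triangle inequality: for the points 0, 1 and 2, d(0,2) = 4 exceeds d(0,1) + d(1,2) = 2.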


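    ISSUES TO BE ADDRESSED

    Objects i and j are represented by the vectors x and y. The similarity between i and j is computed with the formula below.

    x = (x_1, …, x_p)

    y = (y_1, …, y_p)

    $$
    s(x, y) = \frac{x_1 y_1 + \dots + x_p y_p}{\sqrt{x_1^2 + \dots + x_p^2}\,\sqrt{y_1^2 + \dots + y_p^2}}
    $$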
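    The similarity formula above is the cosine of the angle between the two vectors. A minimal Python sketch follows. The function name and the handling of zero vectors are assumptions, since the slide does not say what to do when a vector has zero length.

```python
import math


def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (|x| * |y|) for two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))         # orthogonal vectors
```

    Vectors that point in the same direction score close to 1 regardless of their lengths, and orthogonal vectors score 0.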


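    ISSUES TO BE ADDRESSED

    Interval-scaled variables/attributes

    + Deviation

    + Distance

    + Z-score measurement

    $$
    s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)
    $$

    $$
    m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)
    $$

    $$
    z_{if} = \frac{x_{if} - m_f}{s_f}
    $$

    Here m_f is the mean of attribute f, s_f is its mean absolute deviation, and z_if is the standardized value (z-score) of object i on attribute f.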
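    The standardization above can be sketched in a few lines of Python. It uses the mean absolute deviation s_f = (1/n) Σ |x_if - m_f| rather than the standard deviation. The function name and sample values are illustrative only.

```python
def standardize(values):
    """Return the z-scores z_if = (x_if - m_f) / s_f of one attribute,
    where m_f is the mean and s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are equal, so the z-score is undefined")
    return [(x - m_f) / s_f for x in values]


heights = [2.0, 4.0, 6.0, 8.0]   # made-up attribute values
print(standardize(heights))
```

    For these values the mean is 5 and the mean absolute deviation is 2, so the attribute is rescaled to a unit-free range before distances are computed.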


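    ISSUES TO BE ADDRESSED

    Formulas for distance measures

    + Minkowski distance (q ≥ 1):

    $$
    d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}
    $$

    + Manhattan distance (Minkowski with q = 1):

    $$
    d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|
    $$

    + Euclidean distance (Minkowski with q = 2):

    $$
    d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}
    $$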
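    The three distance measures are closely related: Manhattan and Euclidean distance are the Minkowski distance with exponent 1 and 2 respectively. The sketch below expresses that relationship directly. The parameter name q and the function names are assumptions for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance (sum_f |x_f - y_f|^q)^(1/q) for q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


origin = (0.0, 0.0)
point = (3.0, 4.0)
print(manhattan(origin, point))
print(euclidean(origin, point))
```

    For the points (0, 0) and (3, 4), the Manhattan distance is the sum of the coordinate differences (7), while the Euclidean distance is the straight-line length (5).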


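    ISSUES TO BE ADDRESSED

    Binary variables/attributes

    Contingency table for two objects i and j:

                        Obj j
                        1       0       sum
        Obj i   1       a       b       a+b
                0       c       d       c+d
                sum     a+c     b+d     p

    Simple matching coefficient (if the attributes are symmetric):

    $$
    d(i,j) = \frac{b + c}{a + b + c + d}
    $$

    Jaccard coefficient (if the attributes are asymmetric):

    $$
    d(i,j) = \frac{b + c}{a + b + c}
    $$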
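    A short sketch of the two binary dissimilarities, assuming each object is given as a list of 0/1 values of equal length. The function names and the two sample objects are invented for the example.

```python
def contingency(x, y):
    """Count attribute positions by value pair: a = (1,1), b = (1,0),
    c = (0,1), d = (0,0), following the contingency table layout."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching_dissimilarity(x, y):
    """For symmetric binary attributes: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """For asymmetric binary attributes: (b + c) / (a + b + c).
    Joint absences (d) are ignored."""
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 1, 0, 0]
print(contingency(obj_i, obj_j))
print(simple_matching_dissimilarity(obj_i, obj_j))
print(jaccard_dissimilarity(obj_i, obj_j))
```

    The Jaccard value is larger here because the three positions where both objects are 0 are not counted as agreement.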


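    ISSUES TO BE ADDRESSED

    Variables/attributes of mixed types

    $$
    d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
    $$

    If x_if or x_jf is missing, then δ_ij^(f) = 0, so attribute f does not contribute.

    + f is binary (nominal): d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise.

    + f is interval-scaled: use the Minkowski, Manhattan, or Euclidean distance.

    + f is ordinal or ratio-scaled: compute the ranks r_if and

    $$
    z_{if} = \frac{r_{if} - 1}{M_f - 1}
    $$

      where M_f is the number of ranks of attribute f, then treat z_if as interval-scaled.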
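    The mixed-type formula averages per-attribute dissimilarities over the attributes that are present in both objects. The sketch below is one possible reading of it. The schema format, the use of None for a missing value, and the division of interval attributes by an assumed value range (to keep each term between 0 and 1) are choices made for this example, not details given on the slide.

```python
def attribute_dissimilarity(x_f, y_f, kind, param):
    """Dissimilarity d_ij^(f) for a single attribute f."""
    if kind == "nominal":
        return 0.0 if x_f == y_f else 1.0
    if kind == "interval":
        # param is the attribute's value range (max - min): an assumed normalization
        return abs(x_f - y_f) / param
    if kind == "ordinal":
        # param is M_f, the number of ranks; values are ranks 1..M_f
        z_x = (x_f - 1) / (param - 1)
        z_y = (y_f - 1) / (param - 1)
        return abs(z_x - z_y)
    raise ValueError("unknown attribute kind: " + kind)


def mixed_dissimilarity(x, y, schema):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f, where delta_f = 0
    when either value is missing (None) and 1 otherwise."""
    numerator = 0.0
    denominator = 0.0
    for x_f, y_f, (kind, param) in zip(x, y, schema):
        if x_f is None or y_f is None:
            continue
        numerator += attribute_dissimilarity(x_f, y_f, kind, param)
        denominator += 1.0
    if denominator == 0:
        raise ValueError("the objects share no observed attribute")
    return numerator / denominator


# Hypothetical schema: colour (nominal), weight (range 10), grade (3 ranks)
schema = [("nominal", None), ("interval", 10.0), ("ordinal", 3)]
obj_i = ("red", 2.0, 1)
obj_j = ("blue", 7.0, 3)
obj_k = ("blue", None, 3)   # weight is missing

print(mixed_dissimilarity(obj_i, obj_j, schema))
print(mixed_dissimilarity(obj_i, obj_k, schema))
```

    In the second call the missing weight is skipped, so the result is averaged over the two remaining attributes only.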


  • 8/3/2019 TM HIU GOM CM D LIU V H

    17/23

    SIGNIFICANCE OF CLUSTERING

    With clustering we can analyze and study each cluster of data in depth, discovering and retrieving hidden information that supports decision making.


    DATA CLUSTERING ALGORITHMS

    There are many data clustering algorithms; the most representative are the k-means algorithm and the hierarchical clustering algorithm.

    We will look at the K-Means algorithm for data clustering.


    K-MEANS ALGORITHM

    INPUT: A database of n objects and the number of clusters k.

    OUTPUT: The clusters Ci (i=1,..,k) such that the criterion function E reaches its minimum value.

    Step 1: Initialization

    Choose k objects mj (j=1...k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).

    Step 2: Compute distances

    For each object Xi (1
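    The transcript breaks off in the middle of Step 2, so the remaining steps are not available here. The sketch below fills the gap with the standard k-means iteration (assign each object to its nearest centroid, recompute each centroid as the mean of its cluster, stop when assignments no longer change). That continuation, the use of squared Euclidean distance, and the definition of E as the sum of squared distances to the assigned centroids are assumptions, as are all function and parameter names.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment, E) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k objects from the data set as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: compute distances and find the nearest centroid of each object.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Assumed stopping rule: no object changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Assumed update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Assumed criterion E: sum of squared distances to the assigned centroid.
    error = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, error


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]   # two made-up groups
centroids, assignment, error = k_means(points, k=2)
print(centroids)
print(assignment)
print(error)
```

    The random seed is fixed only to make the run repeatable. Different seeds can give different initial centroids and, on harder data, different final clusters.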


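    K-MEANS ALGORITHM

    The computational complexity is estimated as O(n.k.d.t.T), where:

    + n is the number of data objects

    + k is the number of clusters

    + d is the number of dimensions

    + t is the number of iterations

    + T is the time needed for one basic operation such as addition, subtraction, multiplication, or division.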


    K-MEANS ALGORITHM

    Advantages: K-Means is a simple clustering method, so it can be applied to large data sets.

    Disadvantages: K-Means applies only to data with numeric attributes and discovers only spherical clusters. It is also very sensitive to noise and outliers in the data, and it depends heavily on its input parameters.
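    The sensitivity to outliers comes from using the mean as the cluster centre. The small sketch below contrasts the mean with a medoid (the member that minimizes the total distance to the other members) on made-up one-dimensional data. It is an illustration of the effect only, not an implementation of any particular algorithm.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """The member of the cluster with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 plays the role of an outlier
print(mean(cluster))
print(medoid(cluster))
```

    The single outlier drags the mean far away from the bulk of the data, while the medoid stays among the typical values.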


    K-MEANS ALGORITHM

    If the initial centroids lie too far from the natural cluster centroids, the quality of the k-means result is very low; that is, the discovered clusters deviate greatly from the real clusters. In practice there is no optimal method for choosing the input parameters. The most common approach is to experiment with different values of k and then choose the best solution.
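    The trial-and-error approach described above can be sketched as follows: run k-means several times from different random starting centroids for each candidate k, and keep the run with the smallest error. To stay short the example works on one-dimensional data. The helper names, the number of restarts, and the squared-error criterion are assumptions.

```python
import random


def k_means_1d(values, k, rng, max_iter=100):
    """One k-means run on 1-D data; returns (centroids, squared error)."""
    centroids = rng.sample(values, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    error = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, error


def best_of_restarts(values, k, restarts=10, seed=0):
    """Repeat k-means with different random starts and keep the lowest error."""
    rng = random.Random(seed)
    runs = [k_means_1d(values, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]   # made-up data
errors = {k: best_of_restarts(values, k)[1] for k in range(1, 5)}
for k, error in errors.items():
    print(k, error)
```

    The error always shrinks as k grows, so the smallest error alone cannot pick k. A common heuristic is to look for the value of k after which the improvement becomes small.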


    K-MEANS ALGORITHM

    To date, many algorithms that inherit the idea of k-means have been applied in data mining to handle very large data sets, and they are used widely and effectively, such as k-medoid, PAM, CLARA, CLARANS, and k-prototypes.