Chapter DM:II (continued) - Bauhaus University

Post on 11-Apr-2018

  • Chapter DM:II (continued)

    II. Cluster Analysis
    - Cluster Analysis Basics
    - Hierarchical Cluster Analysis
    - Iterative Cluster Analysis
    - Density-Based Cluster Analysis
    - Cluster Evaluation
    - Constrained Cluster Analysis

    DM:II-270 Cluster Analysis STEIN 2007-2018

  • Constrained Cluster Analysis: Person Resolution Task

    [Figure: documents that mention the target name "Michael Jordan" together with other names, grouped by referent (referent 1, ..., referent r); e.g. the basketball player, the statistician.]

    - Multi-document resolution task:

      Names, target names: $N = \{n_1, \ldots, n_l\}$, $T \subset N$
      Referents: $R = \{r_1, \ldots, r_m\}$, $\tau: R \to T$, $|R| \gg |T|$
      Documents: $D = \{d_1, \ldots, d_n\}$, $\nu: D \to \mathcal{P}(N)$, $|\nu(d_i) \cap T| = 1$

      A solution: $\gamma: D \to R$, s.t. $\tau(\gamma(d_i)) \in \nu(d_i)$
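
    The constraint on a solution can be made concrete with a small check. The following is a minimal sketch, not the lecture's implementation: it assumes dictionaries `tau` (referent to target name), `nu` (document to the set of names it contains) and `gamma` (document to assigned referent), and the toy documents, referent IDs and "other name" strings are hypothetical.

```python
from typing import Dict, Set


def is_valid_solution(
    gamma: Dict[str, str],    # document -> assigned referent (hypothetical IDs)
    tau: Dict[str, str],      # referent -> its target name
    nu: Dict[str, Set[str]],  # document -> names occurring in the document
    targets: Set[str],        # the set of target names
) -> bool:
    """Return True if every document contains exactly one target name and
    the referent assigned to it carries a name that occurs in the document."""
    for doc, names in nu.items():
        if len(names & targets) != 1:
            return False  # a document must contain exactly one target name
        referent = gamma.get(doc)
        if referent is None or tau.get(referent) not in names:
            return False  # unassigned document, or referent name not in document
    return True


# Toy data (hypothetical): two documents, two referents sharing one target name.
targets = {"Michael Jordan"}
tau = {"r1": "Michael Jordan", "r2": "Michael Jordan"}
nu = {
    "d1": {"Michael Jordan", "other name A"},
    "d2": {"Michael Jordan", "other name B"},
}
gamma = {"d1": "r1", "d2": "r2"}
print(is_valid_solution(gamma, tau, nu, targets))
```

    Note that the check only verifies admissibility; whether `d1` and `d2` really refer to different persons is what the cluster analysis has to decide.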

  • Constrained Cluster Analysis: Person Resolution Task

    - Facts about the Spock data mining challenge:

      Target names: $|T| = 44$
      Referents: $|R| = 1101$
      Documents: $|D_{train}| = 27\,000$ (labeled, 2.3 GB)
                 $|D_{test}| = 75\,000$ (unlabeled, 7.8 GB)

    [Figure: number of referents per target name, plotted over the target names]

    - up to 105 referents for a single target name
    - about 25 referents on average per target name

    [Figure: number of documents per referent, plotted over the referents]

    - about 23 documents on average per referent

  • Constrained Cluster Analysis: Applied to Multi-Document Resolution

    [Figure: documents grouped into Referent 1 and Referent 2]

    1. Model similarities → new and established retrieval models:
       - global and context-based vector space models
       - explicit semantic analysis
       - ontology alignment

    2. Learn class memberships (supervised) → logistic regression

    3. Find equivalence classes (unsupervised) → cluster analysis:
       (a) adaptive graph thinning
       (b) multiple, density-based cluster analysis
       (c) clustering selection by expected density maximization
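
    The three steps can be sketched end to end on toy data. This is an illustration under strong assumptions, not the approach used in the challenge: the only similarity model is a plain term-frequency cosine, the logistic weights `WEIGHTS` and `BIAS` are hand-picked rather than learned, and step 3 is replaced by thresholding plus connected components as a simplified stand-in for graph thinning and density-based clustering. All documents and names in the code are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Step 1: cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def membership_probability(sims, weights, bias) -> float:
    """Step 2: logistic model mapping similarity values to the probability
    that two documents refer to the same referent."""
    z = bias + sum(w * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-z))


def connected_components(nodes, edges):
    """Step 3 (stand-in): clusters are the connected components of the graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


# Hypothetical toy documents, all containing the same target name.
docs = {
    "d1": "jordan bulls basketball nba",
    "d2": "jordan basketball nba finals",
    "d3": "jordan statistics machine learning",
    "d4": "jordan machine learning berkeley",
}
vectors = {d: Counter(text.split()) for d, text in docs.items()}
WEIGHTS, BIAS, THRESHOLD = [8.0], -4.0, 0.5  # hand-picked, not learned

edges = []
for u, v in combinations(docs, 2):
    p = membership_probability([cosine(vectors[u], vectors[v])], WEIGHTS, BIAS)
    if p > THRESHOLD:  # keep only edges classified as "same referent"
        edges.append((u, v))

clusters = connected_components(list(docs), edges)
print(clusters)
```

    With several similarity models, `sims` would hold one value per model and the logistic regression would learn how to weight them.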

  • Constrained Cluster Analysis: Idealized Class Membership Distribution over Similarities

  • Constrained Cluster Analysis: Membership Distribution under tf-idf Vector Space Model

  • Constrained Cluster Analysis: Membership Distribution under Context-Based Vector Space Model

  • Constrained Cluster Analysis: Membership Distribution under Ontology Alignment Model

  • Constrained Cluster Analysis: In-Depth: Multi-Class Hierarchical Classification

    Flat (big-bang) classification:

    + simple realization
    − loss of discriminative power with increasing number of categories

    Hierarchical (top-down) classification:

    + specialized classifiers (divide and conquer)
    − misclassification at higher levels can never be repaired

  • Constrained Cluster Analysis: In-Depth: Multi-Class Hierarchical Classification

    State of the art of effectiveness analyses:

    1. independence assumption between categories
    2. neglect of both hierarchical structure and degree of misclassification

    [Figure: panels (a) and (b)]

    Improvements:

    - Consider similarity $\varphi(C_i, C_j)$ between correct and wrong category.
    - Consider graph distance $d(C_i, C_j)$ between correct and wrong category.
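
    As an illustration of the second improvement, the graph distance between the correct and the predicted category can be computed as the number of edges on the path through their lowest common ancestor. The sketch below assumes the hierarchy is a tree given as a child-to-parent dictionary; the category names are hypothetical.

```python
from typing import Dict, List, Optional


def path_to_root(category: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the categories from `category` up to the root, inclusive."""
    path = [category]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def graph_distance(ci: str, cj: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges between ci and cj in the category tree."""
    depth_j = {c: k for k, c in enumerate(path_to_root(cj, parent))}
    for steps_i, c in enumerate(path_to_root(ci, parent)):
        if c in depth_j:  # first shared ancestor is the lowest common one
            return steps_i + depth_j[c]
    raise ValueError("categories are not in the same tree")


# Hypothetical category hierarchy (child -> parent, root has parent None).
parent = {
    "root": None,
    "sports": "root",
    "science": "root",
    "basketball": "sports",
    "baseball": "sports",
    "statistics": "science",
}
print(graph_distance("basketball", "baseball", parent))
print(graph_distance("basketball", "statistics", parent))
```

    An error measure weighted by this distance penalizes confusing sibling categories less than confusing categories from different branches.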

  • Constrained Cluster Analysis: In-Depth: Multi-Class Hierarchical Classification

    Improvements continued:

    Multi-label (multi-path) classification:

    - traverse more than one path and return all labels
    - employ probabilistic classifiers with a threshold: split a path or not

    Multi-classifier (ensemble) classification:

    - classification result is a majority decision
    - employ different classifiers (different types or differently parameterized)
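
    Both ideas can be sketched in a few lines. The code below is a simplified illustration, assuming a hypothetical function `prob(parent, child)` that returns a classifier's probability for descending into `child`; in practice this probability would also depend on the document being classified. The hierarchy and scores are toy values.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def multi_path_labels(
    node: str,
    children: Dict[str, List[str]],
    prob: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Top-down traversal that follows every child whose probability exceeds
    the threshold and returns all leaf labels reached."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # leaf category: this is a label
    labels: List[str] = []
    for child in kids:
        if prob(node, child) > threshold:  # split the path if several qualify
            labels.extend(multi_path_labels(child, children, prob, threshold))
    return labels


def majority_vote(predictions: Sequence[str]) -> str:
    """Ensemble decision: the label predicted by most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical hierarchy and classifier scores for one document.
children = {
    "root": ["sports", "science"],
    "sports": ["basketball", "baseball"],
    "science": ["statistics"],
}
scores = {
    ("root", "sports"): 0.6,
    ("root", "science"): 0.55,
    ("sports", "basketball"): 0.9,
    ("sports", "baseball"): 0.2,
    ("science", "statistics"): 0.7,
}


def prob(parent_node: str, child: str) -> float:
    return scores[(parent_node, child)]


print(multi_path_labels("root", children, prob, threshold=0.5))
print(majority_vote(["basketball", "statistics", "basketball"]))
```

    Raising the threshold makes the traversal behave more like single-path top-down classification; lowering it returns more labels.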

  • Constrained Cluster Analysis: Membership Distribution under Optimized Retrieval Model Combination

  • Constrained Cluster Analysis: In-Depth: Analysis of Classifier Effectiveness

    Assumption: uniform distribution of referents over documents (here: 25 clusters with $|C| = 23$)

    - $|TP|$: true 1-similarities per cluster (here: 130 @ threshold 0.725)
    - degree of true positives per node, given by $|TP|$ and $|C|$ (here: 11)
    - $|TP| \cdot \left(\frac{1}{\text{precision}} - 1\right)$: false 1-similarities per cluster (here: 760)
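
    The expression for the false 1-similarities follows from the definition precision = TP / (TP + FP), solved for FP. The short sketch below replays this with the numbers from the slide; the precision value is not stated in the source and is only the value implied by 130 true and 760 false 1-similarities.

```python
def false_positives(tp: float, precision: float) -> float:
    """FP = TP * (1/precision - 1), obtained from precision = TP / (TP + FP)."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return tp * (1.0 / precision - 1.0)


tp, fp = 130, 760                   # per-cluster values quoted on the slide
implied_precision = tp / (tp + fp)  # derived here, not given in the source
print(round(implied_precision, 3))
print(round(false_positives(tp, implied_precision)))
```
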

    Density-based cluster analysis: effective false positives, FP, connect to same cluster

    analyze P (|FP| > k | D,Riid) (here: E(|FP