# chapter dm:ii (continued) - bauhaus university, dm:ii (continued) ii.cluster analysis q cluster...

Post on 11-Apr-2018

218 views

Category:

## Documents

Embed Size (px)

TRANSCRIPT

• Chapter DM:II (continued)

II. Cluster Analysisq Cluster Analysis Basicsq Hierarchical Cluster Analysisq Iterative Cluster Analysisq Density-Based Cluster Analysisq Cluster Evaluationq Constrained Cluster Analysis

DM:II-270 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

DM:II-271 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

DM:II-272 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

target nameMichael Jordan

other names

DM:II-273 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

target nameMichael Jordan

other names

target nameMichael Jordan

(referent 1)

other names...

target nameMichael Jordan

(referent r)

other names

The basket ball player. The statistician.

DM:II-274 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

target nameMichael Jordan

other names

target nameMichael Jordan

(referent 1)

other names...

target nameMichael Jordan

(referent r)

other names

The basket ball player. The statistician.

q Multi-document resolution task:

Names, Target names: N = {n1, . . . , nl}, T NReferents: R = {r1, . . . , rm}, : R T , |R| |T |Documents: D = {d1, . . . , dn}, : D P(N), |(di) T | = 1

A solution: : D R, s.t. ((di)) (di)

DM:II-275 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

target nameMichael Jordan

other names

target nameMichael Jordan

(referent 1)

other names...

target nameMichael Jordan

(referent r)

other names

The basket ball player. The statistician.

q Facts about the Spock data mining challenge:

Target names: |T | = 44Referents: |R| = 1101Documents: |Dtrain| = 27 000 (labeled 2.3GB)

|Dtest | = 75 000 (unlabeled 7.8GB)

DM:II-276 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisPerson Resolution Task

target nameMichael Jordan

other names

target nameMichael Jordan

(referent 1)

other names...

target nameMichael Jordan

(referent r)

other names

The basket ball player. The statistician.

q Facts about the Spock data mining challenge:

Target names: |T | = 44Referents: |R| = 1101Documents: |Dtrain| = 27 000 (labeled 2.3GB)

|Dtest | = 75 000 (unlabeled 7.8GB)

0

25

50

75

100

0 5 10 15 20 25 30 35 40 45

# R

efer

ents

per

targ

et n

ame

Targetname

q up to 105 referentsfor a single target name

q about 25 referentson average per target name

0

50

100

150

200

250

# D

ocum

ents

per

refe

rent

0 200 400 600 800 1000 Referent

q about 23 documentson average per referent

DM:II-277 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-278 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-279 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-280 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-281 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-282 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisApplied to Multi-Document Resolution

Referent 1 Referent 2

1. Model similarities new and established retrieval models:

q global and context-based vector space modelsq explicit semantic analysisq ontology alignment

2. Learn class memberships (supervised) logistic regression

3. Find equivalence classes (unsupervised) cluster analysis:

(a) adaptive graph thinning(b) multiple, density-based cluster analysis(c) clustering selection by expected density maximization

DM:II-283 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisIdealized Class Membership Distribution over Similarities

• Constrained Cluster AnalysisMembership Distribution under tfidf Vector Space Model

• Constrained Cluster AnalysisMembership Distribution under Context-Based Vector Space Model

• Constrained Cluster AnalysisMembership Distribution under Ontology Alignment Model

• Constrained Cluster AnalysisIn-Depth: Multi-Class Hierarchical Classification

Flat (big-bang) classification

...

Hierarchical (top-down) classification

+ simple realization loss of discriminative power with

increasing number of categories

+ specialized classifiers(divide and conquer)

misclassification at higher levelscan never become repaired

DM:II-288 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisIn-Depth: Multi-Class Hierarchical Classification

State of the art of effectiveness analyses:

1. independence assumption between categories2. neglection of both hierarchical structure and degree of misclassification

(a)

(b)

Improvements:

q Consider similarity (Ci, Cj) between correct and wrong category.q Consider graph distance d(Ci, Cj) between correct and wrong category.

DM:II-289 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisIn-Depth: Multi-Class Hierarchical Classification

Improvements continued:

Multi-label (multi path) classification

p > p >

Multi-classifier (ensemble) classification

q traverse more than one path andreturn all labels

q employ probabilistic classifiers witha threshold: split a path or not

q classification result is a majoritydecision

q employ different classifier (differenttypes or differently parameterized)

DM:II-290 Cluster Analysis STEIN 2007-2018

• Constrained Cluster AnalysisMembership Distribution under Optimized Retrieval Model Combination

• Constrained Cluster AnalysisMembership Distribution under Optimized Retrieval Model Combination

• Constrained Cluster AnalysisMembership Distribution under Optimized Retrieval Model Combination

• Constrained Cluster AnalysisMembership Distribution under Optimized Retrieval Model Combination

• Constrained Cluster AnalysisMembership Distribution under Optimized Retrieval Model Combination

• Constrained Cluster AnalysisIn-Depth: Analysis of Classifier Effectiveness

• Constrained Cluster AnalysisIn-Depth: Analysis of Classifier Effectiveness

• Constrained Cluster AnalysisIn-Depth: Analysis of Classifier Effectiveness

Assumption: uniform distribution of referents over documents (here: 25 clusters with |C| = 23)

|TP| true 1-similarities per cluster (here: 130 @ threshold 0.725) |TP||C| degree of true positives per node (here: 11) |TP|( 1precision 1) false 1-similarities per cluster (here: 760)

Density-based cluster analysis: effective false positives, FP, connect to same cluster

analyze P (|FP| > k | D,Riid) (here: E(|FP

Recommended