THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY CSIT 5220: Reasoning and Decision under Uncertainty L10: Model-Based Classification and Clustering Nevin L. Zhang Room 3504, phone: 2358-7015, Email: [email protected] Home page

Upload: augustus-collins

Post on 25-Dec-2015

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY

CSIT 5220:  Reasoning and Decision under Uncertainty

L10: Model-Based Classification and Clustering

Nevin L. Zhang

Room 3504, phone: 2358-7015,

Email: [email protected]   Home page

L10: Model-Based Classification and Clustering

Probabilistic Models (PMs) for Classification

PMs for Clustering

Classification

The problem:

Given data:

Find mapping (A1, A2, …, An) → C

Possible solutions:

ANN

Decision tree (Quinlan)

(SVM: Continuous data)

Probabilistic Approach to Classification

Will Boss Play Tennis?

Bayesian Networks for Classification

The Naïve Bayes model often performs well in practice.

Drawback of Naïve Bayes: it assumes that the attributes are mutually independent given the class variable.

This assumption is often violated, which leads to double counting of evidence.

Fixes:

General BN classifiers

Tree-augmented Naïve Bayes (TAN) models
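The double-counting problem can be made concrete with a small sketch. The code below is an illustrative Naïve Bayes classifier for categorical attributes, not the course's implementation. The function names `train_nb` and `posterior`, the add-alpha smoothing, and the toy data are all assumptions introduced here. Duplicating an attribute adds no new information, yet the posterior becomes more extreme because the same evidence is multiplied in twice.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Estimate counts for P(C) and P(A_i | C) from categorical data."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[i][c][v] = number of class-c rows whose i-th attribute equals v
    value_counts = [defaultdict(Counter) for _ in range(n_attrs)]
    domains = [set() for _ in range(n_attrs)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha

def posterior(model, row):
    """Return P(C = c | row) for every class c under the Naive Bayes assumption."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total
        for i, v in enumerate(row):
            # add-alpha smoothed estimate of P(A_i = v | C = c)
            p *= (value_counts[i][c][v] + alpha) / (n_c + alpha * len(domains[i]))
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Hypothetical toy data: one attribute (outlook) and the class (play).
rows = [("sunny",)] * 3 + [("rain",)] + [("rain",)] * 3 + [("sunny",)]
labels = ["yes"] * 4 + ["no"] * 4

single = posterior(train_nb(rows, labels), ("sunny",))
# Same data with the attribute copied: a perfectly dependent second attribute.
doubled_rows = [r + r for r in rows]
doubled = posterior(train_nb(doubled_rows, labels), ("sunny", "sunny"))

print(single["yes"])   # about 0.667
print(doubled["yes"])  # about 0.8, although no new evidence was added
```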

Bayesian Networks for Classification

General BN classifier

Treat the class variable as just another variable.

Learn a BN.

Classify the next instance based on the values of the variables in the Markov blanket of the class variable.

This tends to perform rather poorly: the prediction depends only on the Markov boundary of the class variable, so the information in the remaining attributes is not used.
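To make the Markov blanket concrete: in a BN it consists of the node's parents, its children, and the other parents of those children. The sketch below computes it for a DAG stored as a child-to-parents dictionary. The function name `markov_blanket`, the data structure, and the example network are hypothetical and chosen only for illustration.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {variable: [its parents]}."""
    children = [v for v, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))   # parents of the node
    blanket.update(children)               # children of the node
    for child in children:                 # other parents of those children
        blanket.update(parents[child])
    blanket.discard(node)                  # the node is not in its own blanket
    return blanket

# Hypothetical learned structure: A1 -> C -> A2 <- A3, and A2 -> A4.
dag = {"A1": [], "A3": [], "C": ["A1"], "A2": ["C", "A3"], "A4": ["A2"]}

print(sorted(markov_blanket(dag, "C")))  # ['A1', 'A2', 'A3']
```

In this example A4 lies outside the blanket of C, so its observed value would not affect the predicted class.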

Bayesian Networks for Classification

Tree-Augmented Naïve Bayes (TAN) model

Captures dependence among the attributes using a tree structure.

During learning:

First learn a tree among the attributes using the Chow-Liu algorithm. This is a special structure learning problem, and an easy one.

Then add the class variable and estimate the parameters.

Classification: arg max_c P(C=c | A1=a1, …, An=an), computed by BN inference.

Many other methods exist.
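As a worked form of the classification rule, the factorization a TAN model implies can be written out as below. The notation $\mathrm{pa}_T(A_i)$ for the parent of $A_i$ in the attribute tree is an assumption introduced here (it is empty for the root attribute); the slide itself does not fix a notation.

```latex
% Joint distribution of a TAN model: the class C is a parent of every
% attribute, and each non-root attribute has one extra parent in the tree T.
P(C, A_1, \dots, A_n) = P(C) \prod_{i=1}^{n} P\bigl(A_i \mid \mathrm{pa}_T(A_i),\, C\bigr)

% With all attributes observed, the posterior is proportional to the joint,
% so the predicted class is
c^{*} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P\bigl(a_i \mid \mathrm{pa}_T(a_i),\, c\bigr)
```

Dropping the tree parents recovers the Naïve Bayes rule, which is why TAN can be read as a relaxation of the independence assumption.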

Chow-Liu Trees

Task: Find a tree model over the observed variables that has maximum likelihood given the data.

Maximized loglikelihood
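The formula on this slide did not survive in the transcript. The standard result it presumably refers to is sketched below, stated under the assumption of $N$ complete data cases, with mutual information $I$ and entropy $H$ computed from the empirical distribution; the symbols $T$, $E(T)$, and $\theta$ are introduced here for illustration.

```latex
% Loglikelihood of a tree T with edge set E(T), maximized over its parameters:
\max_{\theta} \log P(D \mid T, \theta)
  = N \sum_{(X_i, X_j) \in E(T)} I(X_i; X_j) \;-\; N \sum_{i} H(X_i)

% The entropy term does not depend on T, so the best tree satisfies
T^{*} = \arg\max_{T} \sum_{(X_i, X_j) \in E(T)} I(X_i; X_j)
```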

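Chow-Liu Trees

Mutual Information

The task is equivalent to finding a maximum spanning tree of a weighted, undirected graph whose nodes are the observed variables and whose edge weights are the empirical mutual information between the corresponding pairs of variables.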
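A minimal sketch of how the edge weights could be computed from data, assuming discrete variables stored as aligned lists of values. The function names `mutual_information` and `mi_graph` are hypothetical, and the result is in nats.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats from two aligned value lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(
        (nxy / n) * math.log(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in joint.items()
    )

def mi_graph(columns):
    """Weighted edge list (weight, i, j) over all pairs of variables."""
    return [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]

# Hypothetical data: X0 and X1 are identical, X2 is independent of both.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
for weight, i, j in mi_graph(columns):
    print(i, j, round(weight, 3))
```

Identical variables get weight ln 2 (their shared entropy), while independent ones get weight 0, so a maximum spanning tree prefers the strongly dependent pair.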

Maximum Spanning Trees

http://www.cs.cmu.edu/~guestrin/Class/15781/recitations/r10/11152007chowliu.pdf

Illustration of Kruskal’s Algorithm
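Since the illustration itself is not preserved here, the following is a short sketch of Kruskal's algorithm adapted to maximum spanning trees: edges are considered from heaviest to lightest, and an edge is kept only if it joins two different components. The function name `maximum_spanning_tree`, the `(weight, u, v)` edge format, and the example weights are assumptions made for this sketch.

```python
def maximum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm on edges given as (weight, u, v); returns chosen edges."""
    parent = list(range(n_nodes))  # union-find forest, one component per node

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edge first
        ru, rv = find(u), find(v)
        if ru != rv:              # edge connects two components: no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Hypothetical mutual-information weights on four variables.
edges = [(0.5, 0, 1), (0.1, 0, 2), (0.4, 1, 2), (0.3, 2, 3)]
print(maximum_spanning_tree(4, edges))
# [(0.5, 0, 1), (0.4, 1, 2), (0.3, 2, 3)]
```

The edge (0.1, 0, 2) is rejected because nodes 0 and 2 are already connected through node 1.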

L10: Probabilistic Models (PMs) for Classification and Clustering

Probabilistic Models (PMs) for Classification

PMs for Clustering

A Medical Application

In medical diagnosis, a gold standard sometimes exists.

Example: Lung Cancer

Symptoms: persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc.

Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum.

Gold standard: biopsy, the removal of a small sample of tissue for examination under a microscope by a pathologist.

A Medical Application

Sometimes a gold standard does not exist.

Example: Rheumatoid Arthritis (RA)

Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc.

Information for diagnosis: symptoms, medical history, physical exam, and lab tests, including a test for rheumatoid factor. (Rheumatoid factor is an antibody found in the blood of about 80 percent of adults with RA.)

No gold standard:

None of the symptoms or their combinations is a clear-cut indicator of RA.

Neither the presence nor the absence of rheumatoid factor settles whether one has RA.

LC Analysis of Hannover Rheumatoid Arthritis Data

Class specific probabilities

Cluster 1: “disease” free

Cluster 2: “back-pain type”

Cluster 3: “Joint type”

Cluster 4: “Severe type”
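For readers who want to see where class-specific probabilities come from, below is a minimal sketch of fitting a latent class model by EM, assuming "LC" stands for latent class analysis. It assumes binary attributes, uses made-up toy data rather than the Hannover data, and adds light smoothing in the M-step to avoid division by zero. The function name `fit_latent_class` and all parameter choices are hypothetical, and this is not the procedure used to produce the results on this slide.

```python
import math
import random

def fit_latent_class(data, k, iters=50, seed=0):
    """EM for a latent class model over binary attributes.

    Model: P(a_1..a_n) = sum_z P(z) * prod_i P(a_i | z).
    Returns (class_priors, class_specific_probs, loglik_history).
    """
    rng = random.Random(seed)
    n = len(data[0])
    prior = [1.0 / k] * k
    # theta[z][i] = P(A_i = 1 | Z = z), randomly initialised away from 0 and 1
    theta = [[rng.uniform(0.25, 0.75) for _ in range(n)] for _ in range(k)]
    history = []
    for _ in range(iters):
        # E-step: responsibilities P(z | row) for every row
        resp, loglik = [], 0.0
        for row in data:
            joint = [
                prior[z] * math.prod(
                    theta[z][i] if a else 1.0 - theta[z][i]
                    for i, a in enumerate(row)
                )
                for z in range(k)
            ]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)  # loglikelihood of the parameters before this update
        # M-step: re-estimate parameters from expected counts (light smoothing)
        for z in range(k):
            weight = sum(r[z] for r in resp)
            prior[z] = weight / len(data)
            for i in range(n):
                on = sum(r[z] for r, row in zip(resp, data) if row[i])
                theta[z][i] = (on + 1e-3) / (weight + 2e-3)
    return prior, theta, history

# Hypothetical toy data with two obvious symptom patterns.
toy = [(1, 1, 1, 0, 0, 0)] * 10 + [(0, 0, 0, 1, 1, 1)] * 10
prior, theta, history = fit_latent_class(toy, k=2)
print([round(p, 2) for p in prior])
print([[round(t, 2) for t in row] for row in theta])
```

Each row of `theta` plays the role of one cluster's class-specific probabilities; interpreting and naming the clusters, as done on the slide, remains a manual step.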