THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY
CSIT 5220: Reasoning and Decision under Uncertainty
L10: Model-Based Classification and Clustering
Nevin L. Zhang
Room 3504, phone: 2358-7015
Email: [email protected]
Probabilistic Models (PMs) for Classification
PMs for Clustering
Classification
The problem:
Given data:
Find mapping (A1, A2, …, An) → C
Possible solutions
ANN
Decision tree (Quinlan)
…
(SVM: Continuous data)
Probabilistic Approach to Classification
Will Boss Play Tennis?
Bayesian Networks for Classification
The Naïve Bayes model often performs well in practice.
Drawback of Naïve Bayes: it assumes the attributes are mutually independent given the class variable.
This assumption is often violated, which leads to double counting of evidence.
Fixes: General BN classifiers
Tree-augmented Naïve Bayes (TAN) models
…
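To illustrate the Naïve Bayes independence assumption and the double counting it can cause, here is a minimal Python sketch. The probability tables, the attribute names (`outlook`, `outlook_copy`) and the helper `nb_posterior` are hypothetical, introduced only for illustration; they are not taken from the lecture data.

```python
from math import prod

# Hypothetical parameters (not from the lecture): prior P(C) and P(A=a | C).
prior = {"yes": 0.5, "no": 0.5}
cond = {
    "outlook": {"yes": {"sunny": 0.2, "rain": 0.8},
                "no": {"sunny": 0.6, "rain": 0.4}},
}

def nb_posterior(prior, cond, instance):
    """Return P(C | instance) under the Naive Bayes factorization
    P(C, A1..An) = P(C) * prod_i P(Ai | C)."""
    scores = {
        c: prior[c] * prod(cond[a][c][v] for a, v in instance.items())
        for c in prior
    }
    z = sum(scores.values())  # normalizing constant
    return {c: s / z for c, s in scores.items()}

# The attribute is observed once.
single = nb_posterior(prior, cond, {"outlook": "sunny"})

# The same evidence is entered twice through a perfectly correlated copy.
cond["outlook_copy"] = cond["outlook"]
double = nb_posterior(prior, cond, {"outlook": "sunny", "outlook_copy": "sunny"})

print(single["no"], double["no"])
```

With these made-up numbers the posterior for "no" moves from 0.75 to 0.9, although the copied attribute carries no new information.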
Bayesian Networks for Classification
General BN classifier: Treat the class variable as just another variable.
Learn a BN.
Classify the next instance based on the values of the variables in the Markov blanket of the class variable.
Often performs poorly: only the attributes in the Markov boundary of the class variable affect the prediction, so information in the remaining attributes is not used.
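The Markov blanket of the class variable consists of its parents, its children, and the other parents of its children. A minimal sketch of how it could be read off a learned structure follows; the DAG and the function name `markov_blanket` are hypothetical, chosen for illustration.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG given as {node: set of parents}:
    its parents, its children, and the other parents of its children."""
    children = {n for n, ps in parents.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= parents[child]
    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)  # the target is not part of its own blanket
    return blanket

# Hypothetical DAG: A1 -> C -> A2 <- A3, and A3 -> A4.
dag = {
    "A1": set(),
    "C": {"A1"},
    "A3": set(),
    "A2": {"C", "A3"},
    "A4": {"A3"},
}
print(sorted(markov_blanket(dag, "C")))
```

In this example A4 lies outside the blanket, so its observed value would not influence the predicted class.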
Bayesian Networks for Classification
Tree-Augmented Naïve Bayes (TAN) model: Capture dependence among attributes using a tree structure.
During learning:
First learn a tree among the attributes using the Chow-Liu algorithm. This is a special structure learning problem and is easy.
Then add the class variable and estimate the parameters.
Classification: arg max_c P(C=c|A1=a1, …, An=an), computed by BN inference.
Many other methods
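The TAN structure implies a simple factorization, sketched here in notation introduced for illustration: \pi(i) is the index of the tree parent of attribute A_i, and the root attribute A_r has only the class variable as its parent.

```latex
P(C, A_1, \ldots, A_n) = P(C)\, P(A_r \mid C) \prod_{i \neq r} P(A_i \mid A_{\pi(i)}, C)
\qquad\Longrightarrow\qquad
\arg\max_c P(C = c \mid a_1, \ldots, a_n)
  = \arg\max_c P(c)\, P(a_r \mid c) \prod_{i \neq r} P(a_i \mid a_{\pi(i)}, c)
```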
Chow-Liu Trees
Task: Find a tree model over the observed variables that has maximum likelihood given the data.
Maximized loglikelihood
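The maximized loglikelihood of a tree model has a standard closed form, sketched below. The notation is introduced here for illustration: N is the number of data cases D, T is the edge set of the tree over the observed variables X_1, …, X_n, and \hat{I}, \hat{H}, \hat{P} are the empirical mutual information, entropy and distribution.

```latex
\max_{\theta} \log P(D \mid T, \theta)
  = N \sum_{(X_i, X_j) \in T} \hat{I}(X_i; X_j) \;-\; N \sum_{i=1}^{n} \hat{H}(X_i),
\qquad
\hat{I}(X; Y) = \sum_{x, y} \hat{P}(x, y) \log \frac{\hat{P}(x, y)}{\hat{P}(x)\,\hat{P}(y)}
```

Only the first sum depends on the choice of tree, so maximizing the likelihood amounts to choosing the edges with the largest total mutual information.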
Mutual Information
Chow-Liu Trees
Task is equivalent to finding maximum spanning tree of the following weighted and undirected graph:
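In the Chow-Liu construction, the weight of each edge in this graph is the empirical mutual information between its two variables. The sketch below computes those weights from a data set; the data rows and the function names `mutual_information` and `edge_weights` are hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    # (c/n) * log( (c/n) / ((pi/n) * (pj/n)) ) simplifies to the term below.
    return sum(
        (c / n) * log(c * n / (pi[x] * pj[y]))
        for (x, y), c in pij.items()
    )

def edge_weights(data):
    """Weight of every undirected edge (i, j), i < j, in the complete graph."""
    n_vars = len(data[0])
    return {(i, j): mutual_information(data, i, j)
            for i, j in combinations(range(n_vars), 2)}

# Hypothetical data: column 1 copies column 0, column 2 is independent of both.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
weights = edge_weights(data)
print(weights)
```

Here the edge between columns 0 and 1 gets weight log 2, while the two edges involving column 2 get weight 0.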
Maximum Spanning Trees
http://www.cs.cmu.edu/~guestrin/Class/15781/recitations/r10/11152007chowliu.pdf
Illustration of Kruskal’s Algorithm
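Kruskal's algorithm can be adapted to maximum spanning trees by scanning edges from heaviest to lightest and keeping each edge that does not close a cycle. The following is a minimal sketch under that assumption; the weights and the function name `maximum_spanning_tree` are hypothetical, chosen for illustration.

```python
def maximum_spanning_tree(n_nodes, weights):
    """Kruskal's algorithm adapted to maximum weight.
    `weights` maps an undirected edge (i, j) to its weight."""
    parent = list(range(n_nodes))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Consider edges from heaviest to lightest.
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so no cycle is created
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical mutual-information weights on four variables.
weights = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
           (1, 2): 0.5, (1, 3): 0.3, (2, 3): 0.4}
tree = maximum_spanning_tree(4, weights)
print(tree)
```

For these weights the selected edges are (0, 1), (1, 2) and (2, 3); every lighter edge would close a cycle and is skipped.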
L10: Probabilistic Models (PMs) for Classification and Clustering
Probabilistic Models (PMs) for Classification
PMs for Clustering
A Medical Application
In medical diagnosis, a gold standard sometimes exists.
Example: Lung Cancer
Symptoms: persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc.
Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum.
Gold standard: biopsy, the removal of a small sample of tissue for examination under a microscope by a pathologist.
A Medical Application
Sometimes a gold standard does not exist.
Example: Rheumatoid Arthritis (RA)
Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc.
Information for diagnosis: symptoms, medical history, physical exam, lab tests including a test for rheumatoid factor.
(Rheumatoid factor is an antibody found in the blood of about 80 percent of adults with RA.)
No gold standard: None of the symptoms or their combinations is a clear-cut indicator of RA.
The presence or absence of rheumatoid factor does not by itself establish whether one has RA.
LC Analysis of Hannover Rheumatoid Arthritis Data
Class-specific probabilities
Cluster 1: “disease” free
Cluster 2: “back-pain type”
Cluster 3: “Joint type”
Cluster 4: “Severe type”