bayesian approaches - queen's universityresearch.cs.queensu.ca/home/xiao/doc/dm/bayes.pdf ·...

30
Bayesian Approaches Data Mining Selected Technique Henry Xiao [email protected] School of Computing Queen’s University Henry Xiao CISC 873 Data Mining – p. 1/17

Upload: tranduong

Post on 14-Apr-2019

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian ApproachesData Mining Selected Technique

Henry Xiao

[email protected]

School of Computing

Queen’s University

Henry Xiao • CISC 873 Data Mining – p. 1/17

Page 2: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Probabilistic BasesReview the fundamentals of Probability Theory.

A is a Boolean-valued variable if A denotes an event, and there issome degree of uncertainty as to whether A occurs.

P (A) denotes the fraction of possible worlds in which A is true.

The axioms of general probability:

0 ≤ P (A) ≤ 1.

P (true) = 1.

P (false) = 0.

P (¬A) + P (A) = 1.

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Henry Xiao • CISC 873 Data Mining – p. 2/17

Page 3: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Probabilistic BasesMultivalued Random Variables.

A is a random variable with arity k if it can take on exactly onevalue out of v1, v2, . . . , vk.

P (A = vi ∩ A = vj) = 0 if i 6= j.

P (A = v1 ∪ A = v2 ∪ . . . A = vk) =∑k

i=1 P (A = vi) = 1.

P (B) =∑k

i=1 P (B ∩ A = vi).

Henry Xiao • CISC 873 Data Mining – p. 3/17

Page 4: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes RuleConditional Probability and Bayes Rule.

P (A|B) is a fraction of worlds in which A is true given B is true.

P (A|B) = P (A∩B)P (B) =⇒ P (A ∩ B) = P (A|B)P (B) (Chain Rule)

P (B|A) = P (A|B)P (B)P (A) (Bayes Rule)

P (B|A) = P (A|B)P (B)P (A|B)P (B)+P (A|¬B)P (¬B).

P (B|A ∩ C) = P (A|B∩C)P (B∩C)P (A∩C) .

Henry Xiao • CISC 873 Data Mining – p. 4/17

Page 5: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Density EstimationA Density Estimator M learns a mapping from a set of attributesto a Probability

Henry Xiao • CISC 873 Data Mining – p. 5/17

Page 6: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Density EstimationA Density Estimator M learns a mapping from a set of attributesto a Probability

Given a record x, M can tell you how likely the record is: P (x|M).

Henry Xiao • CISC 873 Data Mining – p. 5/17

Page 7: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Density EstimationA Density Estimator M learns a mapping from a set of attributesto a Probability

Given a record x, M can tell you how likely the record is: P (x|M).

Given a dataset with R records, M can tell you how likely thedataset is:P (dataset|M) = P (x1 ∩ x2 ∩ . . . xR|M) =

∏R

k=1 P (xk|M).

Joint Distribution Estimator.

Naïve Estimator.

Henry Xiao • CISC 873 Data Mining – p. 5/17

Page 8: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Density EstimationA Density Estimator M learns a mapping from a set of attributesto a Probability

Given a record x, M can tell you how likely the record is: P (x|M).

Given a dataset with R records, M can tell you how likely thedataset is:P (dataset|M) = P (x1 ∩ x2 ∩ . . . xR|M) =

∏R

k=1 P (xk|M).

Joint Distribution Estimator.

Naïve Estimator.

Problem:

The Joint Estimator just mirrors the training data - overfitting.

Naïve Estimator assumes each attribute is independent.

Henry Xiao • CISC 873 Data Mining – p. 5/17

Page 9: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 10: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Assume we want to predict Y which has arity n and valuesv1, v2, . . . , vn.

Assume there are m input attributes X1, X2, . . . , Xm.

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 11: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Assume we want to predict Y which has arity n and valuesv1, v2, . . . , vn.

Assume there are m input attributes X1, X2, . . . , Xm.

Break dataset into n smaller datasets DS1, DS2, . . . , DSn.

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 12: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Assume we want to predict Y which has arity n and valuesv1, v2, . . . , vn.

Assume there are m input attributes X1, X2, . . . , Xm.

Break dataset into n smaller datasets DS1, DS2, . . . , DSn.

Let DSi contains the records where Y = vi.

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 13: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Assume we want to predict Y which has arity n and valuesv1, v2, . . . , vn.

Assume there are m input attributes X1, X2, . . . , Xm.

Break dataset into n smaller datasets DS1, DS2, . . . , DSn.

Let DSi contains the records where Y = vi.

For each DSi, learn Density Estimator Mi to model the inputdistribution among the Y = vi records.

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 14: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBuild a Bayes Classifier.

Assume we want to predict Y which has arity n and valuesv1, v2, . . . , vn.

Assume there are m input attributes X1, X2, . . . , Xm.

Break dataset into n smaller datasets DS1, DS2, . . . , DSn.

Let DSi contains the records where Y = vi.

For each DSi, learn Density Estimator Mi to model the inputdistribution among the Y = vi records.

Mi estimates P (X1, X2, . . . , Xm|Y = vi).

Henry Xiao • CISC 873 Data Mining – p. 6/17

Page 15: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersWhen a new set of input values (X1 = u1, X2 = u2, . . . , Xm = um)

come to be evaluated, predict the value of Y.

Henry Xiao • CISC 873 Data Mining – p. 7/17

Page 16: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersWhen a new set of input values (X1 = u1, X2 = u2, . . . , Xm = um)

come to be evaluated, predict the value of Y.

Y predict = arg maxv P (X1 = u1, . . . , Xm = um|Y =

v) =⇒ Maximum Likelihood Estimation (MLE).

Henry Xiao • CISC 873 Data Mining – p. 7/17

Page 17: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersWhen a new set of input values (X1 = u1, X2 = u2, . . . , Xm = um)

come to be evaluated, predict the value of Y.

Y predict = arg maxv P (X1 = u1, . . . , Xm = um|Y =

v) =⇒ Maximum Likelihood Estimation (MLE).

Y predict = arg maxv P (Y = v|X1 = u1, . . . , Xm = um) =⇒ BayesEstimation (BE)

Henry Xiao • CISC 873 Data Mining – p. 7/17

Page 18: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersWhen a new set of input values (X1 = u1, X2 = u2, . . . , Xm = um)

come to be evaluated, predict the value of Y.

Y predict = arg maxv P (X1 = u1, . . . , Xm = um|Y =

v) =⇒ Maximum Likelihood Estimation (MLE).

Y predict = arg maxv P (Y = v|X1 = u1, . . . , Xm = um) =⇒ BayesEstimation (BE)

Posterior probability:

P (Y = v|X1 = u1, . . . , Xm = um)

=P (X1 = u1, . . . , Xm = um|Y = v)P (Y = v)

P (X1 = u1, . . . , Xm = um)

=P (X1 = u1, . . . , Xm = um|Y = v)P (Y = v)

∑ni=1 P (X1 = u1, . . . , Xm = um|Y = vi)P (Y = vi)

.

Henry Xiao • CISC 873 Data Mining – p. 7/17

Page 19: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersBayes Classifier Model.

Learn the distribution over inputs for each value Y *.

Calculate P (X1, X2, . . . , Xm|Y = vi)

Estimate P (Y = vi) as a fraction of records with Y = vi.

For a new prediction:Y predict = arg maxv P (X1 = u1, . . . , Xm = um|Y = v)P (Y = v).

* Learn the distribution is done by a Density Estimator here.

Henry Xiao • CISC 873 Data Mining – p. 8/17

Page 20: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Gaussian BayesClassifier

Gaussian Bayes Classifier Bases.

Generate the output by drawing yi ∼ Multinomial(p1, p2, . . . , pn)

Generate the inputs from a Gaussian PDF that depends on thevalue of yi: xi ∼ N(µi, Σi).

P (y = i|x) = p(x|y=i)p(y=i)p(x) .

P (y = i|x) =1

(2π)m/2‖Si‖1/2

e− 1

2(xk−µi)

T Si(xk−µi)pi

p(x)

where µi and Si is the mean and variance by taking MLEGaussian (µmle

i , Σmlei ).

Henry Xiao • CISC 873 Data Mining – p. 9/17

Page 21: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Gaussian BayesClassifier

Gaussian BC Discussion.

Number of parameters is quadratic with number of dimensionsgenerally because of the covariance matrix.

Normalize Gaussian may give O(1) covariance parameters.

Gaussian DE and BC inputs require to be Real-valued.

Gaussian Naïve or Gaussian Joint can accept both categoricaland real-valued inputs.

Henry Xiao • CISC 873 Data Mining – p. 10/17

Page 22: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayes ClassifiersMethods

Different Bayes Classifiers have been developed.

Probability

Density EstimationPDFs

MLE Gaussians

MLE of Gaussians

Bayes Classifiers

Gaussion BC

BC Road Map

Joint Density Bayes Classifier.

Naïve Bayes Classifier.

Gaussian Bayes Classifier.

Other joint Bayes Classifier possible.

Henry Xiao • CISC 873 Data Mining – p. 11/17

Page 23: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian NetworksIntroduction to Bayes Nets. (A Bayes Net Example)

A Bayes net (or Belief network) is an augmented directed acyclicgraph, represented by the vertex set V and directed edge set E.

There is no loops allowed. And each vertex v ∈ V represents arandom variable.

A probability distribution table indicating how the probability of thisvariable’s values depends on all possible combinations of parentalvalues.

Two variable vi and vj may still correlated even if they are notconnected.

Each variable vi is conditionally independent of allnon-descendants, given its parents.

Henry Xiao • CISC 873 Data Mining – p. 12/17

Page 24: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian NetworksBuilding a Bayes Net.

Henry Xiao • CISC 873 Data Mining – p. 13/17

Page 25: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian NetworksBuilding a Bayes Net.

Choose a set of relevant variables.

Choose an ordering for them.

Henry Xiao • CISC 873 Data Mining – p. 13/17

Page 26: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian NetworksBuilding a Bayes Net.

Choose a set of relevant variables.

Choose an ordering for them.

Assume the variables are X1, X2, . . . , Xn (where X1 is the first,Xi is the ith).

for i = 1 to n:

Add the Xi vertex to the network

Set Parent(Xi) to be a minimal subset of X1, . . . , Xi−1, suchthat we have conditional independence of Xi and all othermembers of X1, . . . , Xi−1 given Parents(Xi)

Define the probability table ofP (Xi = k | Assignments of Parent(Xi)).

Henry Xiao • CISC 873 Data Mining – p. 13/17

Page 27: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Bayesian NetworksBayesian Networks Discussion.

Independence and conditional independence.

Inference can be calculated.

Enumerating entries is exponential in the number of variables.

The stochastic simulation and likelihood weighting.

Henry Xiao • CISC 873 Data Mining – p. 14/17

Page 28: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

ConclusionTopics we have discussed here:

Probabilistic bases and Bayes Rule

Density Estimation

Bayes Model

Bayes Classifiers

Bayesian Networks Bayes, Thomas(1763)

Henry Xiao • CISC 873 Data Mining – p. 15/17

Page 29: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

Ending

Questions regarding Bayesian Approaches?

Information Site: http://www.cs.queensu.ca/home/xiao/dm.html

E-mail: [email protected]

Thank you

Henry Xiao • CISC 873 Data Mining – p. 16/17

Page 30: Bayesian Approaches - Queen's Universityresearch.cs.queensu.ca/home/xiao/doc/dm/Bayes.pdf · Bayesian Approaches Data Mining Selected Technique ... Bayesian Networks ... A Bayes net

A Bayes NetA Bayes net example (Back to Bayes Nets)

P (V4|¬V2) = 0.6

V1P (V1) = 0.3

V2

P (V4|V2) = 0.3P (V3|V1 ∩ V2) = 0.05P (V3|¬V1 ∩ ¬V2) = 0.2P (V3|V1 ∩ ¬V2) = 0.1

V3

P (V3|¬V1 ∩ V2) = 0.1

P (V5|V3) = 0.3P (V5|¬V3) = 0.8

V5

V4

P (V2) = 0.6

Henry Xiao • CISC 873 Data Mining – p. 17/17