Bayesian Networks
Martin Bachler, martin.bachler@igi.tugraz.at
MLA - VO, 06.12.2005

Page 1: Bayesian Networks Martin Bachler martin.bachler@igi.tugraz.at MLA - VO 06.12.2005

Bayesian Networks

Martin Bachler, martin.bachler@igi.tugraz.at

MLA - VO, 06.12.2005

Page 2:

Page 3:

Overview

• "Microsoft's competitive advantage lies in its expertise in Bayesian networks" (Bill Gates, quoted in the LA Times, 1996)

Page 4:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 5:

Definitions

• Conditional probability:

  P(A|B) = P(A,B) / P(B)
  P(B|A) = P(A,B) / P(A)

• Bayes' theorem:

  P(A|B) · P(B) = P(B|A) · P(A)

  P(A|B) = P(B|A) · P(A) / P(B)

Page 6:

Definitions

• Bayes' theorem:

  P(A|B) = P(B|A) · P(A) / P(B)

  – P(B|A) … likelihood
  – P(A) … prior probability
  – P(B) … normalization term
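As a quick numeric illustration (the numbers below are made up for this sketch, not taken from the slides), Bayes' theorem can be evaluated directly:

```python
# Made-up example: event A has a 1% prior; the likelihoods of observing B
# are 0.90 given A and 0.05 given not-A. Bayes' theorem yields P(A|B).
p_b_given_a = 0.90       # likelihood     P(B|A)
p_a = 0.01               # prior          P(A)
p_b_given_not_a = 0.05   # P(B|not A)

# Normalization term P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Despite the high likelihood, the small prior keeps the posterior low, which is exactly the role of the prior and normalization terms above.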

Page 7:

Definitions

• Classification problem
  – Input space X = X1 × X2 × … × Xn
  – Output space Y = {0,1}
  – Target concept C: X → Y
  – Hypothesis space H

• Bayesian way of classifying an instance x = (x1,…,xn):

  h(x1,…,xn) = argmax_{c∈Y} P(c | x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c) · P(c) / P(x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c) · P(c)

Page 8:

Definitions

• Theoretically OPTIMAL!

• For large n the estimation of P(x1,…,xn | c) is very hard!

• => Assumption: pairwise conditional independence between input variables given C:

  h(x1,…,xn) = argmax_{c∈Y} P(x1,…,xn | c) · P(c)

  P(xi, xj | C) = P(xi | C) · P(xj | C)   for i, j = 1,…,n; i ≠ j

Page 9:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 10:

Naive Bayes

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c) · P(c)

using

P(xi, xj | C) = P(xi | C) · P(xj | C),   i, j = 1,…,n; i ≠ j

=> P(x1, x2, …, xn | C) = P(x1 | C) · P(x2 | C) · … · P(xn | C) = ∏_{i=1}^n P(xi | C)
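The formula above translates almost line by line into code. A minimal sketch for binary features (function names are my own; no smoothing, so zero counts give zero probabilities, as in the slide example):

```python
from collections import defaultdict

def train_naive_bayes(data):
    """Estimate P(c) and P(x_i = 1 | c) from (x, c) pairs of binary feature tuples."""
    class_counts = defaultdict(int)   # N(c)
    feat_counts = defaultdict(int)    # N(x_i = 1, c)
    for x, c in data:
        class_counts[c] += 1
        for i, xi in enumerate(x):
            if xi == 1:
                feat_counts[(i, c)] += 1
    n_feats = len(data[0][0])
    priors = {c: n / len(data) for c, n in class_counts.items()}
    cond = {(i, c): feat_counts[(i, c)] / class_counts[c]
            for c in class_counts for i in range(n_feats)}
    return priors, cond

def classify(x, priors, cond):
    """h(x) = argmax_c P(c) * prod_i P(x_i | c)."""
    best, best_score = None, -1.0
    for c, p in priors.items():
        score = p
        for i, xi in enumerate(x):
            q = cond[(i, c)]          # P(x_i = 1 | c)
            score *= q if xi == 1 else (1 - q)
        if score > best_score:
            best, best_score = c, score
    return best
```

Training on the four examples of an OR-like concept, e.g. `[((0,0),0), ((0,1),1), ((1,0),1), ((1,1),1)]`, reproduces estimates like P(C=1) = 3/4 and P(x1=1|C=1) = 2/3 and classifies all four inputs correctly.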

Page 11:

Example

(Table: training examples (x1, x2) with class C and the estimated probabilities P(C), P(x1|C), P(x2|C); among the estimates: P(C=1) = 3/4, P(x1=1|C=1) = 2/3, P(x2=1|C=1) = 2/3.)

h(1,1) = argmax[ P(x1=1|C=1) · P(x2=1|C=1) · P(C=1),  P(x1=1|C=0) · P(x2=1|C=0) · P(C=0) ] = 1

h(1,0) = argmax[…, …] = 1
h(0,1) = argmax[…, …] = 1
h(0,0) = argmax[…, …] = 0

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c) · P(c)

Page 12:

Naive Bayes - Independence

• The independence assumption is very strict!

• For most practical problems it is blatantly wrong! (not even fulfilled in the previous example! …see later)

=> Is naive Bayes a rather "academic" algorithm?

Page 13:

Naive Bayes - Independence

• For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.)

• Guess: for problems for which the independence assumption holds

• Let's check… (empirically + theoretically)

Page 14:

Independence - Example

Concept: C = x1 ∨ x2

x1 x2 C | P(x1,x2|C) | P(x1|C)·P(x2|C) | P(x1|C) | P(x2|C)
 0  0 0 |     1      |        1        |    1    |    1
 0  0 1 |     0      |       1/9       |   1/3   |   1/3
 0  1 0 |     0      |        0        |    1    |    0
 0  1 1 |    1/3     |       2/9       |   1/3   |   2/3
 1  0 0 |     0      |        0        |    0    |    1
 1  0 1 |    1/3     |       2/9       |   2/3   |   1/3
 1  1 0 |     0      |        0        |    0    |    0
 1  1 1 |    1/3     |       4/9       |   2/3   |   2/3

Page 15:

Independence - Example: C = x1 ∨ x2

(Figure)

Page 16:

Independence - Example

Concept: C = x1 ⊕ x2 (XOR)

x1 x2 C | P(x1,x2|C) | P(x1|C)·P(x2|C) | P(x1|C) | P(x2|C)
 0  0 0 |    1/2     |       1/4       |   1/2   |   1/2
 0  0 1 |     0      |       1/4       |   1/2   |   1/2
 0  1 0 |     0      |       1/4       |   1/2   |   1/2
 0  1 1 |    1/2     |       1/4       |   1/2   |   1/2
 1  0 0 |     0      |       1/4       |   1/2   |   1/2
 1  0 1 |    1/2     |       1/4       |   1/2   |   1/2
 1  1 0 |    1/2     |       1/4       |   1/2   |   1/2
 1  1 1 |     0      |       1/4       |   1/2   |   1/2
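The dependence in the XOR example can be recomputed numerically. A small sketch (my own helper function; uniform distribution over the four inputs, as in the slide):

```python
from itertools import product

# All four inputs, labeled by the XOR concept C = x1 ^ x2.
data = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def cond_probs(data, c):
    """Return P(x1=1|C=c), P(x2=1|C=c) and the true joint P(x1,x2|C=c)."""
    rows = [x for x, label in data if label == c]
    n = len(rows)
    p1 = sum(x[0] for x in rows) / n          # P(x1 = 1 | C = c)
    p2 = sum(x[1] for x in rows) / n          # P(x2 = 1 | C = c)
    joint = {x: rows.count(x) / n for x in set(rows)}
    return p1, p2, joint

p1, p2, joint = cond_probs(data, 1)
# Both conditionals are 1/2, so the naive product is always 1/4,
# while the true joint is 1/2 on {(0,1), (1,0)} and 0 elsewhere.
print(p1, p2, joint)
```

This is exactly the gap between the P(x1|C)·P(x2|C) and P(x1,x2|C) columns of the table.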

Page 17:

Independence - Example: C = x1 ⊕ x2

(Figure)

Page 18:

Naive Bayes - Independence

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

Page 19:

Naive Bayes - Independence

• A measure of the degree of dependence between xi and xj given C, in terms of entropies H:

  D(xi, xj | C) = H(xi | C) + H(xj | C) - H(xi, xj | C)

Page 20:

Naive Bayes - Independence

• For which problems is naive Bayes optimal?

• Guess: for problems for which the independence assumption holds

• Empirical answer: not really…

• Theoretical answer?

Page 21:

Naive Bayes - optimality

• Example: 3 features x1, x2, x3

• P(c=0) = P(c=1)

• x1, x3 independent; x2 = x1 (totally dep.)

=> optimal classification:

  h_opt = sgn[ P(x1|1) · P(x3|1) - P(x1|0) · P(x3|0) ]
        = sgn[ P(1|x1) · P(1|x3) - P(0|x1) · P(0|x3) ]

naive Bayes:

  h_nb = sgn[ P(x1|1)² · P(x3|1) - P(x1|0)² · P(x3|0) ]
       = sgn[ P(1|x1)² · P(1|x3) - P(0|x1)² · P(0|x3) ]

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

Page 22:

Naive Bayes - optimality

• Let p = P(1|x1), q = P(1|x3)

• optimal:      h_opt = sgn[ p·q - (1-p)·(1-q) ]

• naive Bayes:  h_nb = sgn[ p²·q - (1-p)²·(1-q) ]

(Figure: over the (p, q) unit square, the independence assumption holds only on part of the square, and the optimal and naive classifiers disagree only in a small region.)
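The two decision rules can be compared numerically over the (p, q) unit square. A small sketch (the grid resolution is an arbitrary choice of mine):

```python
def h_opt(p, q):
    """Optimal rule: sgn[p*q - (1-p)*(1-q)], encoded as class 0/1."""
    return 1 if p * q - (1 - p) * (1 - q) > 0 else 0

def h_nb(p, q):
    """Naive Bayes rule: x1 is counted twice, hence the squares."""
    return 1 if p**2 * q - (1 - p)**2 * (1 - q) > 0 else 0

steps = 200
cells = [(i / steps, j / steps)
         for i in range(steps + 1) for j in range(steps + 1)]
disagree = sum(1 for p, q in cells if h_opt(p, q) != h_nb(p, q))
frac = disagree / len(cells)
print(f"classifiers disagree on {frac:.1%} of the grid")
```

The disagreement covers only a modest fraction of the square, matching the figure: naive Bayes is optimal far beyond the line on which the independence assumption actually holds.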

Page 23:

Naive Bayes - optimality

• In general: instance x = <x1,…,xn>

• Let

  p = P(1|x)
  r = P(1)/P(x) · ∏_{i=1}^n P(xi|1)
  s = P(0)/P(x) · ∏_{i=1}^n P(xi|0)

Theorem 1: A naive Bayesian classifier is optimal for x iff

  (p ≥ 1/2 and r ≥ s)   or   (p ≤ 1/2 and r ≤ s)

Page 24:

Naive Bayes - optimality

(Figure: the region of optimality of naive Bayes; the independence assumption holds only in a small part of it.)

Page 25:

Naive Bayes - optimality

• This is a criterion for local optimality (per instance)

• What about global optimality?

Theorem 2: The naive Bayesian classifier is globally optimal for a dataset S iff for every x ∈ S:

  (p_x ≥ 1/2 and r_x ≥ s_x)   or   (p_x ≤ 1/2 and r_x ≤ s_x)

Page 26:

Naive Bayes - optimality

• What is the reason for this?
  – Difference between classification and probability (distribution) estimation
  – I.e. for classification the perfect estimation of probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability.

• Problem with this result: how can global optimality (optimality for all instances) be verified?

Page 27:

Naive Bayes - optimality

• For which problems is naive Bayes optimal ?

• Guess:For problems for which the independence assumption holds

• Empirical answer: Not really….

• Theoretical answer no. 1: for all problems for which Theorem 2 holds.

Page 28:

Naive Bayes - linearity

• Another question: how does naive Bayes' hypothesis depend on the input variables?

• Consider the simple case of binary variables only…

• It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!

[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973

Page 29:

Naive Bayes - linearity

• Proof…
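A sketch of the standard argument (cf. [2]): write p_i = P(x_i = 1 | 1) and q_i = P(x_i = 1 | 0) for binary features. The log-odds of the two naive Bayes scores is then linear in x:

```latex
\log \frac{P(1)\,\prod_i p_i^{x_i}(1-p_i)^{1-x_i}}
          {P(0)\,\prod_i q_i^{x_i}(1-q_i)^{1-x_i}}
= \sum_i x_i \underbrace{\log\frac{p_i\,(1-q_i)}{q_i\,(1-p_i)}}_{w_i}
  \;+\; \underbrace{\log\frac{P(1)}{P(0)} + \sum_i \log\frac{1-p_i}{1-q_i}}_{b}
= \mathbf{w}\cdot\mathbf{x} + b
```

Naive Bayes predicts class 1 exactly when this log-odds is positive, i.e. when w·x + b > 0, which is a hyperplane in the input variables.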

Page 30:

Naive Bayes – linearity - examples

(Figures: decision boundaries found by naive Bayes and by a perceptron on example data.)

Page 31:

Naive Bayes – linearity - examples

Page 32:

Naive Bayes - linearity

• For boolean domains naive Bayes' hypothesis is a linear hyperplane!

=> It can only be globally optimal for linearly separable problems!

BUT: it is not optimal for all linearly separable problems! (e.g. not for certain m-out-of-n concepts)

Page 33:

Naive Bayes - optimality

• For which problems is naive Bayes optimal?

• Guess: for problems for which the independence assumption holds

• Empirical answer: not really…

• Theoretical answer no. 1: for all problems for which Theorem 2 holds.

• Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.

Page 34:

Naive Bayes - optimality

(Figure: Venn diagram; the class of concepts for which naive Bayes is optimal is contained in the class of concepts for which the perceptron is optimal.)

Page 35:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 36:

Bayesian networks

• The class of problems for which naive Bayes is optimal is quite small…

• Idea: relax the independence assumption to obtain a more general classifier

• I.e. model conditional dependencies between variables

• Different techniques (e.g. hidden variables,…)

• Most established: Bayesian networks

Page 37:

Bayesian networks

• Bayesian network:
  – tool for representing statistical dependencies between a set of random variables
  – acyclic directed graph
  – one vertex for each variable
  – for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices
  – unconnected variables (vertices) are independent!
  – each vertex has a table of local probability distributions

Page 38:

Bayesian networks

• Each variable depends only on its parents in the network!

(Figure: example network with class variable y and attributes x1,…,x5; the "parents" of x4 are denoted Pa4.)

P(xi | xl, Pai) = P(xi | Pai)   for all xl ∈ {x1,…,xn} \ ({xi} ∪ Pai)

Page 39:

Bayesian networks

Bayesian network based classifier:

(Figure: the same example network; x2 is a parent of x1, x3 of x4, and x4 of x5.)

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c, Pai) · P(c)

For the example network:

h(x1,…,xn) = argmax_y [ P(x1 | x2, y) · P(x2 | y) · P(x3 | y) · P(x4 | x3, y) · P(x5 | x4, y) · P(y) ]
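As an illustrative sketch (the CPT numbers below are made up and the helper names are my own), the factored score for this network structure can be evaluated directly:

```python
from itertools import product

# Parent structure of the example network (class y conditions every factor):
# x2 is a parent of x1, x3 of x4, and x4 of x5; x2 and x3 have no parents.
parents = {1: (2,), 2: (), 3: (), 4: (3,), 5: (4,)}

def bn_score(x, c, prior, cpt):
    """prod_i P(x_i | c, Pa_i) * P(c); cpt[(i, pa_vals, c)] = P(x_i = 1 | pa_vals, c)."""
    score = prior[c]
    for i, pa in parents.items():
        pa_vals = tuple(x[j] for j in pa)
        p1 = cpt[(i, pa_vals, c)]
        score *= p1 if x[i] == 1 else 1 - p1
    return score

def classify(x, prior, cpt):
    return max(prior, key=lambda c: bn_score(x, c, prior, cpt))

# Made-up CPTs: under c = 1 each attribute tends toward 1, under c = 0 toward 0.
prior = {0: 0.5, 1: 0.5}
cpt = {(i, vals, c): (0.7 if c == 1 else 0.3)
       for i, pa in parents.items()
       for vals in product([0, 1], repeat=len(pa))
       for c in (0, 1)}

x = {i: 1 for i in range(1, 6)}
print(classify(x, prior, cpt))
```

With real data, the CPT entries would be estimated from counts, just like the naive Bayes probabilities earlier; the only change is the extra conditioning on each variable's parents.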

Page 40:

Bayesian networks

• In the case of boolean attributes this is again linear, but not in the input variables:

• Linear in product features:

h(x) = argmax_{c∈Y} ∏_{i=1}^n P(xi | c, Pai) · P(c) = sgn[ Σ_{i=1}^n w_i · (xi · Pai1 · … · Paimi) + b ]

Page 41:

Bayesian networks

• The difficulty here is to estimate the correct network structure (and probability parameters) from training data!

• For general Bayesian networks this problem is NP-hard!

• There exist numerous heuristics for learning Bayesian networks from data!

Page 42:

References

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973