Bayesian Networks
Martin Bachler, martin.bachler@igi.tugraz.at
MLA - VO, 06.12.2005

Page 1: Bayesian Networks Martin Bachler martin.bachler@igi.tugraz.at MLA - VO 06.12.2005

Bayesian Networks

Martin Bachler, martin.bachler@igi.tugraz.at

MLA - VO, 06.12.2005

Page 2:

Page 3:

Overview

• "Microsoft's competitive advantage lies in its expertise in Bayesian networks" (Bill Gates, quoted in the LA Times, 1996)

Page 4:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 5:

Definitions

• Conditional probability:

  P(A|B) = P(A,B) / P(B)
  P(B|A) = P(A,B) / P(A)

• Bayes' theorem:

  P(A|B) · P(B) = P(B|A) · P(A)

  P(A|B) = P(B|A) · P(A) / P(B)

Page 6:

Definitions

• Bayes' theorem:

  P(A|B) = P(B|A) · P(A) / P(B)

  – P(B|A) … likelihood
  – P(A) … prior probability
  – P(B) … normalization term
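As a quick numeric illustration (the numbers below are made up for this sketch, not taken from the slides), Bayes' theorem can be evaluated directly:

```python
# Made-up example: event A has a 1% prior; the likelihoods of observing B
# are 0.90 given A and 0.05 given not-A. Bayes' theorem yields P(A|B).
p_b_given_a = 0.90       # likelihood     P(B|A)
p_a = 0.01               # prior          P(A)
p_b_given_not_a = 0.05   # P(B|not A)

# Normalization term P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Despite the high likelihood, the small prior keeps the posterior low, which is exactly the role of the prior and normalization terms above.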

Page 7:

Definitions

• Classification problem
  – Input space X = X1 × X2 × … × Xn
  – Output space Y = {0,1}
  – Target concept C: X → Y
  – Hypothesis space H

• Bayesian way of classifying an instance x = (x1,…,xn):

  h(x1,…,xn) = argmax_{c∈Y} P(c | x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c) · P(c) / P(x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c) · P(c)

Page 8:

Definitions

• Theoretically OPTIMAL!

• For large n the estimation of P(x1,…,xn | c) is very hard!

• => Assumption: pairwise conditional independence between input variables given C:

  h(x1,…,xn) = argmax_{c∈Y} P(x1,…,xn | c) · P(c)

  P(xi, xj | C) = P(xi | C) · P(xj | C)   for i, j = 1,…,n; i ≠ j

Page 9:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 10:

Naive Bayes

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c) · P(c)

using

P(xi, xj | C) = P(xi | C) · P(xj | C),   i, j = 1,…,n; i ≠ j

=> P(x1, x2, …, xn | C) = P(x1 | C) · P(x2 | C) · … · P(xn | C) = ∏_{i=1}^n P(xi | C)
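The formula above translates almost line by line into code. A minimal sketch for binary features (function names are my own; no smoothing, so zero counts give zero probabilities, as in the slide example):

```python
from collections import defaultdict

def train_naive_bayes(data):
    """Estimate P(c) and P(x_i = 1 | c) from (x, c) pairs of binary feature tuples."""
    class_counts = defaultdict(int)   # N(c)
    feat_counts = defaultdict(int)    # N(x_i = 1, c)
    for x, c in data:
        class_counts[c] += 1
        for i, xi in enumerate(x):
            if xi == 1:
                feat_counts[(i, c)] += 1
    n_feats = len(data[0][0])
    priors = {c: n / len(data) for c, n in class_counts.items()}
    cond = {(i, c): feat_counts[(i, c)] / class_counts[c]
            for c in class_counts for i in range(n_feats)}
    return priors, cond

def classify(x, priors, cond):
    """h(x) = argmax_c P(c) * prod_i P(x_i | c)."""
    best, best_score = None, -1.0
    for c, p in priors.items():
        score = p
        for i, xi in enumerate(x):
            q = cond[(i, c)]          # P(x_i = 1 | c)
            score *= q if xi == 1 else (1 - q)
        if score > best_score:
            best, best_score = c, score
    return best
```

Training on the four examples of an OR-like concept, e.g. `[((0,0),0), ((0,1),1), ((1,0),1), ((1,1),1)]`, reproduces estimates like P(C=1) = 3/4 and P(x1=1|C=1) = 2/3 and classifies all four inputs correctly.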

Page 11:

Example

(Table: training examples (x1, x2) with class C and the estimated probabilities P(C), P(x1|C), P(x2|C); among the estimates: P(C=1) = 3/4, P(x1=1|C=1) = 2/3, P(x2=1|C=1) = 2/3.)

h(1,1) = argmax[ P(x1=1|C=1) · P(x2=1|C=1) · P(C=1),  P(x1=1|C=0) · P(x2=1|C=0) · P(C=0) ] = 1

h(1,0) = argmax[…, …] = 1
h(0,1) = argmax[…, …] = 1
h(0,0) = argmax[…, …] = 0

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c) · P(c)

Page 12:

Naive Bayes - Independence

• The independence assumption is very strict!

• For most practical problems it is blatantly wrong! (not even fulfilled in the previous example! …see later)

=> Is naive Bayes a rather "academic" algorithm?

Page 13:

Naive Bayes - Independence

• For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.)

• Guess: for problems for which the independence assumption holds

• Let's check… (empirically + theoretically)

Page 14:

Independence - Example

Concept: C = x1 ∨ x2

x1 x2 C | P(x1,x2|C) | P(x1|C)·P(x2|C) | P(x1|C) | P(x2|C)
 0  0 0 |     1      |        1        |    1    |    1
 0  0 1 |     0      |       1/9       |   1/3   |   1/3
 0  1 0 |     0      |        0        |    1    |    0
 0  1 1 |    1/3     |       2/9       |   1/3   |   2/3
 1  0 0 |     0      |        0        |    0    |    1
 1  0 1 |    1/3     |       2/9       |   2/3   |   1/3
 1  1 0 |     0      |        0        |    0    |    0
 1  1 1 |    1/3     |       4/9       |   2/3   |   2/3

Page 15:

Independence - Example: C = x1 ∨ x2

(Figure)

Page 16:

Independence - Example

Concept: C = x1 ⊕ x2 (XOR)

x1 x2 C | P(x1,x2|C) | P(x1|C)·P(x2|C) | P(x1|C) | P(x2|C)
 0  0 0 |    1/2     |       1/4       |   1/2   |   1/2
 0  0 1 |     0      |       1/4       |   1/2   |   1/2
 0  1 0 |     0      |       1/4       |   1/2   |   1/2
 0  1 1 |    1/2     |       1/4       |   1/2   |   1/2
 1  0 0 |     0      |       1/4       |   1/2   |   1/2
 1  0 1 |    1/2     |       1/4       |   1/2   |   1/2
 1  1 0 |    1/2     |       1/4       |   1/2   |   1/2
 1  1 1 |     0      |       1/4       |   1/2   |   1/2
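The dependence in the XOR example can be recomputed numerically. A small sketch (my own helper function; uniform distribution over the four inputs, as in the slide):

```python
from itertools import product

# All four inputs, labeled by the XOR concept C = x1 ^ x2.
data = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def cond_probs(data, c):
    """Return P(x1=1|C=c), P(x2=1|C=c) and the true joint P(x1,x2|C=c)."""
    rows = [x for x, label in data if label == c]
    n = len(rows)
    p1 = sum(x[0] for x in rows) / n          # P(x1 = 1 | C = c)
    p2 = sum(x[1] for x in rows) / n          # P(x2 = 1 | C = c)
    joint = {x: rows.count(x) / n for x in set(rows)}
    return p1, p2, joint

p1, p2, joint = cond_probs(data, 1)
# Both conditionals are 1/2, so the naive product is always 1/4,
# while the true joint is 1/2 on {(0,1), (1,0)} and 0 elsewhere.
print(p1, p2, joint)
```

This is exactly the gap between the P(x1|C)·P(x2|C) and P(x1,x2|C) columns of the table.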

Page 17:

Independence - Example: C = x1 ⊕ x2

(Figure)

Page 18:

Naive Bayes - Independence

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

Page 19:

Naive Bayes - Independence

• A measure of the degree of dependence between xi and xj given C, in terms of entropies H:

  D(xi, xj | C) = H(xi | C) + H(xj | C) - H(xi, xj | C)

Page 20:

Naive Bayes - Independence

• For which problems is naive Bayes optimal?

• Guess: for problems for which the independence assumption holds

• Empirical answer: not really…

• Theoretical answer?

Page 21:

Naive Bayes - optimality

• Example: 3 features x1, x2, x3

• P(c=0) = P(c=1)

• x1, x3 independent; x2 = x1 (totally dep.)

=> optimal classification:

  h_opt = sgn[ P(x1|1) · P(x3|1) - P(x1|0) · P(x3|0) ]
        = sgn[ P(1|x1) · P(1|x3) - P(0|x1) · P(0|x3) ]

naive Bayes:

  h_nb = sgn[ P(x1|1)² · P(x3|1) - P(x1|0)² · P(x3|0) ]
       = sgn[ P(1|x1)² · P(1|x3) - P(0|x1)² · P(0|x3) ]

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

Page 22:

Naive Bayes - optimality

• Let p = P(1|x1), q = P(1|x3)

• optimal:      h_opt = sgn[ p·q - (1-p)·(1-q) ]

• naive Bayes:  h_nb = sgn[ p²·q - (1-p)²·(1-q) ]

(Figure: over the (p, q) unit square, the independence assumption holds only on part of the square, and the optimal and naive classifiers disagree only in a small region.)
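The two decision rules can be compared numerically over the (p, q) unit square. A small sketch (the grid resolution is an arbitrary choice of mine):

```python
def h_opt(p, q):
    """Optimal rule: sgn[p*q - (1-p)*(1-q)], encoded as class 0/1."""
    return 1 if p * q - (1 - p) * (1 - q) > 0 else 0

def h_nb(p, q):
    """Naive Bayes rule: x1 is counted twice, hence the squares."""
    return 1 if p**2 * q - (1 - p)**2 * (1 - q) > 0 else 0

steps = 200
cells = [(i / steps, j / steps)
         for i in range(steps + 1) for j in range(steps + 1)]
disagree = sum(1 for p, q in cells if h_opt(p, q) != h_nb(p, q))
frac = disagree / len(cells)
print(f"classifiers disagree on {frac:.1%} of the grid")
```

The disagreement covers only a modest fraction of the square, matching the figure: naive Bayes is optimal far beyond the line on which the independence assumption actually holds.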

Page 23:

Naive Bayes - optimality

• In general: instance x = <x1,…,xn>

• Let

  p = P(1|x)
  r = P(1)/P(x) · ∏_{i=1}^n P(xi|1)
  s = P(0)/P(x) · ∏_{i=1}^n P(xi|0)

Theorem 1: A naive Bayesian classifier is optimal for x iff

  (p ≥ 1/2 and r ≥ s)   or   (p ≤ 1/2 and r ≤ s)

Page 24:

Naive Bayes - optimality

(Figure: the region of optimality of naive Bayes; the independence assumption holds only in a small part of it.)

Page 25:

Naive Bayes - optimality

• This is a criterion for local optimality (per instance)

• What about global optimality?

Theorem 2: The naive Bayesian classifier is globally optimal for a dataset S iff for every x ∈ S:

  (p_x ≥ 1/2 and r_x ≥ s_x)   or   (p_x ≤ 1/2 and r_x ≤ s_x)

Page 26:

Naive Bayes - optimality

• What is the reason for this?
  – Difference between classification and probability (distribution) estimation
  – I.e. for classification the perfect estimation of probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability.

• Problem with this result: how can global optimality (optimality for all instances) be verified?

Page 27:

Naive Bayes - optimality

• For which problems is naive Bayes optimal ?

• Guess:For problems for which the independence assumption holds

• Empirical answer: Not really….

• Theoretical answer no. 1: for all problems for which Theorem 2 holds.

Page 28:

Naive Bayes - linearity

• Another question: how does naive Bayes' hypothesis depend on the input variables?

• Consider the simple case of binary variables only…

• It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!

[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973

Page 29:

Naive Bayes - linearity

• Proof…
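A sketch of the standard argument (cf. [2]): write p_i = P(x_i = 1 | 1) and q_i = P(x_i = 1 | 0) for binary features. The log-odds of the two naive Bayes scores is then linear in x:

```latex
\log \frac{P(1)\,\prod_i p_i^{x_i}(1-p_i)^{1-x_i}}
          {P(0)\,\prod_i q_i^{x_i}(1-q_i)^{1-x_i}}
= \sum_i x_i \underbrace{\log\frac{p_i\,(1-q_i)}{q_i\,(1-p_i)}}_{w_i}
  \;+\; \underbrace{\log\frac{P(1)}{P(0)} + \sum_i \log\frac{1-p_i}{1-q_i}}_{b}
= \mathbf{w}\cdot\mathbf{x} + b
```

Naive Bayes predicts class 1 exactly when this log-odds is positive, i.e. when w·x + b > 0, which is a hyperplane in the input variables.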

Page 30:

Naive Bayes – linearity - examples

(Figures: decision boundaries found by naive Bayes and by a perceptron on example data.)

Page 31:

Naive Bayes – linearity - examples

Page 32:

Naive Bayes - linearity

• For boolean domains naive Bayes' hypothesis is a linear hyperplane!

=> It can only be globally optimal for linearly separable problems!

BUT: it is not optimal for all linearly separable problems! (e.g. not for certain m-out-of-n concepts)

Page 33:

Naive Bayes - optimality

• For which problems is naive Bayes optimal?

• Guess: for problems for which the independence assumption holds

• Empirical answer: not really…

• Theoretical answer no. 1: for all problems for which Theorem 2 holds.

• Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.

Page 34:

Naive Bayes - optimality

(Figure: Venn diagram; the class of concepts for which naive Bayes is optimal is contained in the class of concepts for which the perceptron is optimal.)

Page 35:

Overview

• (Recap of) Definitions

• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?

• Bayesian networks

Page 36:

Bayesian networks

• The class of problems for which naive Bayes is optimal is quite small…

• Idea: relax the independence assumption to obtain a more general classifier

• I.e. model conditional dependencies between variables

• Different techniques (e.g. hidden variables,…)

• Most established: Bayesian networks

Page 37:

Bayesian networks

• Bayesian network:
  – tool for representing statistical dependencies between a set of random variables
  – acyclic directed graph
  – one vertex for each variable
  – for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices
  – unconnected variables (vertices) are independent!
  – each vertex has a table of local probability distributions

Page 38:

Bayesian networks

• Each variable depends only on its parents in the network!

(Figure: example network with class variable y and attributes x1,…,x5; the "parents" of x4 are denoted Pa4.)

P(xi | xl, Pai) = P(xi | Pai)   for all xl ∈ {x1,…,xn} \ ({xi} ∪ Pai)

Page 39:

Bayesian networks

Bayesian network based classifier:

(Figure: the same example network; x2 is a parent of x1, x3 of x4, and x4 of x5.)

h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^n P(xi | c, Pai) · P(c)

For the example network:

h(x1,…,xn) = argmax_y [ P(x1 | x2, y) · P(x2 | y) · P(x3 | y) · P(x4 | x3, y) · P(x5 | x4, y) · P(y) ]
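As an illustrative sketch (the CPT numbers below are made up and the helper names are my own), the factored score for this network structure can be evaluated directly:

```python
from itertools import product

# Parent structure of the example network (class y conditions every factor):
# x2 is a parent of x1, x3 of x4, and x4 of x5; x2 and x3 have no parents.
parents = {1: (2,), 2: (), 3: (), 4: (3,), 5: (4,)}

def bn_score(x, c, prior, cpt):
    """prod_i P(x_i | c, Pa_i) * P(c); cpt[(i, pa_vals, c)] = P(x_i = 1 | pa_vals, c)."""
    score = prior[c]
    for i, pa in parents.items():
        pa_vals = tuple(x[j] for j in pa)
        p1 = cpt[(i, pa_vals, c)]
        score *= p1 if x[i] == 1 else 1 - p1
    return score

def classify(x, prior, cpt):
    return max(prior, key=lambda c: bn_score(x, c, prior, cpt))

# Made-up CPTs: under c = 1 each attribute tends toward 1, under c = 0 toward 0.
prior = {0: 0.5, 1: 0.5}
cpt = {(i, vals, c): (0.7 if c == 1 else 0.3)
       for i, pa in parents.items()
       for vals in product([0, 1], repeat=len(pa))
       for c in (0, 1)}

x = {i: 1 for i in range(1, 6)}
print(classify(x, prior, cpt))
```

With real data, the CPT entries would be estimated from counts, just like the naive Bayes probabilities earlier; the only change is the extra conditioning on each variable's parents.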

Page 40:

Bayesian networks

• In the case of boolean attributes this is again linear, but not in the input variables:

• Linear in product features:

h(x) = argmax_{c∈Y} ∏_{i=1}^n P(xi | c, Pai) · P(c) = sgn[ Σ_{i=1}^n w_i · (xi · Pai1 · … · Paimi) + b ]

Page 41:

Bayesian networks

• The difficulty here is to estimate the correct network structure (and probability parameters) from training data!

• For general Bayesian networks this problem is NP-hard!

• There exist numerous heuristics for learning Bayesian networks from data!

Page 42:

References

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996

[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973