Required Sample Size for Bayesian Network Structure Learning
Samee Ullah Khan
and
Kwan Wai Bong Peter
Outline
Motivation
Introduction
Sample Complexity
– Sanjoy Dasgupta
– Russell Greiner
– Nir Friedman
– David Haussler
Summary
Conclusion
Motivation
John works at a pharmaceutical company. What is the optimal sample size for a clinical trial? It is a function of both the statistical significance of the difference and the magnitude of the apparent difference between performances.
Purpose: a tool (measure) for public and commercial vendors to plan clinical trials.
Looking for: gaining acceptance from potential users through statistically significant evidence.
Motivation: Solution
Optimize the difference between the performances of both treatments.
Let C = diff = (expected cost of the new treatment) − (expected cost of the old treatment)
Motivation
C = 0; m = number of users; μ is the difference in performance
Motivation
C>0
Motivation
C<0
Motivation: Conclusion
The actual improvement in performance is assumed known. The analysis may be extended to handle uncertainty about the amount of improvement. It is also possible to shift the functions by 1σ′ or 2σ′ to the right, where σ′ is the standard deviation of the posterior distribution of the unknown parameter μ.
Motivation: Model
Paired observations (X1, Y1), (X2, Y2), …: Xi is the new clinical outcome, Yi is the old clinical outcome.
Let Z be the objective function, Zi = Xi − Yi (i = 1, 2, 3, …).
Assume that Z has normal density N(μ, σ²). To formulate our prior knowledge about μ, assume a prior density N(μ₀, σ₀²).
Under these assumptions the sample mean of the Zi is a sufficient statistic for the parameter μ.
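A minimal sketch of this conjugate normal model, assuming made-up values for σ, μ₀, σ₀ and toy data; the function name posterior_over_mu and all numbers are illustrative, not part of the original analysis.

```python
import numpy as np

def posterior_over_mu(z, sigma, mu0, sigma0):
    """Normal-normal update: Z_i ~ N(mu, sigma^2) with known sigma, prior mu ~ N(mu0, sigma0^2).
    The sample mean of the Z_i is the sufficient statistic used here."""
    n = len(z)
    z_bar = np.mean(z)                                   # sufficient statistic for mu
    post_var = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
    post_mean = post_var * (mu0 / sigma0**2 + n * z_bar / sigma**2)
    return post_mean, np.sqrt(post_var)

# Paired outcomes: X_i from the new treatment, Y_i from the old one (toy data).
rng = np.random.default_rng(0)
x = rng.normal(1.2, 1.0, size=50)                        # new clinical outcomes
y = rng.normal(1.0, 1.0, size=50)                        # old clinical outcomes
z = x - y                                                # objective function Z_i = X_i - Y_i

mean, sd = posterior_over_mu(z, sigma=np.sqrt(2.0), mu0=0.0, sigma0=1.0)
print(f"posterior for mu: N({mean:.3f}, {sd:.3f}^2)")
```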
Introduction
Efficient learning – more accurate models with less data
– Compare: P(A) and P(B) vs. the joint P(A,B); the former requires less data (see the sketch below)!
– Discover structural properties of the domain
– Identifying independencies in the domain helps to
• order events that occur sequentially
• perform sensitivity analysis and inference
Predict the effect of actions
– Involves learning causal relationships among variables
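A tiny illustration, assuming binary variables, of why the factored model needs less data: the full joint over n binary variables has 2^n − 1 free parameters, while the independent model has only n.

```python
# For n binary variables, a full joint table needs 2**n - 1 free parameters,
# while treating the variables as independent needs only n, so it can be
# estimated reliably from far less data.
def joint_params(n):
    return 2 ** n - 1

def independent_params(n):
    return n

for n in (2, 5, 10):
    print(f"n={n}: joint={joint_params(n)}, independent={independent_params(n)}")
```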
Introduction
Why Struggle for Accurate Structure
Introduction
Adding an Arc
– Increases the number of parameters to be fitted
– Wrong assumptions about causality and domain structure
Introduction
Deleting an Arc
– Cannot be compensated by accurate fitting of parameters
– Also misses causality and domain structure
Introduction
Approaches to Learning Structure
– Constraint based
• Perform tests of conditional independence
• Search for a network that is consistent with the observed dependencies and independencies
– Score based
• Define a score that evaluates how well the (in)dependencies in a structure match the observations
• Search for a structure that maximizes the score (a sketch of this search follows below)
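A minimal sketch of the score-based approach, assuming fully observed binary data and a BIC-style penalized log-likelihood; the greedy add-one-arc search, the helper names, and the toy data are illustrative rather than any specific algorithm from the talk.

```python
import itertools
import math
from collections import Counter

import numpy as np

def family_score(data, child, parents, n_rows):
    """BIC-style score of one node given its parent set (binary variables assumed)."""
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    totals = Counter()
    for (pa, _), c in counts.items():
        totals[pa] += c
    loglik = sum(c * math.log(c / totals[pa]) for (pa, _), c in counts.items())
    n_params = 2 ** len(parents)               # one free parameter per parent configuration
    return loglik - 0.5 * math.log(n_rows) * n_params

def bic_score(data, parent_sets):
    n_rows = len(data)
    return sum(family_score(data, v, sorted(ps), n_rows) for v, ps in parent_sets.items())

def creates_cycle(parent_sets, frm, to):
    """Adding frm -> to creates a cycle iff `to` is already an ancestor of `frm`."""
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(parent_sets[v])
    return False

def greedy_structure_search(data, n_vars):
    """Repeatedly add the single arc that most improves the score, until no arc helps."""
    parent_sets = {v: set() for v in range(n_vars)}
    best = bic_score(data, parent_sets)
    improved = True
    while improved:
        improved = False
        for frm, to in itertools.permutations(range(n_vars), 2):
            if frm in parent_sets[to] or creates_cycle(parent_sets, frm, to):
                continue
            parent_sets[to].add(frm)
            score = bic_score(data, parent_sets)
            if score > best + 1e-9:
                best, improved = score, True
            else:
                parent_sets[to].remove(frm)
    return parent_sets, best

# Toy data: X0 influences X1; X2 is independent of both.
rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 500)
flip = (rng.random(500) < 0.1).astype(int)
x1 = (x0 + flip) % 2
x2 = rng.integers(0, 2, 500)
data = np.column_stack([x0, x1, x2]).tolist()

structure, score = greedy_structure_search(data, 3)
print(structure, round(score, 1))
```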
Introduction
Constraints versus Scores
– Constraint based
• Intuitive; follows closely the definition of BNs
• Separates structure construction from the form of the independence tests
• Sensitive to errors in individual tests
– Score based
• Statistically motivated
• Can make compromises
– Both
• Consistent: with sufficient amounts of data and computation, they learn the correct structure
Dasgupta’s model
Haussler’s extension of the PAC framework
Situation: fixed network structure
Goal: to learn the conditional probability functions accurately
Dasgupta’s model
A learning algorithm A:
– Given:
1) An approximation parameter ε > 0
2) A confidence parameter 0 < δ < 1
3) Variables drawn from an instance space X: x1, x2, …, xn
4) An oracle that generates random instances of X according to some unknown distribution P that we are going to learn
5) Some hypothesis class H
Dasgupta’s model
– Output: a hypothesis h ∈ H such that, with probability > 1 − δ,
$d(P, h) \le d(P, h_{opt}) + \epsilon$
where
d(·,·) is a distance measure, and
h_opt is the hypothesis h′ ∈ H that minimizes d(P, h′).
Dasgupta’s model: Distance measure
Most intuitive: L1 norm
Most popular: Kullback-Leibler divergence (relative entropy)
Minimizing dKL with respect to the empirically observed distribution is equivalent to solving the maximum likelihood problem
$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$
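A direct transcription of d_KL for a finite instance space, assuming dictionaries of probabilities with h(x) > 0 wherever P(x) > 0; it uses the natural logarithm, matching the ln convention adopted on the next slide.

```python
import math

def kl_divergence(p, h):
    """d_KL(P, h) = sum over x of P(x) * ln(P(x) / h(x)); assumes h(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / h[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
h = {"a": 0.4, "b": 0.4, "c": 0.2}
print(kl_divergence(p, h))   # non-negative; zero exactly when the two distributions agree
```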
Dasgupta’s model: Distance measure
Disadvantage of dKL: unbounded
So the measure adopted in this model is relative entropy with log replaced by ln.
Dasgupta’s model
Given m samples drawn from some distribution P, the algorithm finds the best-fitting hypothesis by evaluating each h ∈ H(ε, δ), computing its empirical log loss E[−ln h(x)], and returning the hypothesis with the smallest value, where H(ε, δ) ⊆ H is called an (ε, δ)-bounded approximation of H.
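A small sketch of this selection rule, assuming hypotheses are represented as dictionaries over a finite instance space; the bounded hypothesis class and the samples are made up for illustration.

```python
import math

def empirical_log_loss(hypothesis, samples):
    """Average of -ln h(x) over the observed samples."""
    return sum(-math.log(hypothesis[x]) for x in samples) / len(samples)

def best_hypothesis(hypothesis_class, samples):
    """Return the hypothesis in the (bounded) class with the smallest empirical log loss."""
    return min(hypothesis_class, key=lambda h: empirical_log_loss(h, samples))

samples = ["a", "a", "b", "a", "c", "b"]
H_bounded = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.34, "b": 0.33, "c": 0.33},
]
print(best_hypothesis(H_bounded, samples))
```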
Dasgupta’s model
By using Hoeffding and Chernoff bounds, the number of samples needed is bounded above by a quantity on the order of
$\frac{288\, n^2 k^2}{\epsilon^2}$,
times factors logarithmic in $n$, $1/\epsilon$, and $1/\delta$.
Lower bound: $\Omega(n^{3/2})$
Russell Greiner's claim
Many learning algorithms that decide which Bayesian network is optimal are based on measures such as log-likelihood, MDL, or BIC. These typical measures are independent of the queries that will be posed.
Learning algorithms should consider the distribution of queries as well as the underlying distribution of events, and seek the BN with the best performance over the query distribution rather than the one that appears closest to the underlying event distribution.
Russell Greiner’s model
Let
V: the set of the N variables
SQ: the set of all possible legal statistical queries
sq(x; y): a distribution over SQ
Suppose we fix a network B over V, and let B(x|y) be the real-valued probability that B returns for this assignment. Given the distribution sq(·;·) over SQ, the "score" of B is given below; we write err(B) = err_{sq,p}(B) when sq and p are clear from context.
$err_{sq,p}(B) = \sum_{x,y} sq(x; y)\,\big[B(x \mid y) - p(x \mid y)\big]^2$
Russell Greiner’s model
Observation:
– Any Bayesian network B* that encodes the underlying distribution p(·) will in fact produce the optimal performance; i.e., err(B*) will be optimal.
– This means that if we have a learning algorithm that produces better approximations to p(.) as it sees more training examples, then in the limit the sq(.) distribution becomes irrelevant.
Russell Greiner’s model
Given a set of labeled statistical queries Q = {⟨xi; yi; pi⟩}_i, let the following be the empirical score of the Bayesian net:
$\widehat{err}_Q(B) = \frac{1}{|Q|} \sum_{\langle x;\, y;\, p \rangle \in Q} \big[B(x \mid y) - p\big]^2$
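A small sketch of both quantities, assuming queries are keyed by (x, y) pairs; the dictionary representation stands in for a real Bayesian network and query distribution, and every probability below is made up for illustration.

```python
def true_error(sq, p, b):
    """err_{sq,p}(B): sum over queries (x, y) of sq(x; y) * (B(x|y) - p(x|y))^2."""
    return sum(weight * (b[q] - p[q]) ** 2 for q, weight in sq.items())

def empirical_error(labeled_queries, b):
    """Empirical score: average of (B(x|y) - p_i)^2 over labeled queries <x; y; p_i>."""
    return sum((b[(x, y)] - pi) ** 2 for x, y, pi in labeled_queries) / len(labeled_queries)

# Queries are keyed by (x, y); all numbers below are invented for illustration.
sq = {("flu", "fever"): 0.7, ("flu", "cough"): 0.3}      # query distribution
p  = {("flu", "fever"): 0.60, ("flu", "cough"): 0.40}    # underlying distribution's answers
b  = {("flu", "fever"): 0.55, ("flu", "cough"): 0.50}    # answers returned by the learned BN

print(true_error(sq, p, b))
print(empirical_error([("flu", "fever", 0.60), ("flu", "cough", 0.40)], b))
```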
Russell Greiner’s model
Computing err(B):
– It is #P-hard to compute the estimate of err(B) from general statistical queries.
– If we know that all queries sq(x; y) encountered satisfy p(y) ≥ γ for some γ > 0, then the following numbers of example queries and complete event examples suffice to obtain an ε-close estimate, with probability at least 1 − δ.
$M_{SQ}(\epsilon, \delta) = \frac{2}{\epsilon^2} \ln\frac{4}{\delta}$ example queries, and on the order of
$\max\!\left\{ \frac{8}{\gamma^2 \epsilon^2} \ln\frac{4\, M_{SQ}}{\delta},\ \frac{8}{\epsilon^2} \ln\frac{2\, M_{SQ}}{\delta} \right\}$ complete event examples.
Nir Friedman’s model
Review
– A BN is composed of two parts:
• a DAG
• parameters encoding the conditional probabilities
– Setup
• Let B* be a BN that describes the target distribution; we learn it from training samples.
• Entropy distance (Kullback-Leibler)
• The distance of the learned network from B* decreases with the sample size N.
$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$
Nir Friedman’s model: Learning
Criteria:
– Error threshold ε
– Confidence threshold δ
N(ε, δ): the required sample size. If the sample size is larger than N(ε, δ), then
Pr(D(P_{Lrn(·)} || P*) > ε) < δ, where Lrn(·) represents the learning routine.
If N(ε, δ) is MINIMAL, it is called the sample complexity. (A small empirical reading of this definition is sketched below.)
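One way to read the definition operationally: for a candidate sample size N, estimate Pr(D(P_Lrn || P*) > ε) by repeated learning trials and check whether it falls below δ. The sketch below does this for the simplest possible "network", a single Bernoulli parameter, with smoothing added so the divergence stays finite; all names and numbers are illustrative.

```python
import math
import random

def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def meets_threshold(n, p_true=0.3, eps=0.01, delta=0.05, trials=2000):
    """Empirically check whether Pr(D(P_learned || P*) > eps) < delta at sample size n."""
    failures = 0
    for _ in range(trials):
        heads = sum(random.random() < p_true for _ in range(n))
        p_hat = (heads + 1) / (n + 2)        # smoothed estimate so the divergence stays finite
        if kl_bernoulli(p_hat, p_true) > eps:
            failures += 1
    return failures / trials < delta

for n in (50, 200, 800):
    print(n, meets_threshold(n))
```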
Nir Friedman’s model:Notations
Vector Valued U={X1, X2,……Xn}– X,Y,Z Variables– x,y,z values
So B=<G,>– G is DAG are number of parameters xi|xi =P(xi|xi)
BN is minimal
$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \Pi_{X_i})$
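A direct sketch of this factorization, assuming the network is given as a parent map plus conditional probability tables; the rain/sprinkler/wet example and its numbers are invented for illustration.

```python
def joint_probability(assignment, parents, cpts):
    """P_B(x_1, ..., x_n) = product over i of P_B(x_i | parents(x_i))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][parent_values][value]
    return prob

# Toy network over binary variables: rain -> sprinkler, {rain, sprinkler} -> wet.
parents = {"rain": (), "sprinkler": ("rain",), "wet": ("rain", "sprinkler")}
cpts = {
    "rain":      {(): {0: 0.8, 1: 0.2}},
    "sprinkler": {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.9, 1: 0.1}},
    "wet":       {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.1, 1: 0.9},
                  (1, 0): {0: 0.2, 1: 0.8},   (1, 1): {0: 0.05, 1: 0.95}},
}
print(joint_probability({"rain": 1, "sprinkler": 0, "wet": 1}, parents, cpts))
```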
Nir Friedman’s model:Learning
Given a training set wN={u1,……..un} of U
find B that best matches D.The loglikelihood of B:
Decomposing loglikelihood according to structure:
$LL_N(B) = \log P_B(u_1, \ldots, u_N) = \sum_{j=1}^{N} \log P_B(u_j)$

$\hat{P}_N(A) = \frac{1}{N} \sum_{j=1}^{N} \mathbf{1}_A(u_j), \qquad \text{where } \mathbf{1}_A(u) = \begin{cases} 1 & \text{if } u \in A \\ 0 & \text{if } u \notin A \end{cases}$
Nir Friedman’s model:Learning
So we can derive
Assume G has fixed structure, optimize
Argument is large networks not desirable
i ixix ixixixiN xPNBNLL
,|log),(ˆ)(
)|(ˆ| ixiNixix xP
)()(),( NGGLLGS NN
Nir Friedman’s model: PSM
Penalized weighting function Ψ(N); the MDL principle:
– MDL: minimize the total description length of the data, Ψ(N) = ½ log N
– AIC: Ψ(N) = c (a constant)
– BIC: equivalent to the MDL penalty ½ log N
The learning routine returns the structure that maximizes the penalized score:
$S_N\big(Lrn(D_N)\big) = \max_G S_N(G)$
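A minimal sketch of the penalized score S_N(G) = LL_N(G) − Ψ(N)·|G|, assuming binary, fully observed data, maximum-likelihood parameters taken from the empirical counts, and the AIC/MDL penalties above; the toy two-variable data and names are illustrative.

```python
import math
from collections import Counter

def penalty(kind, n_rows, c=1.0):
    """Psi(N): AIC uses a constant, MDL/BIC uses (1/2) log N."""
    return c if kind == "aic" else 0.5 * math.log(n_rows)

def decomposed_loglik(data, parent_sets):
    """LL_N(B) with ML parameters theta_{x|pa} = P_hat_N(x | pa); binary variables assumed."""
    ll, n_params = 0.0, 0
    for child, parents in parent_sets.items():
        counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
        totals = Counter()
        for (pa, _), cnt in counts.items():
            totals[pa] += cnt
        ll += sum(cnt * math.log(cnt / totals[pa]) for (pa, _), cnt in counts.items())
        n_params += 2 ** len(parents)        # one free parameter per parent configuration
    return ll, n_params

def penalized_score(data, parent_sets, kind="mdl"):
    """S_N(G) = LL_N(G) - Psi(N) * |G|."""
    ll, n_params = decomposed_loglik(data, parent_sets)
    return ll - penalty(kind, len(data)) * n_params

data = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (0, 1)]   # columns: X0, X1 (toy sample)
empty  = {0: (), 1: ()}
linked = {0: (), 1: (0,)}
print("empty graph:", penalized_score(data, empty))
print("X0 -> X1:   ", penalized_score(data, linked))
```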
Nir Friedman’s model: Sample Complexity
Sample complexity
– Log-likelihood and penalty term
– Random noise
Entropy distance
$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$

$\|P - Q\|_1 = \sum_x |P(x) - Q(x)|$
Nir Friedman’s model: Sample Complexity
Idealized case: $S_N(G^*) \ge S_N(G)$ for every competing structure G; this holds once the gap in entropy distance, $D(\hat{P}_{N,G_2} \| P^*) - D(\hat{P}_{N,G_1} \| P^*)$, outweighs the gap in penalty terms, $\frac{\Psi(N)}{N}\,(|G_1| - |G_2|)$.
Lemma: let $x \ge 4$; if $x \ge 2 \log y$, then $x \ge \log(x y)$.
Nir Friedman’s model: Sample Complexity
Sub-sampling strategies in learning: estimate the entropy terms $H_{\hat{P}_N}(X_i)$ and $H_{\hat{P}_N}(X_i, \Pi_{X_i})$ from the sample, and bound $\Pr\big(|H_{\hat{P}_N}(X) - H_{P^*}(X)| > \epsilon\big)$ by a term that shrinks with N and grows with $\frac{1}{\epsilon^2}$ and $\log\frac{1}{\delta}$.
Nir Friedman’s model: Summary
It can be shown that the sample complexity of learning a BN with the MDL score satisfies the bound below.
– The bound is loose.
– Searching for an optimal structure is NP-hard.
$N(\epsilon, \delta) = O\!\left( \left(\tfrac{1}{\epsilon}\right)^{4/3} \log\tfrac{1}{\epsilon} \, \log\tfrac{1}{\delta} \, \log\log\tfrac{1}{\delta} \right)$
David Haussler’s model
The model is based on prediction. The learner attempts to infer an unknown target concept f chosen from a concept class F of {0, 1}-valued functions.
For any given instance xi, the learner predicts the value of f(xi).
After the prediction, the learner is told the correct answer and uses it to improve.
David Haussler’s model
Criteria for sample bounds:
– The probability of correctly predicting f(x_{m+1}) given (x1, f(x1)), …, (xm, f(xm))
– The cumulative number of mistakes made over m trials
The model uses the VC dimension. (A small sketch of the prediction game follows below.)
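A small sketch of this prediction game, assuming a finite concept class of threshold functions and majority-vote predictions over the version space; the class, target, and instance sequence are invented for illustration, and the cumulative mistake count is the quantity being tracked.

```python
def predict_and_count_mistakes(concepts, target, instances):
    """Online prediction: before each label is revealed, predict by majority vote over the
    hypotheses still consistent with the past (the version space), then update the space."""
    version_space = list(concepts)
    mistakes = 0
    for x in instances:
        votes = sum(c(x) for c in version_space)
        prediction = 1 if 2 * votes >= len(version_space) else 0
        truth = target(x)
        if prediction != truth:
            mistakes += 1
        version_space = [c for c in version_space if c(x) == truth]
    return mistakes

# Concept class: threshold functions f_t(x) = 1 iff x >= t, for integer thresholds t = 0..10.
concepts = [lambda x, t=t: int(x >= t) for t in range(11)]
target = lambda x: int(x >= 7)
instances = [3, 9, 5, 8, 6, 7, 2, 10, 0, 4]
print("cumulative mistakes over m trials:", predict_and_count_mistakes(concepts, target, instances))
```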
VC
General condition for uniform convergence:
Definition:
– Shattered set: let X be the instance space and C the concept class.
– S ⊆ X is shattered by C if for every S′ ⊆ S there is a c ∈ C that contains all of S′ and none of S − S′.
– Equivalently, S ⊆ X is shattered when C(S) = 2^S (C induces every subset of S).
$\Pr_{S \sim D^m}\big[\,\exists\, h \in C:\ error_D(h) > \epsilon \text{ and } h \text{ is consistent with } S\,\big] < \delta$
David Haussler’s model
Information Gain
– At instance m, the learner has observed the labels f(x1), …, f(xm) and must predict f(x_{m+1}).
$I_m(\vec{x}, f) = -\log \hat{\Pr}\big[\hat{f}(x_{m+1}) = f(x_{m+1}) \,\big|\, \hat{f}(x_i) = f(x_i),\ 1 \le i \le m\big] = \log\frac{1}{|V_{m+1}(f)|} - \log\frac{1}{|V_m(f)|}$
David Haussler’s model
$\rho_{m+1}(\vec{x}, f) = |V_{m+1}(f)| \,/\, |V_m(f)|$

$\mathbf{E}_{P_f}\big[I_{m+1}(f)\big] = \mathbf{E}_{P_f}\big[-\log \rho_{m+1}(f)\big]$

$\mathbf{E}_{P_f}\big[G_{m+1}(f)\big]$ is expressed in terms of $\rho_{m+1}(f)$ and $1 - \rho_{m+1}(f)$, relating the expected mistake at trial m+1 to the shrinkage of the version space.