
TRANSCRIPT

Page 1:

Announcements

• HW0 due
• I won’t take late day penalties for HW0
• Reading assignment
  • Online draft chapter on Naïve Bayes and Logistic Regression; I’ll put a link on the web page
• HW1 coming soon (email)
  • Due two weeks from the date assigned
• Today: break around 7, class until 8
• Midterm exam
  • In class on Oct 25

Page 2:

Today’s Topics

• Naïve Bayes wrap-up
• Experimental methodology

Page 3:

Bayes’ Rule Applied to ML

P(class | F) = P(F | class) * P(class) / P(F)

Why do we care about Bayes’ rule? Because while P(class | F) is typically difficult to measure directly, the values on the right-hand side are often easy to estimate (especially if we make simplifying assumptions).

Here P(class | F) is shorthand for

P(class = c | f1 = v1, …, fn = vn)
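To make the rule concrete, a small worked instance (the numbers are hypothetical, chosen only to illustrate the arithmetic, not taken from the slides): suppose 10% of documents belong to class +, and a feature F occurs in 60% of class-+ documents and in 5% of the rest. Then

$$P(+ \mid F) = \frac{P(F \mid +)\,P(+)}{P(F)} = \frac{0.6 \times 0.1}{0.6 \times 0.1 + 0.05 \times 0.9} \approx 0.57$$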

Page 4:

The Naïve Bayes Assumption

NB assumes that the conditional P(f1 = v1, …, fn = vn | class) factors:

P(f1 = v1, …, fn = vn | class) = P(f1 = v1 | class) * … * P(fn = vn | class)

That is, NB assumes that the value of feature i is conditionally independent of the value of feature j, given the class.

Page 5:

Independence of Events

• If A and B are independent, then:

P(A|B) = P(A)
P(B|A) = P(B)

(each of these two conditions implies the other)

• And therefore: P(A ^ B) = P(A) * P(B)

An example: roll two dice

Case 1: A = die 1 roll > 3, B = die 2 roll > 3
Case 2: A = die 1 roll > 3, B = sum of dice > 7

Are A and B independent?
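The two dice cases can be checked by brute-force enumeration; a minimal sketch (the function and variable names are mine):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event over the uniform outcome space."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

A  = lambda o: o[0] > 3           # A: die 1 roll > 3
B1 = lambda o: o[1] > 3           # case 1: die 2 roll > 3
B2 = lambda o: o[0] + o[1] > 7    # case 2: sum of dice > 7

for name, B in [("case 1", B1), ("case 2", B2)]:
    joint = prob(lambda o: A(o) and B(o))
    print(name, "independent:", abs(joint - prob(A) * prob(B)) < 1e-12)
# case 1 independent: True   (P(A ^ B) = 1/4 = P(A) * P(B))
# case 2 independent: False  (P(A ^ B) = 12/36, but P(A) * P(B) = (1/2) * (15/36))
```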

Page 6:

Conditional Independence

If A and B are conditionally independent given C, then:

Pr(A|C,B) = Pr(A|C)
Pr(B|C,A) = Pr(B|C)

and therefore:

Pr(A ^ B | C) = Pr(A|C) * Pr(B|C)

Sometimes two non-independent events become independent after “conditioning.”

Example:
A = my daughter’s blood type
B = my son’s blood type
C = my wife’s and my blood types
(NOT twins!)

Page 7:

Naïve Bayes Rule

Derive on Whiteboard
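The derivation is left for the whiteboard; a brief sketch in the notation of the previous slides (my filling-in, not verbatim from the lecture):

$$P(c \mid f_1{=}v_1, \ldots, f_n{=}v_n) = \frac{P(f_1{=}v_1, \ldots, f_n{=}v_n \mid c)\; P(c)}{P(f_1{=}v_1, \ldots, f_n{=}v_n)} \quad \text{(Bayes' rule)}$$

$$= \frac{P(c)\;\prod_{i=1}^{n} P(f_i{=}v_i \mid c)}{P(f_1{=}v_1, \ldots, f_n{=}v_n)} \quad \text{(NB assumption)}$$

Taking the ratio of this expression for c = + versus c = − cancels the denominator, which is exactly the form used on the next slide.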

Page 8:

Start from Bayes’ rule for each class:

P(c=+ | F) = P(F | c=+) * P(c=+) / P(F)
P(c=- | F) = P(F | c=-) * P(c=-) / P(F)

Form a ratio of conditional class probabilities; the P(F) terms cancel:

P(c=+ | F) / P(c=- | F) = [ P(F | c=+) / P(F | c=-) ] * [ P(c=+) / P(c=-) ]

where P(c=+) / P(c=-) is the ratio of the “priors.”

Naïve Bayes Rule (uses the NB assumption to factor P(F | c)):

P(c=+ | F) / P(c=- | F) = [ P(f1 | c=+) * … * P(fn | c=+) ] / [ P(f1 | c=-) * … * P(fn | c=-) ] * [ P(c=+) / P(c=-) ]

Page 9:

Naïve Bayes Example

Training set:

Color   Shape   Size    Class
red     •       big     +
blue            small   +
red             small   +
red             big     −
blue    •       small   −

Classify: red, •, small → ?

P(+|F) ∝ P(red|+) * P(•|+) * P(small|+) * P(+) = 2/3 * 1/3 * 2/3 * 3/5
P(−|F) ∝ P(red|−) * P(•|−) * P(small|−) * P(−) = 1/2 * 1/2 * 1/2 * 2/5

P(+|F) / P(−|F) = (2/3 * 1/3 * 2/3 * 3/5) / (1/2 * 1/2 * 1/2 * 2/5) ≈ 1.78

This is the odds that the example is +. To get Pr(+ | F), use

prob = odds / (odds + 1)   (here ≈ 0.64)
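The slide’s arithmetic, checked in code (a sketch; “shape” stands for the test example’s shape feature, whose glyph did not fully survive the transcript):

```python
# Worked Naive Bayes example from the slide: classify the test example
# (red, shape, small) using counts from the 5 training examples.
p_plus, p_minus = 3/5, 2/5                       # class priors: 3 of 5 are +

num = (2/3) * (1/3) * (2/3) * p_plus             # P(red|+) P(shape|+) P(small|+) P(+)
den = (1/2) * (1/2) * (1/2) * p_minus            # P(red|-) P(shape|-) P(small|-) P(-)

odds = num / den                                  # odds that the example is +
prob = odds / (odds + 1)                          # convert odds to a probability
print(f"odds = {odds:.2f}, P(+|F) = {prob:.2f}")  # odds = 1.78, P(+|F) = 0.64
```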

Page 10:

Savvy Bayes?

• Can partially represent dependencies by including some compound features, such as:

P(A ^ B ^ D ^ +) = P(A | B ^ D ^ +) * P(B|+) * P(D|+) * P(+)

(here B ^ D acts as a single compound feature on which A is conditioned, while B and D themselves are still treated as conditionally independent given the class)

• Allows compact representation of the “full joint probability distribution”
• Bayesian nets – topic of the second half of the course

Page 11:

Technical Details for Naïve Bayes

• Dealing with real-valued features
  • Discretizing
  • Directly representing real-valued features
• Avoiding zeros
  • Pr(feature = value) = 0
• Avoiding numeric underflow

Page 12:

Dealing with Real-Valued Features: Discretize

• Partition the feature into non-overlapping bins
• Estimate Pr( bin(f) | + ) and Pr( bin(f) | − )

1) Uniformly divide [min, max]
   Not the best idea. Why?
2) Put the same number of examples in each bin
   Better idea

[Figure: + and − training examples plotted along feature f, partitioned into five equal-frequency bins A, B, C, D, E; two of the thirteen + examples fall in bin A.]

Pr( bin(f) = ‘A’ | + ) = 2/13
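A minimal sketch of the equal-frequency (“same number of examples per bin”) idea; the helper names and the sample data are my own, not from the lecture:

```python
import numpy as np

def equal_frequency_bins(values, n_bins=5):
    """Bin edges that put roughly the same number of examples in each bin."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantile cut points
    return np.quantile(values, qs)

def bin_of(x, edges):
    """Index of the bin that x falls into (0 .. len(edges))."""
    return int(np.searchsorted(edges, x))

# Estimate Pr(bin(f) | +) from labeled training data
f_vals = np.array([0.3, 0.5, 1.1, 1.7, 2.2, 2.8, 3.0, 3.9, 4.5, 5.2])
labels = np.array(['+', '+', '-', '+', '-', '-', '+', '-', '+', '-'])
edges = equal_frequency_bins(f_vals, n_bins=5)

plus = f_vals[labels == '+']                   # + examples only
b = bin_of(0.4, edges)                         # bin of a query value
pr = np.mean([bin_of(x, edges) == b for x in plus])
print(f"Pr(bin(f) = {b} | +) = {pr:.2f}")
```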

Page 13:

Dealing with Real-Valued Features: Learn a PDF

• Estimate Pr(f|+) and Pr(f|−) with a probability density function (e.g., a Gaussian)
• Another example: a sum of Gaussians
• (See the paper by G. John & P. Langley)

$$P(X = x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \text{const} \cdot e^{-\text{const}\,(x - x_i)^2}$$

(the sum runs over the N training examples x_i)

[Figure: the two learned densities Pr(f|+) and Pr(f|−) plotted as curves Pr(x) against x.]
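A sketch of the sum-of-Gaussians density estimate, in the spirit of the John & Langley approach; the bandwidth value below is an arbitrary assumption:

```python
import numpy as np

def kde_pdf(x, train_vals, sigma=0.5):
    """Sum-of-Gaussians estimate: average of one Gaussian per training value."""
    z = (x - np.asarray(train_vals)) / sigma
    return np.mean(np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi)))

# One density per class, then use them as Pr(f|+) and Pr(f|-) inside NB
f_plus  = [1.0, 1.3, 1.9, 2.2]   # feature values of + training examples
f_minus = [3.1, 3.5, 4.0, 4.4]   # feature values of - training examples

x = 2.0
print("Pr(f=2.0 | +) ≈", kde_pdf(x, f_plus))
print("Pr(f=2.0 | -) ≈", kde_pdf(x, f_minus))
```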

Page 14:

Dealing with Real-Valued Features: Learn a PDF

(Actually, compute P(X|+) and P(X|−) separately.)

[Figure: a sum-of-Gaussians example — a bumpy, multi-modal density P(X) plotted against X.]

Page 15:

Technical Details: Avoiding Zeros

• Recall the Naïve Bayes rule:

P(c=+ | F) / P(c=- | F) = [ P(f1 | c=+) * … * P(fn | c=+) ] / [ P(f1 | c=-) * … * P(fn | c=-) ] * [ P(c=+) / P(c=-) ]

• If Pr(f | c) = 0 for any feature, we have problems: a single zero wipes out the whole product
• Need some method to prevent zero-valued probabilities
• One approach: Eq. 6.22 (next slide)

Page 16:

Avoiding Zeros (cont.)

• “m-estimates”

p(fi = vi | c) = ( #(times fi = vi) + m * p ) / ( #(train ex’s) + m )

where
  #(times fi = vi) — the estimate based on the data
  p — an initial guess for p(fi = vi), based on prior knowledge
  m — the “equivalent sample size”

Page 17:

m-Estimate of P(fi = vi)

Prob = (nc + m * p) / (n + m)

  nc — # of examples with fi = vi
  n  — # of actual examples
  m  — equivalent sample size used in the guess
  p  — prior guess

Example: of 10 examples, 8 have color = red. With m = 100 and p = 0.5:

Prob(color = red) = (8 + 100 * 0.5) / (10 + 100) = 58 / 110 ≈ 0.53

“Laplace” correction: m * p = 1, with m = the number of values for the feature.
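The m-estimate as a one-line function, reproducing the slide’s example (the function name is mine):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of a probability: blend data counts with a prior guess.
    n_c: # examples with f_i = v_i, n: # examples,
    p: prior guess for the probability, m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

# The slide's example: 8 of 10 examples are red, prior guess 0.5, m = 100
print(m_estimate(8, 10, 0.5, 100))   # 58/110 ≈ 0.527

# "Laplace" correction: m*p = 1 with m = # of feature values (say 3 colors)
print(m_estimate(8, 10, 1/3, 3))     # (8 + 1) / (10 + 3) ≈ 0.692
```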

Page 18:

Naïve Bayes Technical Details: Underflow

• If we have, say, 1000 features, we are multiplying 1000 numbers in [0, 1]
• Could lead to “underflow”
• Trick: instead of multiplying probabilities, sum the logs of the probabilities: log(x*y) = log(x) + log(y)

log [ P(c=+ | F) / P(c=- | F) ] = [ log P(F | c=+) + log P(c=+) ] − [ log P(F | c=-) + log P(c=-) ]
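A sketch of the log-space trick; the signature is my own and the per-feature probabilities are assumed to be precomputed:

```python
import math

def log_odds(feature_probs_plus, feature_probs_minus, prior_plus, prior_minus):
    """log of P(c=+|F)/P(c=-|F): sum logs instead of multiplying probabilities."""
    num = math.log(prior_plus) + sum(math.log(p) for p in feature_probs_plus)
    den = math.log(prior_minus) + sum(math.log(p) for p in feature_probs_minus)
    return num - den

# With 1000 features each having P(f_i|c) = 0.1, the raw product 1e-1000
# underflows a float64, but the log-space computation is stable.
lo = log_odds([0.1] * 1000, [0.2] * 1000, 0.5, 0.5)
print(lo)                                  # large negative: - is favored
print("classify as", "+" if lo > 0 else "-")
```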

Page 19:

Is Naïveté OK?

Surprisingly, the assumption of independence, while most likely violated, is not too harmful!

• Naïve Bayes works surprisingly well
  – Very successful in text categorization (“bag-of-words” representation)
  – Used in printer diagnosis in Win 95, the Office Assistant, spam filtering, etc.
• Recent resurgence of research activity in Naïve Bayes

Page 20:

Naïve Bayes is a “Linear Separator”

[Figure: a 2-D scatter of + and − examples with a query point “?”; Naïve Bayes separates the two classes with a straight line.]

Take on faith for now; we will “prove” it in a future lecture.

Very different from nearest neighbor.
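A sketch of why this holds (my filling-in of the deferred proof): take logs of the Naïve Bayes ratio from the earlier slide,

$$\log\frac{P(c{=}{+} \mid F)}{P(c{=}{-} \mid F)} \;=\; \log\frac{P(c{=}{+})}{P(c{=}{-})} \;+\; \sum_{i=1}^{n} \log\frac{P(f_i{=}v_i \mid c{=}{+})}{P(f_i{=}v_i \mid c{=}{-})}$$

For discrete features this log-odds score is a constant plus one weight per feature value, i.e., a linear function of indicator-coded features, so the decision boundary (log-odds = 0) is a hyperplane.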

Page 21:

Naïve Bayes Report Card

Criterion                   K-NN   NB
Learning Efficiency         A+     A   (harder for real-valued features)
Classification Efficiency   F      A
Stability                   C      C   (depends on training-set size)
Robustness (to noise)       D      B
Empirical Performance       C      B
Domain Insight              F      B   (probability ratios indicate informative features)
Implementation Ease         A      A
Incremental Ease            A      A

Page 22:

Next Topic: Methodology

• Train/Tune/Test sets
• 10-fold cross validation
• Misc. experimental methods

Page 23:

Evaluating ML Algorithms: Theory-Based

• Computational learning theory (COLT)
  • Probably approximately correct (PAC) learning
    • Which concepts are “learnable” from a polynomial # of examples?
    • Independent of the example distribution
  • Mistake-bound framework
    • How many mistakes (on the training set) will the learner make before it converges to the correct hypothesis?

Chapter 7 of the text, but we will not be able to spend much, if any, time on this. (Possible project topic.)

Page 24:

Evaluating ML Algorithms: Empirical Studies

• Some evaluation measures:
  • Correctness on novel examples (inductive learning) ← our focus
  • Time spent learning
  • Time needed to apply the learned result
  • Speedup after learning (explanation-based learning)
  • Space required
• Basic idea: repeatedly use train/test sets to estimate future accuracy

Page 25:

Some Typical ML Experiments – Empirical Learning

[Figure: “learning curves” — test-set accuracy (y-axis) vs. # of training examples (x-axis) for Algorithm1 and Algorithm2, with confidence bars from multiple runs. This is the most commonly used kind of plot; the x-axis can instead be the amount of noise or the amount of missing features.]

Page 26:

Some Typical ML Experiments – Speedup Learning

[Figure: test-set problem-solving time (y-axis) vs. # of training examples (x-axis) for Algorithm1 and Algorithm2.]

Page 27:

Standard Methodology for Comparing Learners

1) Start with a dataset of labeled examples
2) Randomly partition it into N equal-sized groups
3) N times, use N−1 of the groups to form the training set
   a) Provide the training set to the learner
   b) Measure performance on the left-out group (the test set)
4) Evaluate based on the results of the N “folds”

Called N-fold cross validation (often N = 10)
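The procedure above as a code sketch (assumed signatures, not from the lecture: learner(train) returns a classifier, evaluate(classifier, test) returns an accuracy):

```python
import random

def n_fold_cross_validation(examples, learner, evaluate, n=10, seed=0):
    """Standard N-fold CV: each example is in the test set exactly once."""
    data = examples[:]
    random.Random(seed).shuffle(data)                 # random partition
    folds = [data[i::n] for i in range(n)]            # N roughly equal groups
    scores = []
    for i in range(n):
        test = folds[i]                               # left-out group
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        scores.append(evaluate(learner(train), test)) # performance on the fold
    return sum(scores) / n                            # average over the N folds
```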

Page 28:

Experimental Methodology: A Pictorial Overview

[Figure: a collection of classified examples is split into training examples and testing examples; the training examples go to the LEARNER, which produces a classifier; running the classifier on the testing examples yields the expected accuracy on future examples.]

Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results.

Page 29:

Tuning Sets

• Often, an ML system needs to set internal parameters
  • For example: # of training iterations, K in K-NN, etc.
• Goal: find the parameter settings that give maximum accuracy on future examples
• It is “cheating” to look at test-set labels now
• Tuning-set approach (a code sketch follows):
  1) Set aside part of the training set as a tuning set
  2) Set the parameters to some value
  3) Train with the remainder of the training set
  4) Estimate future performance for the current parameters using the tuning set
     (repeat steps 2–4 for each parameter setting)
  5) Keep the best parameter settings; train with all the training data; estimate future performance with the test set
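A sketch of the tuning-set approach (again with assumed learner/evaluate signatures; the 80/20 split fraction is an arbitrary choice of mine):

```python
def tune_and_train(train_data, learner, evaluate, param_grid, tune_frac=0.2):
    """Pick parameters on a held-out slice of the TRAINING data (never the
    test set), then retrain on all training data."""
    split = int(len(train_data) * (1 - tune_frac))
    train_part, tune_set = train_data[:split], train_data[split:]

    best_params, best_score = None, float("-inf")
    for params in param_grid:                      # repeat for each setting
        clf = learner(train_part, **params)        # train on the remainder
        score = evaluate(clf, tune_set)            # estimate on the tuning set
        if score > best_score:
            best_params, best_score = params, score

    return learner(train_data, **best_params)      # retrain with ALL training data
```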

Page 30:

Experimental Methodology: A Pictorial Overview

[Figure: as before, a collection of classified examples is split into training and testing examples, but now the training examples are further split into a training set and a tune set. The LEARNER generates candidate solutions from the training set; the tune set is used to select the best parameter settings; the learner is then trained on the entire training set, and the resulting classifier is applied to the testing examples to estimate the expected accuracy on future examples.]

Page 31:

Improper Experimental Methodology Can Have a Huge Impact!

A 2002 paper in Nature (a major, major journal) needed to be corrected due to “training on the testing set.”

Original report: 95% accuracy (5% error rate)
Corrected report (which is still buggy): 73% accuracy (27% error rate)

The error rate increased by over 400%!

Page 32:

Parameter Setting

Notice that each train/test fold may get different parameter settings!
• That’s fine (and proper)

That is, a “parameterless”* algorithm internally sets parameters for each data set it gets.

*Usually, though, some parameters have to be fixed externally (e.g., using knowledge of the data, the range of parameter settings to try, etc.)

Page 33:

Using Multiple Tuning Sets

• Using a single tuning set can be an unreliable predictor, and some data is “wasted.” Hence, the following is often done (see the sketch after this list):
  1) For each possible set of parameters:
     a) Divide the training data into train’ and tune sets, using N-fold cross validation
     b) Score this set of parameter values by its average tune-set accuracy
  2) Use the best set of parameter settings and all (train’ + tune) examples
  3) Apply the result to the test set
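A sketch of the multiple-tuning-set procedure (assumed learner/evaluate signatures as before):

```python
def cv_parameter_selection(train_data, learner, evaluate, param_grid, n=10):
    """Score each parameter setting by its average accuracy over N
    train'/tune splits, then retrain on all the training data."""
    folds = [train_data[i::n] for i in range(n)]

    def avg_tune_accuracy(params):
        scores = []
        for i in range(n):
            tune = folds[i]
            train_prime = [ex for j, f in enumerate(folds) if j != i for ex in f]
            scores.append(evaluate(learner(train_prime, **params), tune))
        return sum(scores) / n

    best = max(param_grid, key=avg_tune_accuracy)   # step 1: score every setting
    return learner(train_data, **best)              # step 2: train' + tune examples
```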

Page 34:

Example: Tuning a Parameter

• Consider one training set. What’s the best # of “hidden units” (for a neural network) to use for this training set?

Page 35:

Example: Tuning a Parameter

• Step 1: Try various values for k (the # of hidden units), using 10 train/tune splits for each k
• Step 2: Pick the best value for k (e.g., k = 2), then train using all the training data
• Step 3: Measure accuracy on the test set

[Figure: for each k, the training data is divided into 10 train/tune splits (splits 1, 2, …, 10).]

k = 0:   tune-set accuracy (avg. over 10 runs) = 92%
k = 2:   tune-set accuracy (avg. over 10 runs) = 97%
k = 100: tune-set accuracy (avg. over 10 runs) = 80%

Page 36:

What to Do for the FIELDED System?

• Do not use any test sets
• Instead, use only tuning sets to determine good parameters
• Test sets are used to estimate future performance
• You can report this estimate to your “customer,” then use all the data to retrain a “product” to give them