Statistical Methods for Data Science, Lecture 5
Interval estimates; comparing systems

Richard Johansson

November 18, 2018
statistical inference: overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?
"recipes"

- in this lecture, we'll look at a few "recipes" that you'll use in the assignment:
  - interval estimate for a proportion ("heads probability")
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we'll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment? what assumptions about the data?
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] so that the true value of the parameter is "likely" to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]
frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%
interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method
the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
  - large sample → more concentrated distribution

[figure: estimator distribution for n = 25]
estimator distribution and sample size (p = 0.35)

[figure: four histograms of the estimator distribution for n = 10, 25, 50, and 100; larger samples give a more concentrated distribution]
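The shrinking spread shown in these panels can be reproduced numerically. This is an illustrative simulation, not code from the slides; the seed and the number of simulated datasets are arbitrary choices:

```python
import numpy as np

# simulate the distribution of the ML estimate k/n for a coin
# with true probability p = 0.35, at several sample sizes
rng = np.random.default_rng(0)
p = 0.35

spread = {}
for n in [10, 25, 50, 100]:
    # 10,000 simulated datasets of size n, one ML estimate per dataset
    estimates = rng.binomial(n, p, size=10_000) / n
    spread[n] = estimates.std()
    print(n, round(spread[n], 3))
```

The printed standard deviations decrease roughly like 1/sqrt(n), which is why larger samples give more concentrated estimator distributions.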
confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal
- instead, we'll consider a method to compute a Bayesian credible interval that does not use any approximations
  - works fine even if the numbers are small
credible intervals in Bayesian statistics

1. choose a prior distribution
2. compute a posterior distribution from the prior and the data
3. select an interval that covers e.g. 95% of the posterior distribution

[figures: the prior, the posterior, and the selected 95% interval]
recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)
- then the posterior is also a Beta, with parameters k + a and (n − k) + b
- select a 95% interval

[figures: the Beta prior, the Beta posterior, and the 95% credible interval]
in SciPy

- assume n_success successes out of n
- recall that we use ppf to get the percentiles!
- or even simpler, use interval

a = 1
b = a
n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)
p_low, p_high = posterior_distr.interval(0.95)
example: political polling

- we ask 87 randomly selected Gothenburgers about whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

n_for = 81
n = 87
n_against = n - n_for
p_mle = n_for / n
posterior_distr = stats.beta(n_for + 1, n_against + 1)
print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))
don't forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?
recipe 2: mean of a normal

- we have some sample that we assume follows some normal distribution; we don't know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
  - this is a bell-shaped distribution with longer tails than the normal
- the t distribution has a parameter called degrees of freedom (df) that controls the tails

[figure: density of the t distribution]
recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

- to get an interval estimate, select a 95% interval in this distribution

[figures: the t distribution around the sample mean, and the selected 95% interval]
example

- to demonstrate, we generate some data:

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()
s = x.std()
n = len(x)
posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))
print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))
alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation:
  https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)
mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std
recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation
the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can't be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset
bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

repeat N times:
    D* = pick k items randomly from D, with replacement
    m = estimate on D*
    store m in a list M
return 2.5% and 97.5% percentiles of M

[figure: histogram of the bootstrap estimates]

- see Wikipedia for different varieties
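A minimal NumPy sketch of the pseudocode above, assuming the estimate of interest is simply an accuracy (the mean of 0/1 correctness indicators); the dataset D here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical test set: 1 = correct prediction, 0 = incorrect
# (generated here so the example is self-contained)
D = (rng.random(200) < 0.8).astype(int)

N = 10_000
M = []
for _ in range(N):
    # pick k items randomly from D, with replacement
    D_star = rng.choice(D, size=len(D), replace=True)
    # the estimate on the resampled dataset
    M.append(D_star.mean())

# the interval where 95% of the bootstrap estimates end up
low, high = np.percentile(M, [2.5, 97.5])
print(low, high)
```

The returned interval brackets the accuracy computed on the original dataset; N = 10,000 resamples is an arbitrary but common choice.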
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we'll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the "tests" (e.g. documents) are independent
- for tests in general, see e.g. Wikipedia
comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
  - so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?
null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
  - the null hypothesis is typically "boring": the true accuracy is equal to 0.7
  - the "unusualness" is measured by the p-value
- if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered "significant" is 0.05
the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40
historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

"From whence it follows, that it is Art, not Chance, that governs."
(Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)
example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: binomial pmf over the outcomes 0–50 for n = 50, p = 0.70, with the outcome 40 marked]

- the p-value is 0.16, which isn't "significantly" unusual!
implementing the exact binomial test in SciPy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as "unusual" as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    p_value = 0
    for i in range(0, n+1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don't have to implement it since there is a function scipy.stats.binom_test that does exactly this!
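For the running example (40 correct out of 50, against the null accuracy 0.70), the same pmf-summing logic gives the p-value quoted on the earlier slide:

```python
import scipy.stats

# p-value for 40 correct out of 50 under the null accuracy 0.70:
# sum the probabilities of all outcomes at least as unusual as 40
rv = scipy.stats.binom(50, 0.70)
p_x = rv.pmf(40)
p_value = sum(rv.pmf(i) for i in range(51) if rv.pmf(i) <= p_x)
print(round(p_value, 2))
```

Note that in recent SciPy versions scipy.stats.binom_test has been removed in favour of scipy.stats.binomtest, which returns a result object with a pvalue attribute.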
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
comparing two classifiers

- I'm comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?
contingency table

- we make a table that compares the errors of the two classifiers:

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
  - conversely, if they are really different, B and C should differ
- are these B and C values unusual?
McNemar's test

- in McNemar's test, we model the discrepancies (the B and C values)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- there are a number of variants of this test
- the original formulation:
  Quinn McNemar (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153-157.
- our version builds on the exact binomial test that we saw before
McNemar's test (continued)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- the number of discrepancies is B+C
- how are the discrepancies distributed?
- if the two systems are equivalent, the discrepancies should be more or less evenly spread over the B and C boxes
- it can be shown that B would be a binomial random variable with parameters B+C and 0.5
- so we can find the p-value (the "unusualness") like this:

p_value = scipy.stats.binom_test(B, B+C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
alternative implementation
http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a "significant" p-value over some null hypothesis?
  4. if no, revise hypothesis and go back to 3.
  5. if yes, publish your findings, promote them in the media, ...
searching for significant effects (alternative)

- or a "data science" experiment:
  1. you are given some dataset and told to "extract some meaning" from it
  2. look at the data until you find a "significant" effect
  3. publish ...
searching for significant effects

- remember: if the null hypothesis is true, we will still see "significant" effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small
  - even if this is just due to chance
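This 5% false-alarm rate is easy to check by simulation. The following sketch (not from the slides) repeatedly tests two samples drawn from the same distribution, so every "significant" result is a false positive:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tests, n = 1000, 50
false_alarms = 0
for _ in range(n_tests):
    # two groups drawn from the SAME distribution: the null is true
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    # simple two-sample z-test on the difference of means
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/n + b.var(ddof=1)/n)
    if abs(z) > 1.96:  # two-sided 5% threshold
        false_alarms += 1

# fraction of tests that looked "significant" purely by chance
print(false_alarms / n_tests)
```

The printed rate is typically close to 0.05: run enough tests on pure noise and some will look significant.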
spurious correlations

[chart from tylervigen.com: "Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders", years 1999–2009]
example

http://andrewgelman.com/2017/11/11/student-bosses-want-p-hack-dont-even-know/
"data dredging": further reading

https://en.wikipedia.org/wiki/Data_dredging
https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data
some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons
Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the "significance" threshold by N)
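A small sketch of both equivalent forms of the correction, with hypothetical p-values:

```python
import numpy as np

# hypothetical p-values from N separate comparisons
p_values = np.array([0.010, 0.020, 0.040, 0.300])
N = len(p_values)

# form 1: multiply each p-value by N (capping at 1.0) ...
p_adjusted = np.minimum(p_values * N, 1.0)

# ... or form 2: compare the raw p-values to 0.05 / N
significant = p_values < 0.05 / N

print(p_adjusted)
print(significant)
```

With four tests, only the p-value 0.010 survives the correction: 0.010 < 0.05/4 = 0.0125, while 0.020 and 0.040, which look "significant" on their own, do not.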
Bonferroni correction for multiple comparisons: example
the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)