Statistical Methods for Data Science, Lecture 5
Interval estimates; comparing systems

Richard Johansson

November 18, 2018
statistical inference: overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?
"recipes"

- in this lecture, we'll look at a few "recipes" that you'll use in the assignment:
  - interval estimate for a proportion ("heads probability")
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we'll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment? what assumptions about the data?
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] so that the true value of the parameter is "likely" to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]
frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%
interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method
the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
  - large sample → more concentrated distribution

[figure: estimator distribution for n = 25]
estimator distribution and sample size (p = 0.35)

[figure: four histograms of the estimator distribution for n = 10, 25, 50, and 100; larger samples give a more concentrated distribution]
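The shrinking spread shown in these panels can be reproduced numerically. This is an illustrative simulation, not code from the slides; the seed and the number of simulated datasets are arbitrary choices:

```python
import numpy as np

# simulate the distribution of the ML estimate k/n for a coin
# with true probability p = 0.35, at several sample sizes
rng = np.random.default_rng(0)
p = 0.35

spread = {}
for n in [10, 25, 50, 100]:
    # 10,000 simulated datasets of size n, one ML estimate per dataset
    estimates = rng.binomial(n, p, size=10_000) / n
    spread[n] = estimates.std()
    print(n, round(spread[n], 3))
```

The printed standard deviations decrease roughly like 1/sqrt(n), which is why larger samples give more concentrated estimator distributions.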
confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal
- instead, we'll consider a method to compute a Bayesian credible interval that does not use any approximations
  - works fine even if the numbers are small
credible intervals in Bayesian statistics

1. choose a prior distribution
2. compute a posterior distribution from the prior and the data
3. select an interval that covers e.g. 95% of the posterior distribution

[figures: the prior, the posterior, and the selected 95% interval]
recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)
- then the posterior is also a Beta, with parameters k + a and (n − k) + b
- select a 95% interval

[figures: the Beta prior, the Beta posterior, and the 95% credible interval]
in SciPy

- assume n_success successes out of n
- recall that we use ppf to get the percentiles!
- or even simpler, use interval

a = 1
b = a
n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)
p_low, p_high = posterior_distr.interval(0.95)
example: political polling

- we ask 87 randomly selected Gothenburgers about whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

n_for = 81
n = 87
n_against = n - n_for
p_mle = n_for / n
posterior_distr = stats.beta(n_for + 1, n_against + 1)
print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))
don't forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?
recipe 2: mean of a normal

- we have some sample that we assume follows some normal distribution; we don't know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
  - this is a bell-shaped distribution with longer tails than the normal
- the t distribution has a parameter called degrees of freedom (df) that controls the tails

[figure: density of the t distribution]
recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

- to get an interval estimate, select a 95% interval in this distribution

[figures: the t distribution around the sample mean, and the selected 95% interval]
example

- to demonstrate, we generate some data:

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()
s = x.std()
n = len(x)
posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))
print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))
alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation:
  https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)
mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std
recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation
the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can't be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset
bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

repeat N times:
    D* = pick k items randomly from D, with replacement
    m = estimate on D*
    store m in a list M
return 2.5% and 97.5% percentiles of M

[figure: histogram of the bootstrap estimates]

- see Wikipedia for different varieties
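A minimal NumPy sketch of the pseudocode above, assuming the estimate of interest is simply an accuracy (the mean of 0/1 correctness indicators); the dataset D here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical test set: 1 = correct prediction, 0 = incorrect
# (generated here so the example is self-contained)
D = (rng.random(200) < 0.8).astype(int)

N = 10_000
M = []
for _ in range(N):
    # pick k items randomly from D, with replacement
    D_star = rng.choice(D, size=len(D), replace=True)
    # the estimate on the resampled dataset
    M.append(D_star.mean())

# the interval where 95% of the bootstrap estimates end up
low, high = np.percentile(M, [2.5, 97.5])
print(low, high)
```

The returned interval brackets the accuracy computed on the original dataset; N = 10,000 resamples is an arbitrary but common choice.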
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we'll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the "tests" (e.g. documents) are independent
- for tests in general, see e.g. Wikipedia
comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
  - so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?
null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
  - the null hypothesis is typically "boring": the true accuracy is equal to 0.7
  - the "unusualness" is measured by the p-value
- if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered "significant" is 0.05
the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40
historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

"From whence it follows, that it is Art, not Chance, that governs."
(Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)
example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: binomial pmf over the outcomes 0–50 for n = 50, p = 0.70, with the outcome 40 marked]

- the p-value is 0.16, which isn't "significantly" unusual!
implementing the exact binomial test in SciPy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as "unusual" as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    p_value = 0
    for i in range(0, n+1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don't have to implement it since there is a function scipy.stats.binom_test that does exactly this!
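For the running example (40 correct out of 50, against the null accuracy 0.70), the same pmf-summing logic gives the p-value quoted on the earlier slide:

```python
import scipy.stats

# p-value for 40 correct out of 50 under the null accuracy 0.70:
# sum the probabilities of all outcomes at least as unusual as 40
rv = scipy.stats.binom(50, 0.70)
p_x = rv.pmf(40)
p_value = sum(rv.pmf(i) for i in range(51) if rv.pmf(i) <= p_x)
print(round(p_value, 2))
```

Note that in recent SciPy versions scipy.stats.binom_test has been removed in favour of scipy.stats.binomtest, which returns a result object with a pvalue attribute.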
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
comparing two classifiers

- I'm comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?
contingency table

- we make a table that compares the errors of the two classifiers:

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
  - conversely, if they are really different, B and C should differ
- are these B and C values unusual?
McNemar's test

- in McNemar's test, we model the discrepancies (the B and C values)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- there are a number of variants of this test
- the original formulation:
  Quinn McNemar (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12:153-157.
- our version builds on the exact binomial test that we saw before
McNemar's test (continued)

                  NB correct   NB incorrect
  perc correct    A = 125      B = 39
  perc incorrect  C = 61       D = 87

- the number of discrepancies is B+C
- how are the discrepancies distributed?
- if the two systems are equivalent, the discrepancies should be more or less evenly spread over the B and C boxes
- it can be shown that B would be a binomial random variable with parameters B+C and 0.5
- so we can find the p-value (the "unusualness") like this:

p_value = scipy.stats.binom_test(B, B+C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
alternative implementation
http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
overview
interval estimates
significance testing for the accuracy
comparing two classifiers
p-value fishing
searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a "significant" p-value over some null hypothesis?
  4. if no, revise hypothesis and go back to 3.
  5. if yes, publish your findings, promote them in the media, ...
searching for significant effects (alternative)

- or a "data science" experiment:
  1. you are given some dataset and told to "extract some meaning" from it
  2. look at the data until you find a "significant" effect
  3. publish ...
searching for significant effects

- remember: if the null hypothesis is true, we will still see "significant" effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small
  - even if this is just due to chance
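This 5% false-alarm rate is easy to check by simulation. The following sketch (not from the slides) repeatedly tests two samples drawn from the same distribution, so every "significant" result is a false positive:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tests, n = 1000, 50
false_alarms = 0
for _ in range(n_tests):
    # two groups drawn from the SAME distribution: the null is true
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    # simple two-sample z-test on the difference of means
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/n + b.var(ddof=1)/n)
    if abs(z) > 1.96:  # two-sided 5% threshold
        false_alarms += 1

# fraction of tests that looked "significant" purely by chance
print(false_alarms / n_tests)
```

The printed rate is typically close to 0.05: run enough tests on pure noise and some will look significant.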
spurious correlations

[chart from tylervigen.com: "Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders", years 1999–2009]
example

http://andrewgelman.com/2017/11/11/student-bosses-want-p-hack-dont-even-know/
"data dredging": further reading

https://en.wikipedia.org/wiki/Data_dredging
https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data
some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons
Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the "significance" threshold by N)
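A small sketch of both equivalent forms of the correction, with hypothetical p-values:

```python
import numpy as np

# hypothetical p-values from N separate comparisons
p_values = np.array([0.010, 0.020, 0.040, 0.300])
N = len(p_values)

# form 1: multiply each p-value by N (capping at 1.0) ...
p_adjusted = np.minimum(p_values * N, 1.0)

# ... or form 2: compare the raw p-values to 0.05 / N
significant = p_values < 0.05 / N

print(p_adjusted)
print(significant)
```

With four tests, only the p-value 0.010 survives the correction: 0.010 < 0.05/4 = 0.0125, while 0.020 and 0.040, which look "significant" on their own, do not.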
Bonferroni correction for multiple comparisons: example
the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)