
An Informal Introduction to Statistics in 2h

Tim Kraska

Goal of this Lecture

• This is not a replacement for a proper introduction to probability and statistics
• Instead, it only tries to convey the very basic intuition behind some of the ideas
• The risk of this lecture: half knowledge can be dangerous
• Most slides are based on CS 155 (big thanks to Eli)

The Very Basics

Statistics ≠ Probability

Probability: mathematical theory that describes uncertainty.

Statistics: set of techniques for extracting useful information from data.

Probability Space

A probability space consists of a sample space Ω (the set of possible outcomes), an event space F ⊆ 2^Ω, and a probability function Pr.

Probability Function

Pr assigns each event A ∈ F a value Pr(A) ∈ [0, 1], with Pr(Ω) = 1 and Pr(A ∪ B) = Pr(A) + Pr(B) for disjoint events A and B.

Tossing a (Fair) Coin

Ω = {H, T}
F = 2^Ω, so |F| = 2² = 4 events:
F = { ∅, {H}, {T}, {H, T} }

Pr(∅) = 0
Pr({H}) = 0.5
Pr({T}) = 0.5
Pr({H, T}) = 1

Rolling a (Fair) Die

Ω = {1, 2, 3, 4, 5, 6}
F = 2^Ω, so |F| = 2⁶ = 64 events

Pr(∅) = 0
Pr({1}) = Pr({2}) = Pr({3}) = Pr({4}) = Pr({5}) = Pr({6}) = 1/6
Pr({1, 2}) = Pr({1, 3}) = Pr({1, 4}) = Pr({1, 5}) = Pr({1, 6}) = 2/6

...
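As a quick illustration of these definitions (not on the original slides), a few lines of Python can enumerate the event space F = 2^Ω for the die and evaluate Pr on individual events:

```python
from itertools import chain, combinations

# Probability space for a fair die: sample space Omega, event space
# F = 2^Omega (the power set), and probability function Pr(A) = |A|/|Omega|.
omega = [1, 2, 3, 4, 5, 6]
events = list(chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1)))

def pr(event):
    return len(event) / len(omega)

print(len(events))   # 2**6 = 64 events
print(pr(()))        # Pr(empty event) = 0
print(pr((1,)))      # Pr({1}) = 1/6
print(pr((1, 2)))    # Pr({1, 2}) = 2/6
```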

Independent Events

Tossing a (Fair) Coin Twice

Ω = {HH, HT, TH, TT}
F = 2^Ω, so |F| = 2⁴ = 16 events

Pr(∅) = 0
Pr({HH}) = Pr(H) · Pr(H) = 0.5 × 0.5 = 0.25
Pr({HT}) = Pr({TH}) = Pr({TT}) = 0.25
Pr({HT, TT}) = Pr({HH, TH}) = 0.5
Pr({HH, HT}) = Pr({TH, TT}) = 0.5
...

Conditional Probability

Computing Conditional Probabilities

Example - a posteriori probability

Law of Total Probability

P(B) = Σᵢ P(B|Aᵢ) · P(Aᵢ), where the events Aᵢ partition Ω

In-Class Exercises

1. A fair coin was tossed 10 times and always ended up on its head. What is the likelihood that it will end up tails next?

2. Stan has two kids. One of his kids is a boy. What is the likelihood that the other one is also a boy?
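Exercise 1 is just independence: the probability is still 0.5. For exercise 2 (answer: 1/3), here is a small Monte Carlo sketch, assuming each child is independently a boy or a girl with probability 0.5 and reading the condition as "at least one child is a boy":

```python
import random

# Exercise 2 as a simulation: among two-kid families where at least
# one child is a boy, how often are both children boys?
trials = 1_000_000
both_boys = at_least_one_boy = 0

for _ in range(trials):
    kids = (random.choice("BG"), random.choice("BG"))
    if "B" in kids:
        at_least_one_boy += 1
        if kids == ("B", "B"):
            both_boys += 1

print(both_boys / at_least_one_boy)   # converges to 1/3, not 1/2
```

Conditioning on "at least one boy" leaves three equally likely families (BB, BG, GB), only one of which has two boys.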

Bayesian Statistics

Bayes’ Law

Bayes’ Theorem

P(H|D) = P(D|H) · P(H) / P(D)

Prior, P(H): the probability of the hypothesis being true before collecting data

Marginal, P(D): the probability of collecting this data under all possible hypotheses

Likelihood, P(D|H): the probability of collecting this data when our hypothesis is true

Posterior, P(H|D): the probability of our hypothesis being true given the data collected

Deriving Bayes’ Law

P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)

Dividing both sides by P(B):

P(A|B) = P(B|A) · P(A) / P(B)

Application: Finding a Biased Coin

Class Example: Drug Test

• 0.4% of the Rhode Island population use marijuana*
• Drug test: the test produces 99% true positive results for drug users and 99% true negative results for non-drug users

If a randomly selected individual tests positive, what is the probability that he or she is a user?

* http://medicalmarijuana.procon.org/view.answers.php?questionID=001199


P(User|+) = P(+|User) · P(User) / P(+)

          = P(+|User) · P(User) / [ P(+|User) · P(User) + P(+|¬User) · P(¬User) ]

          = (0.99 × 0.004) / (0.99 × 0.004 + 0.01 × 0.996)

          ≈ 28.4%
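The same calculation in a few lines of Python (numbers from the slide):

```python
# Bayes' law for the drug test, with the slide's numbers.
p_user = 0.004              # prior: 0.4% of the population are users
p_pos_given_user = 0.99     # true positive rate
p_pos_given_nonuser = 0.01  # false positive rate (1 - true negative rate)

# Law of total probability for the marginal P(+):
p_pos = p_pos_given_user * p_user + p_pos_given_nonuser * (1 - p_user)

posterior = p_pos_given_user * p_user / p_pos
print(f"P(User | +) = {posterior:.1%}")   # 28.4%
```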

Spam Filtering with Naïve Bayes


P(spam|words) = P(spam) · P(words|spam) / P(words)

P(spam|viagra, rich, ..., friend) = P(spam) · P(viagra, rich, ..., friend|spam) / P(viagra, rich, ..., friend)

With the naïve conditional-independence assumption:

P(spam|words) ≈ P(spam) · P(viagra|spam) · P(rich|spam) · ... · P(friend|spam) / P(viagra, rich, ..., friend)
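A toy sketch of the naïve Bayes idea; the word probabilities below are invented for illustration (in practice they are estimated from labeled training mail):

```python
# Toy naive Bayes spam score, multiplying per-word likelihoods under
# the conditional-independence assumption. All numbers are made up.
p_spam = 0.5
p_word_given_spam = {"viagra": 0.30, "rich": 0.20, "friend": 0.10}
p_word_given_ham = {"viagra": 0.01, "rich": 0.05, "friend": 0.20}

def spam_posterior(words):
    spam = p_spam
    ham = 1 - p_spam
    for w in words:  # the "naive" independence assumption
        spam *= p_word_given_spam[w]
        ham *= p_word_given_ham[w]
    return spam / (spam + ham)  # normalizing replaces computing P(words)

print(spam_posterior(["viagra", "rich", "friend"]))  # ~0.98
```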

Bayesian Inference

P(H|E) = P(E|H) · P(H) / P(E)

P(Θ | E ∩ α) = P(E | Θ ∩ α) · P(Θ | α) / P(E | α)

H: hypothesis
P(H): prior probability
P(H|E): posterior probability
P(E|H): probability of observing E given H (the likelihood)
P(E): model evidence (marginal likelihood)

Random Variables

How to Model A Simple Game

I get $5 from you

You get $10 from me

Random Variables

Independence

Expectation

E[X] = Σₓ x · Pr(X = x), written µ

Linearity of Expectation

E[X + Y] = E[X] + E[Y], even when X and Y are dependent

How to Model A Simple Game

I get $5 from you

You get $10 from me

Would you play this game?
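The slide's exact payoff rules don't survive in this transcript, so as a purely hypothetical stand-in assume a fair coin: heads, I get $5 from you; tails, you get $10 from me. Expectation alone says play, but the variance shows how spread out the outcomes are:

```python
# Hypothetical rules (the original slide's are not recoverable):
# fair coin; heads -> you pay me $5, tails -> I pay you $10.
outcomes = {-5: 0.5, +10: 0.5}   # your gain, with probabilities

mean = sum(x * p for x, p in outcomes.items())
var = sum((x - mean) ** 2 * p for x, p in outcomes.items())
print(mean, var)   # E[X] = 2.5, Var[X] = 56.25
```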

Variance

Var(X) = E[(X − µ)²] = E[X²] − µ²; the standard deviation is σ = √Var(X)

So far we knew the distribution. What if we do not?

Population N

Red/Blue/Green Lottery

Empirical Probability

fᵢ = nᵢ / N = nᵢ / Σᵢ nᵢ

Population N = 20

f_blue = 10/20, f_red = 6/20, f_green = 4/20

Population Mean and Variance

µ = Σᵢ xᵢ / N

σ² = Σᵢ (xᵢ − µ)² / N

Population N

Red/Blue/Green Lottery

Sample n

Population vs. Sample

A sample statistic estimates the corresponding population parameter:

            Population (parameter)      Sample (statistic)
Mean        µ = Σᵢ xᵢ / N               x̄ = Σᵢ xᵢ / n
Variance    σ² = Σᵢ (xᵢ − µ)² / N       Sₙ² = Σᵢ (xᵢ − x̄)² / n           (biased estimate)
                                        Sₙ₋₁² = Σᵢ (xᵢ − x̄)² / (n − 1)   (unbiased estimate)
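A quick numerical check of the bias (not on the original slides), using Python's statistics module, where pvariance divides by n and variance by n − 1:

```python
import random
import statistics

# Average the two estimators over many small samples from a fair die,
# whose true population variance is 35/12 ~ 2.9167.
true_var = statistics.pvariance(range(1, 7))

n, trials = 5, 50_000
biased = unbiased = 0.0
for _ in range(trials):
    sample = [random.randint(1, 6) for _ in range(n)]
    biased += statistics.pvariance(sample)    # divides by n
    unbiased += statistics.variance(sample)   # divides by n - 1

print(true_var, biased / trials, unbiased / trials)
# The n-divisor average comes out low; the (n-1)-divisor is on target.
```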

Big Data: How to Calculate the Variance in One Pass

Sₙ₋₁² = Σᵢ (xᵢ − x̄)² / (n − 1)

      = (1 / (n − 1)) · [ Σᵢ xᵢ² − (1/n) (Σᵢ xᵢ)² ]

      = (1 / (n − 1)) · [ Σᵢ (xᵢ − K)² − (1/n) (Σᵢ (xᵢ − K))² ]   for any fixed shift K (numerically more stable when K ≈ x̄)
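A sketch of the one-pass computation using the sum-of-squares identity above. (Caveat: this identity can lose precision when the mean is large relative to the spread; Welford's online algorithm is the numerically stable alternative.)

```python
def one_pass_variance(xs):
    """Unbiased sample variance from running sums, in a single pass."""
    n = 0
    s = 0.0    # running sum of x_i
    s2 = 0.0   # running sum of x_i squared
    for x in xs:
        n += 1
        s += x
        s2 += x * x
    return (s2 - s * s / n) / (n - 1)

print(one_pass_variance([2, 4, 4, 4, 5, 5, 7, 9]))   # 4.571...
```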

Law of Large Numbers

Law of Large Numbers

• Draw independent observations at random from any population with finite mean μ.

• As the number of observations increases, the sample mean approaches mean μ of the population.

• The more variation in the outcomes, the more trials are needed to ensure that x̄ is close to μ.

Weak law of large numbers: X̄ₙ → µ in probability

Strong law of large numbers: Pr( limₙ→∞ X̄ₙ = µ ) = 1
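A small simulation of the law of large numbers for coin tosses (heads = 1, tails = 0, so µ = 0.5):

```python
import random

# Running mean of fair-coin tosses; it drifts toward the population
# mean mu = 0.5 as the number of tosses grows.
total = 0
for n in range(1, 100_001):
    total += random.randint(0, 1)
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}   sample mean = {total / n:.4f}")
```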

Central Limit Theorem

Law of Large Numbers (Coin)

Convolution

Die 1: X    Die 2: Y    Dice 1+2: Z = X + Y

P(Z = z) = Σₖ₌₋∞^∞ P(X = k) · P(Y = z − k)
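The convolution sum, written out for two fair dice:

```python
from collections import Counter

# Distribution of Z = X + Y for two fair dice via the convolution sum.
die = {k: 1 / 6 for k in range(1, 7)}

z = Counter()
for x, px in die.items():
    for y, py in die.items():
        z[x + y] += px * py

for total in sorted(z):
    print(total, round(z[total], 4))   # peaks at Pr(Z = 7) = 6/36
```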

Tossing 2 Dice

[Figures: histograms of the distribution of a single draw X₁ and of the sums S₂, S₄, S₈, S₁₆, S₃₂, shown for three different starting distributions (a fair die on 1–6, a three-outcome variable on 1–3, and a six-outcome variable on 0–5). In each case the distribution of the sum looks more and more bell-shaped as terms are added.]

Normal Distribution

Probability Density Function

f(x) = 1 / (σ√(2π)) · e^(−(x − µ)² / (2σ²))   for X ~ N(µ, σ²)

Probability Density Function (PDF) vs. Cumulative Distribution Function (CDF)

The Central Limit Theorem

1. The distribution of sample means will be approximately a normal distribution for larger sample sizes

2. The mean of the distribution of means approaches the population mean μ for large sample sizes

3. The standard deviation of the distribution of means approaches σ/√n for large sample sizes, where σ is the standard deviation of the population and n is the sample size

The Central Limit Theorem Side Notes

1. For practical purposes, the distribution of means will be nearly normal if the sample size is larger than 30

2. If the original population is normally distributed, then the sample means are normally distributed for any sample size n, and their distribution becomes narrower as n grows

3. The original variable can have any distribution; it does not have to be a normal distribution
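A simulation sketch of these claims, using die rolls (population mean µ = 3.5, σ ≈ 1.708) as the original variable:

```python
import random
import statistics

# Means of n die rolls: their spread shrinks like sigma / sqrt(n) and
# their distribution becomes increasingly bell-shaped around mu = 3.5.
sigma = statistics.pstdev(range(1, 7))   # population sd ~ 1.7078

for n in (1, 4, 16, 64):
    means = [statistics.mean(random.randint(1, 6) for _ in range(n))
             for _ in range(10_000)]
    print(n, round(statistics.stdev(means), 3), round(sigma / n ** 0.5, 3))
```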

Shapes of Distributions as Sample Size Increases

Testing

Hypothesis Testing

The FDA or “science” needs to decide on a new theory, drug, treatment, ...
• H0: the null hypothesis: the current theory, drug, or treatment is as good or better
• H1: the alternative hypothesis: the new theory, drug, or treatment should replace the old one

Researchers do not know which hypothesis is true. They must make a decision on the basis of the evidence presented.

Source: Statistics for Business and Economics, 6e, © 2007 Pearson Education, Inc.

What is a Hypothesis?

• A hypothesis is a claim (assumption) about a population parameter:
  – Population mean. Example: the mean monthly cell phone bill of this city is μ = $42
  – Population proportion. Example: the proportion of adults in this city with cell phones is p = .68

The Null Hypothesis, H0

• Always states a claim about a population parameter, never a sample statistic:

  H0: μ = 3    (correct: population parameter)
  H0: X̄ = 3    (wrong: sample statistic)


Hypothesis Testing Process

Claim: the population mean age is 50 (null hypothesis H0: μ = 50).

Now select a random sample. Suppose the sample mean age is X̄ = 20.

Is X̄ = 20 likely if μ = 50? If not likely, REJECT the null hypothesis.

Reason for Rejecting H0



Outcomes and Probabilities

Possible hypothesis test outcomes (key: outcome, with its probability):

                        Actual situation
Decision                H0 true               H0 false
Do not reject H0        No error (1 − α)      Type II error (β)
Reject H0               Type I error (α)      No error (1 − β)


Level of Significance and the Rejection Region

The level of significance is α; the rejection region (shaded in the original figure) is bounded by the critical value.

Lower-tail test:  H0: μ ≥ 3, H1: μ < 3   (rejection region of size α in the lower tail)
Upper-tail test:  H0: μ ≤ 3, H1: μ > 3   (rejection region of size α in the upper tail)
Two-tail test:    H0: μ = 3, H1: μ ≠ 3   (rejection region of size α/2 in each tail)


p-Value Approach to Testing

• p-value: the probability of obtaining a test statistic at least as extreme (≤ or ≥) as the observed sample value, given that H0 is true
  – Also called the observed level of significance
  – The smallest value of α for which H0 can be rejected


Calculate the p-value and compare it to α (e.g., α = .10): reject H0 if the p-value is smaller than α, otherwise do not reject H0.
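A minimal sketch of the p-value approach as a two-tailed one-sample z-test; all numbers here are illustrative and σ is assumed known:

```python
from statistics import NormalDist

# Two-tailed one-sample z-test: H0: mu = 50 against H1: mu != 50.
mu0, sigma = 50, 10          # hypothesized mean, known population sd
n, xbar = 25, 45.8           # sample size and observed sample mean

z = (xbar - mu0) / (sigma / n ** 0.5)
p_value = 2 * NormalDist().cdf(-abs(z))

alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```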

Anders Pape Møller, 1991: “female barn swallows were far more likely to mate with male birds that had long, symmetrical feathers.” “Between 1992 and 1997, the average effect size shrank by eighty per cent.”

Joseph Rhine, 1930s, coiner of the term “extrasensory perception”: tested individuals with card-guessing experiments. A few students achieved multiple low-probability streaks, but there was a “decline effect”: their performance became worse over time.

Jonah Lehrer, 2010, The New Yorker, “The Truth Wears Off”: http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer

John Davis, University of Illinois: “Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.”

Jonathan Schooler, 1990: “subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it.” The effect became increasingly difficult to measure.

Reason 1: Publication Bias


“In the last few years, several meta-analyses have reappraised the efficacy and safety of antidepressants and concluded that the therapeutic value of these drugs may have been significantly overestimated.”

Publication bias: What are the challenges and can they be overcome?Ridha Joober, Norbert Schmitz, Lawrence Annable, and Patricia BoksaJ Psychiatry Neurosci. 2012 May; 37(3): 149–152. doi: 10.1503/jpn.120065

“Although publication bias has been documented in the literature for decades and its origins and consequences debated extensively, there is evidence suggesting that this bias is increasing.”

“A case in point is the field of biomedical research in autism spectrum disorder (ASD), which suggests that in some areas negative results are completely absent”

(emphasis mine)

“… a highly significant correlation (R2= 0.13, p < 0.001) between impact factor and overestimation of effect sizes has been reported.”

Publication Bias


“decline effect”


“decline effect” = publication bias!

Background: Effect Size

• Expressed in relevant units
• Not just “significant”: how significant?
• Used prolifically in meta-analysis to combine results from multiple studies
  – But be careful: averaging results from different experiments can produce nonsense

Robert Coe, 2002, Annual Conference of the British Educational Research Association: “It's the Effect Size, Stupid: What effect size is and why it is important”

Effect size = ([mean of experimental group] − [mean of control group]) / standard deviation

Caveat: other definitions of effect size exist (odds ratio, correlation coefficient).

Effect Size

• Standardized Mean Difference


Lots of ways to estimate the pooled standard deviation (e.g., Hartung et al., 2008; Glass, 1976)
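A sketch of the standardized mean difference with one common pooling choice; the samples below are made up for illustration:

```python
import statistics

def effect_size(experimental, control):
    """Standardized mean difference with a pooled standard deviation.

    One common pooling choice; others exist (e.g. Glass's delta uses
    only the control group's standard deviation).
    """
    n1, n2 = len(experimental), len(control)
    pooled_var = ((n1 - 1) * statistics.variance(experimental)
                  + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
    return ((statistics.mean(experimental) - statistics.mean(control))
            / pooled_var ** 0.5)

print(effect_size([23, 25, 28, 30, 32], [20, 22, 24, 26, 28]))  # ~1.05
```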

Effect size: Cohen’s Heuristic

• Standardized mean difference effect size:
  – small = 0.20
  – medium = 0.50
  – large = 0.80


Reason 3: Multiple Hypothesis Testing

• If you perform experiments over and over, you’re bound to find something

• This is a bit different from the publication bias problem: same sample, different hypotheses

• Significance level must be adjusted down when performing multiple hypothesis tests



P(detecting an effect when there is none) = α = 0.05

P(not detecting an effect when there is none) = 1 − α

P(not detecting an effect when there is none, on any of k independent experiments) = (1 − α)^k

P(detecting an effect when there is none on at least one experiment) = 1 − (1 − α)^k

With α = 0.05, this “familywise error rate” grows quickly with k.
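In code, the familywise error rate for k independent tests at per-test level α:

```python
# Familywise error rate: the chance of at least one false positive
# across k independent tests, each run at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 100):
    print(f"k = {k:>3}   1 - (1 - alpha)**k = {1 - (1 - alpha) ** k:.3f}")
```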

Familywise Error Rate Corrections

• Bonferroni correction
  – Just divide α by the number of hypotheses m: test each at α/m
• Šidák correction
  – Assumes independence: test each at 1 − (1 − α)^(1/m)
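And the corresponding per-test thresholds under each correction (m hypotheses, target familywise level α):

```python
# Per-test significance thresholds that keep the familywise error
# rate at alpha across m hypotheses.
alpha, m = 0.05, 10
bonferroni = alpha / m                  # 0.005
sidak = 1 - (1 - alpha) ** (1 / m)      # ~0.00512, assumes independent tests
print(bonferroni, sidak)
```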


Summary

• Stochastic variables
• Basics in statistics
• Bayes’ Law
• Central Limit Theorem
• Law of Large Numbers
• Testing
