consuming statistical data

Consuming statistical data Badania Operacyjne

Upload: bertha

Post on 08-Feb-2016

Page 1: Consuming statistical  data

Consuming statistical data

Badania Operacyjne

Page 2: Consuming statistical  data

Statistical pitfalls

1. Confounding conditional probabilities
• Suppose that you are about to board a flight and that you fear the plane might be blown up by a bomb
• An old joke suggests that you take a bomb on the plane with you, because the probability of two bombs is really low

Assuming independence of your action and the other passengers’ actions, the probability of there being two bombs conditional on your bringing one is equal to the probability of one bomb conditional on your not bringing one.
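The independence argument can be written out in one line. The notation is an assumption made for this sketch and is not on the slide: A is the event that you bring a bomb, and B is the event that some other passenger brings one (for simplicity, at most one other passenger does).

```latex
\begin{align*}
P(\text{two bombs} \mid A)      &= P(B \mid A)      = P(B), \\
P(\text{one bomb} \mid \neg A)  &= P(B \mid \neg A) = P(B).
\end{align*}
```

Both conditional probabilities reduce to P(B), so bringing your own bomb does nothing to the risk you actually care about.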

Page 3: Consuming statistical  data

Problem 2

• You are going to play roulette. You first sit there and observe, and you notice that the last five times it came up “black.”

• Would you bet on “red” or on “black”?

The same applies to betting on 1, 2, 3, 4, 5, 6 in a state lottery.
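A quick simulation can make the roulette question concrete. This is a minimal sketch that assumes a European wheel (18 red, 18 black, 1 green pocket) and independent spins; the function name and parameters are illustrative only. It estimates how often black comes up right after a run of at least five blacks.

```python
import random

def black_after_streak(n_spins=400_000, streak=5, seed=0):
    """Estimate P(black | the previous `streak` spins were all black)."""
    rng = random.Random(seed)
    # Assumed wheel layout: 18 red, 18 black, 1 green pocket.
    pockets = ["red"] * 18 + ["black"] * 18 + ["green"]
    run = 0         # length of the current run of consecutive blacks
    followups = 0   # spins that came right after a long enough run
    blacks = 0      # how many of those spins were black again
    for _ in range(n_spins):
        colour = rng.choice(pockets)
        if run >= streak:
            followups += 1
            blacks += colour == "black"
        run = run + 1 if colour == "black" else 0
    return blacks / followups

print("after a streak of blacks:", round(black_after_streak(), 3))
print("unconditional 18/37:     ", round(18 / 37, 3))
```

Under these assumptions the two printed values should be close: the wheel has no memory, so the streak gives no reason to prefer either colour.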

Page 4: Consuming statistical  data

Statistical pitfalls

2. Ignoring base probabilities: Problem 1
• You are concerned that you might have a disease
• You are going to take a test that:
– If you have the disease, will show it with prob. 95%
– If you don’t, might still be positive with prob. 10%
• Assume that you took the test and you tested positive.
• What is the probability of actually having the disease?

Page 5: Consuming statistical  data

Statistical pitfalls

• Suppose the disease is known to be extinct, so P(D)=0; then P(D|T)=0
• And if P(D)=0.9, then P(D|T)=0.9884
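The two values on this slide follow from Bayes' rule, using the 95% and 10% test probabilities from Problem 1. The sketch below shows how strongly the answer depends on the base probability P(D). The function name is illustrative, and the priors 0.01 and 0.5 are hypothetical values added for comparison.

```python
def posterior(prior, sensitivity=0.95, false_positive_rate=0.10):
    """Return P(D|T): the probability of disease given a positive test."""
    true_pos = sensitivity * prior                    # P(T and D)
    false_pos = false_positive_rate * (1.0 - prior)   # P(T and not D)
    if true_pos + false_pos == 0.0:
        raise ValueError("a positive test has probability zero")
    return true_pos / (true_pos + false_pos)

# 0.0 and 0.9 are the slide's cases; 0.01 and 0.5 are hypothetical.
for p in (0.0, 0.01, 0.5, 0.9):
    print(f"P(D) = {p:<4}  ->  P(D|T) = {posterior(p):.4f}")
```

With a rare disease (small P(D)) most positive results are false positives, which is exactly the base probability that is easy to ignore.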

Page 6: Consuming statistical  data

Graphical intuition

Page 7: Consuming statistical  data

Another example - prejudice

Assume that most of the top squash players are Pakistani.

It does not mean that most Pakistanis are top squash players.

Yet people often make this mistake.

Page 8: Consuming statistical  data

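Statistical pitfalls

3. Biased samples (correlation)
• 1936 US presidential election: Roosevelt (Democrat) vs Landon (Republican)
– Opinion poll in Literary Digest
– The poll relied on car and telephone registration lists
– Car and telephone owners were wealthier than the average voter, so the sample overrepresented Landon supporters; the poll predicted a Landon win, but Roosevelt won
• Another example: members of some ultraconservative party refuse to respond to the pollsters’ questions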

Page 9: Consuming statistical  data

Statistical pitfalls

3. Biased samples (sampling procedure)
• Problem 4: Family size:
– I wish to find the average number of children in a family
– I go to a school, randomly select several dozen children, and ask them how many siblings they have
– I compute the average
• The bias stems from my very choice of sampling children in schools: a family with many children is more likely to appear in the sample, and a family with no children cannot appear at all
• We can immediately see that this is wrong without any information on correlation
• Notice that the sample is not biased if you want to answer the question:
– “How many children (including yourself) do you grow up with?”
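The size of this bias can be computed exactly for any list of family sizes. In the sketch below the list `sizes` is hypothetical and the function names are illustrative. Sampling a child at random picks a family with probability proportional to its number of children, so the child-based average is a size-weighted average.

```python
def family_average(sizes):
    """Average number of children per family (what I wanted to find)."""
    return sum(sizes) / len(sizes)

def child_sampled_average(sizes):
    """Average family size as reported by a randomly chosen child.

    A family with s children is reached with probability proportional
    to s, and each of its children reports s (siblings plus oneself).
    """
    total_children = sum(sizes)
    return sum(s * s for s in sizes) / total_children

# Hypothetical population of nine families.
sizes = [0, 1, 1, 2, 2, 2, 3, 4, 6]
print("per family:", round(family_average(sizes), 2))
print("per child: ", round(child_sampled_average(sizes), 2))
```

The per-child figure is the right answer to the question “how many children do you grow up with?” and the wrong answer to “how many children does a family have?”.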

Page 10: Consuming statistical  data

Statistical pitfalls

3. Biased samples (sampling procedure)
• Waiting time
– I wish to estimate the average time between the arrival of two consecutive buses
– I go to the bus stop, measure the time until the next bus arrives, and multiply it by two
• A bus that happens to take longer to arrive has a higher probability of appearing in my sample
• If I wished to estimate the waiting time for a passenger who arrives at the stop at a random moment, the sample would not be biased
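The bus example can be checked with a small simulation. This sketch assumes, purely for illustration, that the gaps between buses are exponentially distributed with a mean of 10 minutes; the slide does not specify a distribution, and the function name and parameters are hypothetical.

```python
import bisect
import itertools
import random

def simulate(n_buses=100_000, n_passengers=20_000, mean_gap=10.0, seed=1):
    """Return (true mean gap between buses, 2 x mean observed wait)."""
    rng = random.Random(seed)
    # Assumed model: independent exponential gaps between buses.
    gaps = [rng.expovariate(1.0 / mean_gap) for _ in range(n_buses)]
    arrivals = list(itertools.accumulate(gaps))  # bus arrival times
    true_mean_gap = arrivals[-1] / n_buses
    total_wait = 0.0
    for _ in range(n_passengers):
        # The observer shows up at a uniformly random moment ...
        t = rng.uniform(0.0, arrivals[-1])
        # ... and waits for the first bus arriving at or after t.
        next_bus = arrivals[bisect.bisect_left(arrivals, t)]
        total_wait += next_bus - t
    mean_wait = total_wait / n_passengers
    return true_mean_gap, 2.0 * mean_wait

true_gap, doubled_wait = simulate()
print("true mean gap:     ", round(true_gap, 2))
print("2 x mean wait time:", round(doubled_wait, 2))
```

Under this assumption the doubled waiting time overshoots the true gap considerably, because a random moment is more likely to fall inside a long gap. The undoubled mean wait remains a fair estimate of what a randomly arriving passenger experiences.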

Italians vs Europeans

Page 11: Consuming statistical  data

Problem 3

• A study of students’ grades in the United States showed that immigrants had, on average, a higher grade point average than US-born students. The conclusion was that Americans are not very smart, or at least do not work very hard, as compared with other nationalities.

• What do you think?

Page 12: Consuming statistical  data

Statistical pitfalls

3. Biased samples (sampling procedure)
• Problem 5: Winner’s curse:
– We are conducting an auction for a good that has a common value (e.g. an oil field)
– This common value is not known with certainty
– Assume that each firm gets an estimate of the worth of the oil field and submits it as its bid
– The estimates they get are statistically unbiased
– If there is only one firm, its expected payoff is zero
– If there are several firms, the firm that wins the bid is likely to lose money
• Think of “winning the auction” as a sampling procedure
• The winning bids that are sampled are not representative of the whole “population” of bids
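The winner's curse is easy to reproduce numerically. The sketch below makes several assumptions that are not on the slide: the true value is 100, each firm's estimate is the true value plus independent normal noise with standard deviation 20, every firm bids exactly its estimate, and the winner pays its own bid. All names and numbers are hypothetical.

```python
import random

def mean_winner_payoff(n_firms, true_value=100.0, noise_sd=20.0,
                       trials=20_000, seed=5):
    """Average payoff (true value minus winning bid) over many auctions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is unbiased: true value plus zero-mean noise.
        bids = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_firms)]
        # The highest bid wins and is paid in full (assumed auction rule).
        total += true_value - max(bids)
    return total / trials

for n in (1, 2, 5, 10):
    print(f"{n:>2} firms: mean payoff of the winner = {mean_winner_payoff(n):.1f}")
```

Every individual estimate is unbiased, yet the maximum of several estimates is biased upward, so the average payoff of the winner becomes more negative as the number of bidders grows.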

Page 13: Consuming statistical  data

Statistical pitfalls

4. Regression to the mean
• “Regression” refers to the process of fitting a curve to data points, under the assumption that there is some inherent noise in the data-generating process. Because of that noise, a simple curve is often preferable to a more complex curve that matches the data points exactly (overfitting)
• Historically, linear regression was first used to explain the height of men by the height of their fathers.
• The line was increasing but the slope was less than one – hence “regression”
– The height depends on genes
– And on all other things (in the absence of info about them, let’s put them all together and call them noise)
– Assume that the noise is independent of the father’s height.
– Then take a very tall man
– He will pass his genes to his son, but not the noise
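The argument that the slope is below one can be written as a short derivation. The model is a deliberate simplification assumed for this sketch: the father's and son's heights (measured as deviations from the mean) share the same genetic component G, and each has his own independent noise term. The symbols are not from the slides.

```latex
\begin{align*}
H_f &= G + \varepsilon_f, \qquad H_s = G + \varepsilon_s,
      \qquad G,\ \varepsilon_f,\ \varepsilon_s \text{ mutually independent}, \\
\beta &= \frac{\operatorname{Cov}(H_s, H_f)}{\operatorname{Var}(H_f)}
       = \frac{\operatorname{Var}(G)}{\operatorname{Var}(G) + \operatorname{Var}(\varepsilon_f)}
       < 1 \quad \text{whenever } \operatorname{Var}(\varepsilon_f) > 0 .
\end{align*}
```

The slope is positive because the genes are shared and smaller than one because the father's noise is not, so the son of a very tall man is expected to be tall, but less exceptionally so.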

Page 14: Consuming statistical  data

Statistical pitfalls

4. Regression to the mean
• Suppose that you select students by their grades on an examination, and assign the best to a separate class
• After a year you check their progress
• You would expect them to do better than the average student,
• But you would also expect them, on average, to do below their previous level.
• This is because of the way you selected them
– Talent will be robust
– Noise (luck on the day of exam, etc.) will not be
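The selected-class example can be simulated directly. This sketch assumes a hypothetical model in which each exam score is a stable talent plus fresh, independent luck on each exam, both normally distributed; the numbers and names are illustrative only.

```python
import random

def simulate(n=20_000, top_fraction=0.10, seed=2):
    """Return (population mean, selected group's 2nd exam, their 1st exam)."""
    rng = random.Random(seed)
    talent = [rng.gauss(70.0, 10.0) for _ in range(n)]   # robust component
    exam1 = [t + rng.gauss(0.0, 10.0) for t in talent]   # talent + luck
    exam2 = [t + rng.gauss(0.0, 10.0) for t in talent]   # same talent, new luck
    k = int(n * top_fraction)
    # Select the best students by their first exam only.
    best = sorted(range(n), key=lambda i: exam1[i], reverse=True)[:k]
    first = sum(exam1[i] for i in best) / k
    second = sum(exam2[i] for i in best) / k
    population = sum(exam2) / n
    return population, second, first

population, second, first = simulate()
print("population mean, exam 2:", round(population, 1))
print("selected group, exam 2: ", round(second, 1))
print("selected group, exam 1: ", round(first, 1))
```

The selected group stays above the population average (talent is robust) but falls below its own first result (the luck that helped it get selected does not repeat), without any change in teaching or effort.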

Page 15: Consuming statistical  data

Statistical pitfalls

4. Regression to the mean
• Your friend tells you that you must see a movie that’s just out: “It’s the best movie I have ever seen”
• You select a political leader or investment consultant based on their past performance

Page 16: Consuming statistical  data

Problem 6

[At a restaurant]
• ANN: I hate it. It’s just like I told you: they don’t make an effort anymore.
• BARBARA: They?
• ANN: Just taste it. It’s really bad food. Don’t you remember how it was the first time we were here?
• BARBARA: Well, maybe you’re tired.
• ANN: Do you like your dish?
• BARBARA: Well, it isn’t bad. Maybe not as good as last time, but…
• ANN: You see? They first make an effort to impress and lure us, and then they think that we’re anyway going to come back. No wonder that so many restaurants shut down after less than a year.
• BARBARA: Well, I’m not sure that this restaurant is so new.
• ANN: It isn’t?
• BARBARA: I don’t think so. Jim mentioned it to me a long time ago, it’s only us who didn’t come here for so long.
• ANN: So how did they know they should have impressed us the first time and how did they know it’s our second time now? Do you think the waiter was telling the chef, “Two sirloins at no. 14, but don’t worry about it, they’re here for the second time”?

Page 17: Consuming statistical  data

Statistical pitfalls

5. Correlation and causation
• Two variables are (positively) correlated if they tend to assume high values together and low values together
– We measure it by the covariance and the correlation coefficient
• Causality is a much trickier concept because it involves counterfactuals, namely statements of the type:
– “X is high and so is Y; but had X been low, Y would have been low, too.”

Page 18: Consuming statistical  data

Problem 7

• Studies show a high correlation between years of education and annual income. Thus, argued your teacher, it’s good for you to study: the more you do, the more money you will make in the future.

• Is this conclusion warranted?
1. More education → more income (X → Y)
2. More income → more education (Y → X)
3. Rich parents → more education and more income (Z → X, Y)
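The third explanation can be illustrated with a simulation in which education has no effect on income at all, yet the two are clearly correlated. The model is hypothetical: a common factor Z (parents' wealth) feeds into both X (education) and Y (income), each with independent noise, and all variables are in arbitrary standardized units.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=20_000, seed=3):
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # parents' wealth (Z)
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]    # education: depends on Z only
    y = [zi + rng.gauss(0.0, 1.0) for zi in z]    # income: depends on Z only
    return pearson(x, y)

print("correlation between education and income:", round(simulate(), 2))
```

In this artificial world, studying more would not raise anyone's income, so the observed correlation alone cannot justify the teacher's advice.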

Page 19: Consuming statistical  data

Problem 8

• In a recent study, it was found that people who did not smoke at all had more visits to their doctors than people who smoked a little bit. One researcher claimed: “Apparently, smoking is just like consuming red wine – too much of it is dangerous, but a little bit is actually good for your health!”

• Do you accept this conclusion?

Page 20: Consuming statistical  data

Statistical pitfalls

5. Correlation and causation
• You want to measure the effect that smoking (X) has on general health (Y)
• Linear regression gives you the correlation between the two
• This correlation may be the result of:
– X affecting Y
– Y affecting X
– Another variable Z affecting both X and Y
– Pure chance
• We can choose an instrument, a variable that:
– Is correlated with X
– Is not correlated with the other factors that affect Y, so that it can affect Y only through X
• For example, tobacco tax: if tobacco taxes only affect health because they affect smoking (holding other variables in the model fixed), a correlation between tobacco taxes and health is evidence that smoking causes changes in health. An estimate of the effect of smoking on health can be made by also making use of the correlation between taxes and smoking patterns.
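The instrument idea can be sketched with simulated data. Everything in the model below is an assumption made for illustration: the tax lowers smoking, an unobserved factor raises smoking and independently harms health, and the true effect of smoking on health is set to -2. The instrumental-variable estimate used here is the simple ratio Cov(tax, health) / Cov(tax, smoking).

```python
import random

def cov(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

def simulate(n=50_000, true_effect=-2.0, seed=4):
    """Return (naive regression slope, instrumental-variable estimate)."""
    rng = random.Random(seed)
    tax, smoking, health = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)    # tobacco tax (the instrument)
        u = rng.gauss(0.0, 1.0)    # unobserved factor affecting both X and Y
        x = -1.0 * z + u + rng.gauss(0.0, 1.0)              # smoking
        y = true_effect * x - 3.0 * u + rng.gauss(0.0, 1.0)  # health
        tax.append(z)
        smoking.append(x)
        health.append(y)
    naive = cov(smoking, health) / cov(smoking, smoking)
    iv = cov(tax, health) / cov(tax, smoking)
    return naive, iv

naive, iv = simulate()
print("naive regression slope:", round(naive, 2))
print("instrument-based slope:", round(iv, 2))
```

In this setup the naive slope overstates the harm because the hidden factor contributes to it, while the instrument-based ratio should land near the assumed true effect. It does so only because the simulated tax affects health through smoking alone, which is the condition the slide requires.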

Page 21: Consuming statistical  data

Correlation

[Figure: two charts, each plotting Series1 and Series2 against observations 1 to 24; the first has a vertical axis from -60 to 60, the second from 0 to 60.]

Page 22: Consuming statistical  data

Statistical pitfalls

6. Statistical significance

Problem 9

Comment on the following.
• CHARLES: I don’t use a mobile phone anymore.
• DANIEL: Really? Why?
• CHARLES: Because it was found to be correlated with brain cancer.
• DANIEL: C’mon, you can’t be serious. I asked an expert and they said that the effect is so small that it’s not worth thinking about.
• CHARLES: As long as you have something to think with. Do as you please, but I’m not going to kill myself.
• DANIEL: Fine, it’s your decision. But I tell you, the effects that were found were insignificant.
• CHARLES: Insignificant? They were significant at the 5% level!

Even if the use of mobile phones increases the probability of brain tumors from 0.0000302 to 0.0000303 with large enough sample