Assignment: Review for Exam #2, Wednesday, Oct. 19, Chapters 10, 11, 12, 13, 16


Upload: alexandra-zane

Post on 03-Jan-2016


Page 1: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Oct. 17 Statistic for the Day: In 1996, the percentages of 16-24 yr old high school finishers enrolled in college were

49% for lower income families
63% for middle income families
78% for higher income families

Assignment: Review for Exam #2, Wednesday, Oct. 19
Chapters 10, 11, 12, 13, 16

Page 2: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Arby’s sandwiches

No.  Sandwich                            Weight (g)  Calories
1    Big Montana                         309         590
2    Giant Roast Beef                    224         450
3    Regular Roast Beef                  154         320
4    Beef ‘n Cheddar                     195         440
5    Super Roast Beef                    230         440
6    Junior Roast Beef                   125         270
7    Chicken Breast Fillet               233         500
8    Chicken Bacon ‘n Swiss              209         550
9    Roast Chicken Club                  228         470
10   Market Fresh Turkey Ranch Bacon     379         830
11   Market Fresh Ultimate BLT           293         780
12   Market Fresh Roast Beef Swiss       357         780
13   Market Fresh Roast Ham Swiss        357         700
14   Market Fresh Roast Turkey Swiss     357         720
15   Market Fresh Chicken Salad          322         770

Page 3: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

This type of plot, with two measurements per subject, is called a scatterplot (see p. 166).

[Scatterplot: Arby's Sandwiches. x-axis: weight (150 to 350); y-axis: calories (300 to 800).]

Page 4: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The correlation measures the strength of the linear relationship between weight and calories.

More on this in the next class.

[Scatterplot: Arby's Sandwiches. x-axis: weight (150 to 350); y-axis: calories (300 to 800).]

Correlation = 0.94

Page 5: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The best-fitting line through the data is called the regression line.

How should we describe this line?

[Scatterplot: Arby's Sandwiches. x-axis: weight (150 to 350); y-axis: calories (300 to 800).]

Page 6: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The intercept is 18 in this case and the slope is 2.1.

In this class, you don’t need to know how to calculate the slope and intercept (but see p. 195 if you like formulas).

[Scatterplot: Arby's Sandwiches. x-axis: weight (150 to 350); y-axis: calories (300 to 800).]

cal = 18 + (2.1)(wt)
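For readers who do like formulas, here is a minimal sketch of where the slope, intercept, and correlation come from. It applies the standard least-squares formulas to the 15 sandwiches listed on Page 2; the variable names are illustrative and are not taken from the course materials.

```python
from math import sqrt

# Weight (g) and calories for the 15 Arby's sandwiches listed on Page 2.
weights = [309, 224, 154, 195, 230, 125, 233, 209, 228, 379, 293, 357, 357, 357, 322]
calories = [590, 450, 320, 440, 440, 270, 500, 550, 470, 830, 780, 780, 700, 720, 770]

n = len(weights)
mean_w = sum(weights) / n
mean_c = sum(calories) / n

# Sums of squared deviations and cross-products about the means.
sxx = sum((w - mean_w) ** 2 for w in weights)
syy = sum((c - mean_c) ** 2 for c in calories)
sxy = sum((w - mean_w) * (c - mean_c) for w, c in zip(weights, calories))

slope = sxy / sxx                      # calories per extra gram
intercept = mean_c - slope * mean_w    # predicted calories at weight 0
correlation = sxy / sqrt(sxx * syy)

print(f"cal = {intercept:.0f} + ({slope:.1f})(wt), correlation = {correlation:.2f}")
```

The rounded results should agree with the slide values (intercept about 18, slope about 2.1, correlation about 0.94).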

Page 7: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

calories = 18 + (2.1)(weight in grams)

Here 18 is the intercept and 2.1 is the slope.

For example, if you have a 200g sandwich, on the average you expect to get about:

18 + (2.1)(200) = 18 + 420 = 438 calories

For a 350g sandwich:

18 + (2.1)(350) = 18 + 735 = 753 calories
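The two predictions on this slide can be reproduced with a short sketch. The function name `predicted_calories` is made up for illustration; the intercept and slope are the rounded values from the slide.

```python
INTERCEPT = 18   # calories, from the slide's regression line
SLOPE = 2.1      # calories per gram, from the slide's regression line

def predicted_calories(weight_g):
    """Expected calories for a sandwich of the given weight in grams."""
    return INTERCEPT + SLOPE * weight_g

for weight in (200, 350):
    print(weight, "g ->", round(predicted_calories(weight)), "calories")
```

Both weights lie inside the observed range of roughly 125 g to 379 g, so neither prediction is an extrapolation.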

Page 8: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

calories = 18 + (2.1)(weight in grams)

Here 18 is the intercept and 2.1 is the slope.

For every extra gram of weight, you expect an increase of 2.1 calories in your Arby’s sandwich.

Interpretation of slope: Expected increase in the response variable for every unit increase (increase of one) in the explanatory variable.

Page 9: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Facts about Correlation:

+1 means perfect increasing linear relationship
-1 means perfect decreasing linear relationship
0 means no linear relationship
+ means increasing together
- means one increases and the other decreases

Page 10: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Strength vs. statistical significance

Even a weak relationship can be statistically significant (if it is based on a large sample).

Even a strong relationship can be statistically insignificant (if it is based on a small sample).

Page 11: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Regression potential pitfalls:

Sometimes we see a strong relationship in absurd examples; two seemingly unrelated variables have a high correlation.

This often signals the presence of a third variable that is highly correlated with the other two (confounding). Remember that correlation does not imply causation.

Also: If you use a regression for prediction, do not extrapolate too far beyond the range of the observed data.

Page 12: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Vocabulary vs Shoe Size

[Regression plot. x-axis: Shoe Size (2 to 6); y-axis: Known Words (0 to 2500).]

Y = -806 + 555 X

Correlation = .985

Page 13: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Outliers are data that are not compatible with the bulk of the data.

They show up in graphical displays as detached or stray points.

Sometimes they indicate errors in data input. Some experts estimate that roughly 5% of all data entered is in error.

Sometimes they are the most important data points.

Page 14: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Put Options (NYTimes, September 26, 2001)

Put options on stocks give buyers the right to sell stock at a specified price during a certain time. They rise in value if the underlying stock falls below the strike price.

The value of puts on airline stocks soared on Sept. 17 when U.S. stock and options markets reopened after a four-day closure, as airline stocks slid as much as 40 percent.

American Airlines was at $32 prior to attack. Suppose a terrorist buys a put option (at say $5 per share) to have the right to sell at $25. The price after the attack was at $16. That put option is now more valuable.
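The arithmetic behind "more valuable" can be sketched as follows. The numbers are the hypothetical ones from the slide; the function name is illustrative, and the sketch ignores any time value of the option and trading costs.

```python
def put_value_at_exercise(strike, stock_price):
    """Per-share value of exercising a put: sell at the strike, buy at the market price."""
    return max(strike - stock_price, 0)

premium = 5             # hypothetical price paid per share for the option (from the slide)
strike = 25             # price at which the holder has the right to sell
before, after = 32, 16  # stock price before and after the attack

# Before: no reason to sell at 25 when the market pays 32.
value_before = put_value_at_exercise(strike, before)
# After: the holder can sell at 25 a share that costs 16 on the market.
value_after = put_value_at_exercise(strike, after)
# Gain per share once the cost of the option is subtracted.
net_gain = value_after - premium

print(value_before, value_after, net_gain)
```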

Page 15: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

[Regression plot with 95% PI. x-axis: machine (0 to 7500); y-axis: absentee (-1000 to 1000).]

absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2

S = 294.363, R-Sq = 62.0%, R-Sq(adj) = 57.8%

R wins machine (D minus R negative for machine)
D wins absentee (D minus R positive for absentee)

From story on p. 442

Page 16: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Outliers affect regression lines and correlation (these data aren’t real):

[Scatterplot: Exercise minutes vs. GPA, with two outlying points labeled A and B. x-axis: Exercise (0 to 10000); y-axis: GPA (1.0 to 4.0).]

Black line: With A and B

Red line: Without A, with B

Green line: Without A or B

Page 17: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Two categorical variables:
Explanatory variable: Sex
Response variable: Body Pierced or Not

Survey question: Have you pierced any other part of your body? (Except for ears)

Research Question: Is there a significant difference between women and men at PSU in terms of body piercing?

Page 18: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Data:

Explanatory: Sex
Response: Body Pierced?

          No     Yes    All
Women     86     52     138
Men       77      5      82
All      163     57     220

From STAT 100, fall 2005 (missing responses omitted)

Page 19: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Percentages

Response: body pierced?

          no        yes       All
female    62.32%    37.68%    100.00%
male      93.90%     6.10%    100.00%
All       74.09%    25.91%    100.00%

62.32% = 86 / 138
93.90% = 77 / 82

Research question: Is there a significant difference between women and men? (i.e., between 62.32% and 93.90%)
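The percentage table comes straight from the counts on the previous slide. A minimal sketch of that calculation, with illustrative names:

```python
# Observed counts from the body-piercing data: (No, Yes) for each sex.
counts = {"female": (86, 52), "male": (77, 5)}

def row_percentages(no, yes):
    """Return the (no %, yes %) split of one row of the table."""
    total = no + yes
    return 100 * no / total, 100 * yes / total

pct = {sex: row_percentages(*row) for sex, row in counts.items()}

# The "All" row pools both sexes before taking percentages.
all_no = sum(row[0] for row in counts.values())
all_yes = sum(row[1] for row in counts.values())
pct["all"] = row_percentages(all_no, all_yes)

for label, (p_no, p_yes) in pct.items():
    print(f"{label:6s} no {p_no:.2f}%  yes {p_yes:.2f}%")
```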

Page 20: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The Debate:

The research advocate claims that there is a significant difference.

The skeptic claims there is no real difference. The data differences simply happen by chance, since we’ve selected a random sample.

Page 21: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The strategy for determining statistical significance:

First, figure out what you expect to see if there is no difference between females and males.
Second, figure out how far the data is from what is expected.
Third, decide if the distance in the second step is large.
Fourth, if large then claim there is a statistically significant difference.

Page 22: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rows: Sex   Columns: Marijuana

          No     Yes    All
Female    56      76    132
Male      31      46     77
All       87     122    209

Exercise: Follow the 4 steps and answer the Research Question: Is there a statistically significant difference between males and females in terms of the percent who have used marijuana?

Data from STAT 100 fall 2005

Page 23: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Step 1: Find expected counts if the skeptic is correct

This step is based on the marginal totals:

          No     Yes
Women     A      B      132
Men       C      D       77
          87     122    209

A = (132 × 87) / 209 = 54.95 (Repeat for B, C, D)

Page 24: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Step 1 cont’d

Repeat the process for B (and then C and D):

          No       Yes
Women     54.95    B      132
Men       C        D       77
          87       122    209

B = (132 × 122) / 209 = 77.05

Or you can simply subtract: 132 – 54.95 = 77.05

Page 25: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

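Step 1 cont’d

Marijuana?           No       Yes      All
Female   observed    56       76       132
         expected    54.95    77.05    132.00
Male     observed    31       46       77
         expected    32.05    44.95    77.00
Total                87       122      209

Observed counts (green on the slide) are the first row for each sex; expected counts if the skeptic is correct (red on the slide) are the second row.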
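Step 1 can be sketched in a few lines: each expected count is (row total × column total) / grand total. The variable names below are illustrative.

```python
# Observed counts from the marijuana data.
observed = [[56, 76],   # Female: No, Yes
            [31, 46]]   # Male:   No, Yes

row_totals = [sum(row) for row in observed]          # 132, 77
col_totals = [sum(col) for col in zip(*observed)]    # 87, 122
grand_total = sum(row_totals)                        # 209

# Expected count for each cell if sex and response were unrelated.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(x, 2) for x in row])
```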

Page 26: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Step 2: How far are the data (observed counts) from what is expected?

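(56 - 54.95)^2 / 54.95 = 0.020

(76 - 77.05)^2 / 77.05 = 0.014

(31 - 32.05)^2 / 32.05 = 0.034

(46 - 44.95)^2 / 44.95 = 0.025

Chi-Sq = 0.020 + 0.014 + 0.034 + 0.025 = 0.093

In each term, the first number is the observed count (green on the slide) and the second is the expected count if the skeptic is correct (red on the slide).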
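Step 2 sums (observed - expected)^2 / expected over the four cells. The sketch below, with illustrative names, does this without rounding the expected counts first, so its result can differ from the slide's hand calculation in the third decimal place.

```python
observed = [[56, 76],   # Female: No, Yes
            [31, 46]]   # Male:   No, Yes

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - exp) ** 2 / exp   # one cell's contribution

print(f"Chi-Sq = {chi_sq:.3f}")
```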

Page 27: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Step 3: Is the distance in step 2 large?

Something is large when it is in the outer 5% tail of the appropriate distribution.

[Plot: Chi-squared distribution with 1 degree of freedom. 95% of the area lies on one side of the cutoff and 5% on the other. Cutoff = 3.84.]

If chi-squared statistic is larger than 3.84, it is declared large and the research advocate wins.

Our chi-squared value: 0.093 (from Step 2)

Page 28: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Step 4: If distance is large, claim statistically significant difference.

Rows: Sex   Columns: marijuana

          No            Yes           All
Female    56 (42.4%)    76 (57.6%)    132 (100.0%)
Male      31 (40.3%)    46 (59.7%)     77 (100.0%)

Hence, the difference: 57.6% of women versus 59.7% of men is not statistically significant in this case. (Sample size has been automatically considered!)

Page 29: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

How many degrees of freedom here?

          Too Young    No        Yes
Women     One df       Two df            135
Men                                       81
          69           35        112     216

Degrees of freedom (df) always equal

(Number of rows – 1) × (Number of columns – 1)
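The formula is easy to check in code. In this sketch (function name illustrative) the table above has 2 rows and 3 columns, and the earlier 2 × 2 tables give the 1 degree of freedom used for the 3.84 cutoff.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-squared test on an n_rows x n_cols table."""
    return (n_rows - 1) * (n_cols - 1)

df_three_columns = degrees_of_freedom(2, 3)   # Women/Men by Too Young/No/Yes
df_two_by_two = degrees_of_freedom(2, 2)      # e.g. Sex by Marijuana No/Yes

print(df_three_columns, df_two_by_two)
```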

Page 30: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Health studies and risk

Research question: Do strong electromagnetic fields cause cancer?

50 dogs randomly split into two groups: no field, yes field. The response is whether they get lymphoma.

Rows: mag field   Columns: cancer

        no     yes    All
no      20       5     25
yes     10      15     25
All     30      20     50

Page 31: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Terminology and jargon:

In the mag field group, 15/25 of the dogs got cancer. Therefore, the following are all equivalent:

1. 60% of the dogs in this group got cancer.

2. The proportion of dogs in this group that got cancer is 0.6.

3. The probability that a dog in this group got cancer is 0.6.

4. The risk of cancer in this group is 0.6

And one more: The odds of cancer in this group are 3/2.

Page 32: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

More terminology and jargon:

1. Identify the ‘bad’ response category: In this example, cancer

2. Treatment risk: 15 / 25 or .60 or 60%

3. Baseline risk: 5 / 25 or .20 or 20%

4. Relative risk: Treatment risk over Baseline risk = .60 / .20 = 3. That is, the treatment risk is three times as large as the baseline risk.

5. Increased risk: By how much does the risk increase for treatment as compared to control? (.60 - .20) / .20 = 2 or 200%. That is, the risk is 200% higher in the treatment group.

6. Odds ratio: Ratio of treatment odds to baseline odds. (15/10) / (5/20) turns out to be 6. That is, the treatment odds are six times as large as the baseline odds.
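The risk statements for the dog example can be reproduced with exact fractions. This is a sketch with illustrative names; the counts are the ones from the magnetic-field table.

```python
from fractions import Fraction

# Dogs with lymphoma / dogs in the group.
treatment_risk = Fraction(15, 25)   # magnetic field group
baseline_risk = Fraction(5, 25)     # no-field group

relative_risk = treatment_risk / baseline_risk
increased_risk = (treatment_risk - baseline_risk) / baseline_risk

def odds(risk):
    """Convert a risk (probability) into odds: p / (1 - p)."""
    return risk / (1 - risk)

odds_ratio = odds(treatment_risk) / odds(baseline_risk)

print(relative_risk, increased_risk, odds_ratio)
```

Note that the odds ratio (6) is larger than the relative risk (3); the two measures are not interchangeable.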

Page 33: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Final note:

When the chi-squared test is statistically significant then it makes sense to compute the various risk statements.

If there is no statistical significance then the skeptic wins.

There is no evidence in the data for differences in risk for the categories of the explanatory variable.

Page 34: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Recall marijuana example

Marijuana?           No       Yes      All
Female   observed    56       76       132
         expected    54.95    77.05    132.00
Male     observed    31       46       77
         expected    32.05    44.95    77.00
Total                87       122      209

Chi-Sq = 0.020 + 0.014 + 0.034 + 0.025 = 0.093

SO THE SKEPTIC WINS. But what if we observed a much larger sample? Say, 100 times larger?

Page 35: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Marijuana example, larger sample:

Marijuana?           No       Yes      All
Female   observed    5600     7600     13200
         expected    5495     7705     13200
Male     observed    3100     4600     7700
         expected    3205     4495     7700
Total                8700     12200    20900

Chi-Sq = 2.0 + 1.4 + 3.4 + 2.5 = 9.3

NOW THE RESEARCH ADVOCATE WINS.
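A short sketch makes the effect of sample size concrete: multiplying every count by 100 leaves the percentages unchanged but multiplies the chi-squared statistic by 100. The helper name `chi_squared` is illustrative; 3.84 is the 1-degree-of-freedom cutoff from Step 3.

```python
CUTOFF = 3.84   # 5% cutoff for chi-squared with 1 degree of freedom

def chi_squared(table):
    """Chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    return stat

small = [[56, 76], [31, 46]]
large = [[100 * count for count in row] for row in small]

for name, table in (("original", small), ("100x larger", large)):
    stat = chi_squared(table)
    winner = "research advocate" if stat > CUTOFF else "skeptic"
    print(f"{name}: Chi-Sq = {stat:.2f}, {winner} wins")
```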

Page 36: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Practical significance

In the marijuana example, 58% of women and 60% of men reported that they had tried marijuana. This size of difference, even if it is really in the population, is probably uninteresting. Yet we have seen that a large sample size can make it statistically significant.

Hence, in the interpretation of statistical significance, we should also address the issue of practical significance.

In other words, we should answer the skeptic’s second question: WHO CARES?

Page 37: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Simpson’s paradox (for quantitative variables)

[Scatterplot: Price vs. # of pages for 15 books. x-axis: pages (100 to 600); y-axis: price (20 to 80).]

Correlation = -.312

Example 11.4, pp. 204-205

Page 38: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Simpson’s paradox (for quantitative variables)

Correlation= -.312

Example 11.4, pp. 204-205

[Scatterplot: Price vs. # of pages for 15 books, with each point labeled H or S. x-axis: pages (100 to 600); y-axis: price (20 to 80).]

H Correlation = .348

S Correlation = .637

Page 39: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Simpson’s paradox for categorical variables, as seen in video

Overall admitted to City U.

          Number       Percent
Men       198 / 360    55%
Women      88 / 200    44%

Business (hard)

          Number       Percent
Men       18 / 120     15%
Women     24 / 120     20%

Law (easy)

          Number       Percent
Men       180 / 240    75%
Women      64 / 80     80%

Women better in each, but more men apply to easier law school!
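The reversal can be verified directly from the admission counts. In this sketch (names illustrative), women have the higher admission rate within each school, yet the lower rate once the schools are pooled.

```python
# (admitted, applied) by school and sex, from the City U. example.
admissions = {
    "Business": {"men": (18, 120), "women": (24, 120)},
    "Law":      {"men": (180, 240), "women": (64, 80)},
}

def rate(admitted, applied):
    return admitted / applied

# Admission rate within each school.
by_school = {
    school: {sex: rate(*counts) for sex, counts in groups.items()}
    for school, groups in admissions.items()
}

# Pool the two schools before computing the rate.
overall = {}
for sex in ("men", "women"):
    admitted = sum(groups[sex][0] for groups in admissions.values())
    applied = sum(groups[sex][1] for groups in admissions.values())
    overall[sex] = rate(admitted, applied)

print(by_school)
print(overall)
```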

Page 40: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rules for combining probabilities

0 ≤ Probability ≤ 1

1. If there are only two possible outcomes, then their probabilities must sum to 1.

2. If two events cannot happen at the same time, they are called mutually exclusive. The probability of at least one happening (one or the other) is the sum of their probabilities. [Rule 1 is a special case of this.]

3. If two events do not influence each other, they are called independent. The probability that they happen at the same time is the product of their probabilities.

4. If the occurrence of one event forces the occurrence of another event, then the probability of the second event is always at least as large as the probability of the first event.

Page 41: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rule 1: If there are only two possible outcomes, then their probabilities must sum to 1.

According to Example 3, page 302: P(lost luggage) = 1/176 = .0057

Thus, P(luggage not lost) = 1 – 1/176 = 175/176 = .9943

The point of rule 1 is that P(lost) + P(not lost) = 1, so if we know P(lost), then we can find P(not lost).

Sounds simple, right? It can be surprisingly powerful.

Page 42: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rule 2: If two events cannot happen at the same time, they are called mutually exclusive.

Example 5, page 303:

Suppose P(A in stat) = .50 and P(B in stat) = .30.

Then P(A or B in stat) = .50 + .30 = .80

Note that the events ‘A in stat’ and ‘B in stat’ are mutually exclusive. Do you see why?

In this case, the probability of at least one happening is the sum of their probabilities. [Rule 1 is a special case of this.]

Page 43: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rule 3: If two events do not influence each other, they are called independent.

In this case, the probability that they happen at the same time is the product of their probabilities.

Example 8, page 303:

Suppose you believe that P(A in stat) = .5 and P(A in history) = .6.

Further, you believe that the two events are independent, so that they do not influence each other.

Then P(A in stat and A in history) = (.5)×(.6) = .3

Is this a reasonable assumption?

Page 44: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Rule 4: If the occurrence of one event forces the occurrence of another event, then the probability of the second event is always at least as large as the probability of the first event.

If event A forces event B to occur, then P(A) ≤ P(B)

Special case: P(E and F) ≤ P(E)

P(E and F) ≤ P(F)

(because ‘E and F’ forces E to occur).
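The four rules can be checked against the numbers used in the examples on the preceding slides. This is a sketch with illustrative variable names; exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Rule 1: two possible outcomes, so the probabilities sum to 1.
p_lost = Fraction(1, 176)
p_not_lost = 1 - p_lost

# Rule 2: mutually exclusive events, so "one or the other" adds.
p_a_in_stat = Fraction(1, 2)
p_b_in_stat = Fraction(3, 10)
p_a_or_b_in_stat = p_a_in_stat + p_b_in_stat

# Rule 3: independent events (an assumption in the example), so "both" multiplies.
p_a_in_history = Fraction(3, 5)
p_a_in_both = p_a_in_stat * p_a_in_history

# Rule 4: "A in both" forces "A in stat", so it cannot be more probable.
rule_4_holds = p_a_in_both <= p_a_in_stat and p_a_in_both <= p_a_in_history

print(p_not_lost, p_a_or_b_in_stat, p_a_in_both, rule_4_holds)
```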

Page 45: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Two laws (only one of them valid):

Law of large numbers: Over the long haul, we expect about 50% heads (this is true).

“Law of small numbers”: If we’ve seen a lot of tails in a row, we’re more likely to see heads on the next flip (this is completely bogus).

Remember: The law of large numbers OVERWHELMS; it does not COMPENSATE.

Page 46: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

The game of Odd Man

Consider the “odd man” game. Three people at lunch toss a coin. The odd man has to pay the bill.

You are the odd man if you get a head and the other two have tails or if you get a tail and the other two have heads. Notice that there will not always be an odd man – this occurs if flips come up HHH or TTT.

P(no odd man) = P(HHH or TTT)
= P(HHH) + P(TTT) since HHH, TTT are mutually exclusive
= (1/2)^3 + (1/2)^3 since H,H,H are independent (as are T,T,T)
= 1/8 + 1/8 = .25

Thus, P(there is an odd man) = 1 – P(no odd man) = 1 - .25 = .75
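The same answer can be obtained by brute force: list all eight equally likely outcomes of three fair coin flips and count those with no odd man. The names in this sketch are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes for three fair coins.
outcomes = list(product("HT", repeat=3))

# No odd man exactly when all three coins match (HHH or TTT).
no_odd_man = [flips for flips in outcomes if len(set(flips)) == 1]

p_no_odd_man = Fraction(len(no_odd_man), len(outcomes))
p_odd_man = 1 - p_no_odd_man

print(p_no_odd_man, p_odd_man)
```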

Page 47: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Play until there is an odd man. What is the probability this will take exactly three tries?

P(odd man occurs on the third try)

= P(miss, miss, hit) in that order! That’s the only way. (See why?)

= P(miss) P(miss) P(hit) since each try is independent of the others.

= [P(miss)]^2 P(hit)

= [.25]^2 (.75)

= .047 This is the final answer: The probability that the odd man occurs exactly on the third try (after two unsuccessful tries).
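The same reasoning works for any number of tries: k - 1 misses followed by one hit. A minimal sketch, with a made-up function name:

```python
P_HIT = 0.75   # probability that a round produces an odd man

def p_first_odd_man_on_try(k, p_hit=P_HIT):
    """Probability that the first odd man occurs exactly on try k (tries independent)."""
    p_miss = 1 - p_hit
    return p_miss ** (k - 1) * p_hit

p_third = p_first_odd_man_on_try(3)
print(round(p_third, 3))
```

Summing this probability over k = 1, 2, 3, ... gives 1, since an odd man eventually occurs.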

Page 48: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Expectation

(Probability of winning: 244/495, or 49.3%)

What if you bet $10 on a game of craps? What is your expected profit?

You win $10 with probability .493

You lose $10 with probability .507

Expected profit: .493($10) + .507(-$10) = - $0.14
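The expected profit is a probability-weighted average of the two possible results. This sketch uses the exact winning probability 244/495 rather than the rounded .493; the variable names are illustrative.

```python
from fractions import Fraction

p_win = Fraction(244, 495)   # probability of winning at craps, from the slide
p_lose = 1 - p_win
bet = 10                     # dollars

# Weight each possible profit by its probability and add.
expected_profit = p_win * bet + p_lose * (-bet)

print(expected_profit, round(float(expected_profit), 2))
```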

Page 49: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Casino winnings, 10,000 games per day

[Histogram: Casino winnings for 1000 days. x-axis: winnings (-2000 to 4000); y-axis: Frequency (0 to 200).]

Expectation = $1400

Page 50: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Casino winnings, 100,000 games a day

[Histogram: Casino winnings for 1000 days. x-axis: winnings (5000 to 25000); y-axis: Frequency (0 to 250).]

Expectation = $14,000

Note: Now all values are positive
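One way to see why losing days disappear is to compare the expected daily winnings with their typical spread. The sketch below assumes every game is an independent $10 bet, so each game moves the casino's take by about $10 either way; the standard-deviation figures are an addition for illustration and do not appear on the slides.

```python
from math import sqrt

EDGE_PER_GAME = 0.14   # casino's expected gain per $10 bet (rounded, from the slides)
SD_PER_GAME = 10.0     # assumption: each game is roughly +$10 or -$10 for the casino

def daily_summary(n_games):
    """Expected daily winnings and their standard deviation for n independent games."""
    expectation = EDGE_PER_GAME * n_games
    spread = SD_PER_GAME * sqrt(n_games)
    return expectation, spread

for n_games in (10_000, 100_000):
    expectation, spread = daily_summary(n_games)
    print(f"{n_games} games: expect ${expectation:.0f}, spread about ${spread:.0f}, "
          f"ratio {expectation / spread:.1f}")
```

The expectation grows in proportion to the number of games while the spread grows only with its square root, so at 100,000 games a day a loss would sit more than four standard deviations below the expectation.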

Page 51: Assignment:  Review for Exam #2, Wednesday, Oct.  19 Chapters 10, 11, 12, 13, 16

Your winnings, a single game

We already calculated the expectation to be a loss of 14 cents. But you can’t lose 14 cents in one game; you either win 10 dollars or lose 10 dollars.

Thus, the expected value does not have to be a possible value for any individual case.