Statistics and Data Analysis
Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: http://people.stern.nyu.edu/wgreene Email: [email protected] Course web page: http://people.stern.nyu.edu/wgreene/Statistics/Outline.htm
Assignment 1 Solutions
Notes: (1) The data sets for this problem set (and for the other problem sets for this course) are all stored on the home page for this course. You can find links to all of them on the course outline, at the bottom with the links to the problem sets themselves.
(2) In the exercises below (and in the other problem sets), the initials HOG refer to the textbook Basic Statistical Ideas for Managers, by Hildebrand, Ott and Gray.
Part I. Describing Data
1. Consider the following values:
20 11 14 12 17 14 10 23 15 11 17 10 18 18 13 18
Find the mean, median, and mode for these data.
SOLUTION: There are 16 values, and these 16 values have a total of 241, so the mean (or
average) is 241 ÷ 16 = 15.0625. If you sort the values into ascending order, you get
10 10 11 11 12 13 14 14 15 17 17 18 18 18 20 23
The median of the 16 values falls between the 8th and 9th smallest values (14 and 15), so we report 14.5 as the median. The value 18 occurs three times, more often than any other value, so we report 18 as the mode.
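As a quick check, all three statistics can be computed with Python's standard `statistics` module (a sketch for verification, not part of the original assignment):

```python
# Verify the mean, median, and mode of the 16 values in Problem 1.
from statistics import mean, median, mode

data = [20, 11, 14, 12, 17, 14, 10, 23, 15, 11, 17, 10, 18, 18, 13, 18]

print(mean(data))    # 15.0625 (= 241/16)
print(median(data))  # 14.5 (average of the 8th and 9th sorted values)
print(mode(data))    # 18 (occurs three times)
```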
2. This is Exercise HOG 2.1, page 23. The data are available on the course website as HOG-Ex0201.mpj. An automobile manufacturer routinely keeps records on the number of finished
(passing all inspections) cars produced per eight-hour shift. The data for the last 28 shifts are 366 390 324 385 380 375 384 383 375 339
360 386 387 384 379 386 374 366 377 385
381 359 363 371 379 385 367 364
a. What is the average number of finished cars per shift based on these data?
b. Construct a histogram for these data.
c. The data above are the results from observing 28 eight-hour shifts. You are about to observe a 29th. What would be a good guess of how many finished cars will be observed?
d. Suppose the 29th shift was expected to be a very productive one – with large output. What
would be a good guess of the number of finished cars on a very good day?
a. SOLUTION: The 28 shift totals sum to 10,454, so the mean is 10,454 ÷ 28 ≈ 373.4 finished cars per shift.
b. SOLUTION: There are many ways to make a histogram, including by hand. Here's the default from Minitab:
Minitab selected intervals of width 10 and then centered them on the labels. It might look nicer to
have the bars between cutpoints:
This was achieved by clicking on the horizontal scale, then on the binning tab, selecting
Cutpoint, and choosing Midpoint/cutpoint positions as 320:400/10. (This is an instance of the
Minitab convention start:end/step.)
If you drew this by hand, you might have selected a different strategy, but the general appearance
should be similar. We’re asked for an explanation of this shape. We’re going to have to come up
with an extra-statistical argument, meaning a statement that goes beyond what the data are telling
us directly. One such statement is that the factory regularly produces automobiles in the 350-400
range, but every now and then something goes very wrong, with the production dropping by
about 40 to 50 cars. The data tell us directly that most of the time the production is in the range
350-400, but now and then the production drops to 320-340. The data do not tell us the reason
why this drop occurs; any such explanation goes beyond the statistical information.
c. SOLUTION: The mean of 373.4 would be a good guess. The median of 378 would be, too.
d. SOLUTION: The maximum of 390 might be tempting. But it is hard to expect a randomly chosen day to be as good as the best day in a sample, especially as the number of observations gets large. It would be better to move back from the extreme point; a slightly lower value such as 386 or 387 might be a better guess.
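If you prefer to check parts (a) and (b) without Minitab, here is a Python sketch that computes the mean and tallies the shifts into the same 320:400/10 cutpoint bins (the tallying loop is an illustration, not Minitab's actual algorithm):

```python
# Mean and cutpoint-binned counts for the 28 shifts of finished cars.
cars = [366, 390, 324, 385, 380, 375, 384, 383, 375, 339,
        360, 386, 387, 384, 379, 386, 374, 366, 377, 385,
        381, 359, 363, 371, 379, 385, 367, 364]

mean = sum(cars) / len(cars)
print(round(mean, 1))  # 373.4

# Tally each shift into the half-open intervals [320, 330), ..., [390, 400).
counts = {lo: sum(lo <= x < lo + 10 for x in cars) for lo in range(320, 400, 10)}
print(counts)  # the two low bins (320s and 330s) hold the outlying shifts
```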
3. Which of the two samples in each set has the higher standard deviation? You can tell by looking at the data; it is not necessary to do any computation to answer this question. Explain your reasoning for each answer.
Set 1 Sample A: 16, 16, 16, 16, 16
Sample B: 15, 16, 16, 16, 16
Set 2 Sample A: 20, 25, 25, 25, 30
Sample B: 15, 25, 25, 25, 35
Set 3 Sample A: 20, 20, 30, 40, 40
Sample B: 20, 25, 30, 35, 40
SOLUTION:
Set 1. The S.D. of Sample A is 0.0, since every value equals 16. It is greater than 0 for Sample B, so Sample B has the larger standard deviation.
Set 2. Sample B matches A in the middle, but its leftmost observation is smaller and its rightmost observation is larger than in A, so Sample B has the larger standard deviation.
Set 3. Observations 1, 3, and 5 are the same in both samples, and both means equal 30. Observations 2 and 4 contribute (20 − 30)² + (40 − 30)² = 200 in A, but only (25 − 30)² + (35 − 30)² = 50 in B, so Sample A has the larger standard deviation.
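The reasoning above can be confirmed numerically; a sketch using the sample standard deviation from Python's `statistics` module:

```python
# Confirm which sample in each set has the larger standard deviation.
from statistics import stdev

set1 = ([16, 16, 16, 16, 16], [15, 16, 16, 16, 16])
set2 = ([20, 25, 25, 25, 30], [15, 25, 25, 25, 35])
set3 = ([20, 20, 30, 40, 40], [20, 25, 30, 35, 40])

for a, b in (set1, set2, set3):
    print(round(stdev(a), 3), round(stdev(b), 3))
```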
4. This is Exercise HOG 2.23, page 42. The data file is HOG-Ex0222.mpj on the course outline. Data on 60 telephone operators, in terms of the number of call requests processed in a workday, were analyzed using Minitab.
Descriptive Statistics: Cleared
Variable   N   Mean    SE Mean  StDev   Minimum  Q1      Median  Q3      Maximum
Cleared    60  794.23  4.42     34.25   601.00   789.00  799.00  807.75  844.00
Data Display Cleared
797 794 817 813 817 793 762 719 804 811 837 804 790
796 807 801 805 811 835 787 800 771 794 805 797 724
820 601 817 801 798 797 788 802 792 779 803 807 789
787 794 792 786 808 808 844 790 763 784 739 805 817
804 807 800 785 796 789 842 829
a. Calculate the "mean plus-or-minus 1 standard deviation" interval used in the Empirical Rule discussed in class.
b. Of the 60 scores in “cleared,” 51 fall within the 1 standard deviation interval. How does this
result compare with the theoretical value of the Empirical Rule?
SOLUTION: The interval x̄ ± s is here 794.23 ± 34.25, or 759.98 to 828.48. The fraction falling within this interval is 51/60 = 85%. This is much larger than the expected 68%, so something must have inflated the standard deviation. The lone value 601 is likely the problem. If you set aside this single value, you'll find that the standard deviation of the remaining 59 values is 23.21, a huge reduction from the overall standard deviation of 34.25. In general, we do not recommend that you set aside values just because they are unusual.
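A short Python sketch reproduces the whole calculation from the raw "Cleared" values, including the count of 51 observations inside the one-standard-deviation interval:

```python
# Empirical Rule check for the 60 "Cleared" values.
from statistics import mean, stdev

cleared = [797, 794, 817, 813, 817, 793, 762, 719, 804, 811, 837, 804, 790,
           796, 807, 801, 805, 811, 835, 787, 800, 771, 794, 805, 797, 724,
           820, 601, 817, 801, 798, 797, 788, 802, 792, 779, 803, 807, 789,
           787, 794, 792, 786, 808, 808, 844, 790, 763, 784, 739, 805, 817,
           804, 807, 800, 785, 796, 789, 842, 829]

m, s = mean(cleared), stdev(cleared)
inside = sum(m - s <= x <= m + s for x in cleared)
print(round(m, 2), round(s, 2))       # 794.23 34.25
print(inside, round(inside / 60, 2))  # 51 0.85

# Setting aside the outlier 601 shrinks the standard deviation sharply.
print(round(stdev([x for x in cleared if x != 601]), 2))  # 23.21
```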
5. (Application) The data file WHO-HealthStudy.mpj (a Minitab project file) contains a famous data set. These data were used in the World Health Organization's 2000 comparison of the health care systems in 191 countries – nearly the entire world – that was widely discussed in the popular press (including on the front page of the New York Times). If you've seen Michael Moore's movie Sicko, or seen the trailer, there is a point at which he takes out a study on a clipboard and shows you how the United States ranked 37th in the world in "health care." These are part of the data that were used to do the study. The extract of the data file in the Minitab project contains 12 data columns, as shown below for the first few countries.
The variables in the file are
DALE = disability adjusted life expectancy
EDUC = average years of education
GINI = a measure of income inequality (low numbers are bad)
POPDEN = the population density, people per square kilometer
GDPC = per capita gross domestic product (country income)
GEFF = World Bank measure of the effectiveness of the government
VOICE = World Bank indicator of how democratic the country is
OECD = an indicator of whether the country is in the OECD (the Organization
for Economic Cooperation and Development. Notwithstanding its lofty
title, it is mainly a group of the world's wealthiest countries.)
COMP = an equally weighted average of survey results on five objectives
(Health, Health distribution, Responsiveness, Responsiveness in
Distribution, Fairness in financing).
EFFICIENCY = estimated overall efficiency (WHO/Paper 30, based on COMP)
PCHEXP = per capita health care expenditure (public)
PUBSHARE = proportion of total health expenditure paid for by the government
a. Let’s compare the incomes of the 30 OECD countries with the incomes of the 161 other
countries. A box-plot will be useful. Use Graph -> Boxplot (with groups), then graph
variable GDPC with group variable OECD. Note that OECD = 1 is the OECD and OECD =
0 is the other countries. What do you find?
b. Present a description of the variables DALE and GDPC using the tools discussed in class.
(Descriptive statistics will include means and standard deviations and medians. Graphical
tools include histograms and box plots.)
c. Does higher income buy higher life expectancy? Produce a scatter plot of DALE (on the Y
axis) against GDPC (on the X axis). What do you find?
d. Does education produce higher income? Produce a scatter plot of EDUC (on the X axis
against GDPC. What do you find? What conclusion do you draw?
e. Do higher levels of education appear to be associated with higher life expectancy?
SOLUTION: a. The figure tells the story.
b. The full set of descriptive statistics is given below.
The two variables have different scales, so we cannot reasonably plot them on the same figure. Here is the box plot of the two together, for example:

[Figure: "Boxplot of DALE, GDPC" – both variables on a common vertical axis running from 0 to 30000, which makes the DALE box nearly unreadable.]
Here is a possibly informative histogram of DALE.
The boxplot of DALE provides only five benchmarks (minimum, Q1, median, Q3, maximum). We had the numbers from Descriptive Statistics, and the boxplot turns these into a visual display:

[Figure: "Boxplot of DALE" – vertical axis from 20 to 80.]
Minitab provides a nice summary for a variable, using Stat -> Basic Statistics -> Graphical
Summary
[Graphical Summary for GDPC]
Anderson-Darling Normality Test: A-Squared = 17.28, P-Value < 0.005
Mean 6609.4, StDev 7164.8, Variance 51334756.9
Skewness 1.48158, Kurtosis 1.07107, N 191
Minimum 521.5, 1st Quartile 1607.1, Median 3445.9, 3rd Quartile 8476.8, Maximum 30264.3
95% Confidence Intervals: Mean 5586.8 to 7632.0; Median 2812.0 to 4347.7; StDev 6511.2 to 7965.5
The income variable shows the familiar skewness pattern. The skewness measure that we discussed in class is given above as 1.48, which is fairly large. Later in the semester, we will discuss how to decide if this is statistically "large."
Here is a similar picture for DALE:

[Graphical Summary for DALE]
Anderson-Darling Normality Test: A-Squared = 6.16, P-Value < 0.005
Mean 56.827, StDev 12.295, Variance 151.165
Skewness -0.722708, Kurtosis -0.595411, N 191
Minimum 25.921, 1st Quartile 47.771, Median 60.322, 3rd Quartile 65.353, Maximum 74.827
95% Confidence Intervals: Mean 55.072 to 58.582; Median 59.175 to 61.782; StDev 11.173 to 13.669
c. A scatter plot of DALE against GDPC appears below.
The figure shows two well-known results. First, higher per capita income is very definitely associated with higher life expectancy. Second, there is a plateau: expected life spans peak at about 72 years.
d. We (as educators) certainly hope that more education is associated with higher income. The figure does seem to be consistent with this. A low level of education is certainly bad for income, but a high level of education does not guarantee economic success.
Part II. Unconditional and Conditional Probability
6. An icosahedron is a regular geometric figure that has 20 geometrically identical faces. These can be marked with numbers and used like dice. Suppose that you have one of these, and that it is marked with the integers 1, 2, 3, ..., 20. If you roll this object exactly once, find
a. the probability that the top face has an even number;
b. the probability that the top face has a number greater than 7;
c. the probability that the top face has a number at most 4;
d. the probability that the top face has a number less than 4.
SOLUTION:
(a) This is 10/20 = 1/2 = 0.50.
(b) The description "greater than 7" here refers to the set {8, 9, 10, 11, …, 20}. The probability is 13/20 = 0.65.
(c) The description "at most 4" refers to {1, 2, 3, 4}. The probability is 4/20 = 1/5 = 0.20.
(d) The description "less than 4" refers to {1, 2, 3}, and the probability is 3/20 = 0.15.
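All four answers amount to counting favorable faces out of 20; a one-line-per-part Python sketch:

```python
# Probabilities for one roll of a fair 20-sided die.
faces = range(1, 21)
n = len(faces)

print(sum(f % 2 == 0 for f in faces) / n)  # (a) 0.5
print(sum(f > 7 for f in faces) / n)       # (b) 0.65
print(sum(f <= 4 for f in faces) / n)      # (c) 0.2
print(sum(f < 4 for f in faces) / n)       # (d) 0.15
```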
7. In a fit of boredom, Stanley decides that he will flip a coin 80 times.
a. What is the probability that the first flip will be heads?
b. What is the probability that the first flip will be tails?
c. What is the probability that the second flip will be heads?
d. What is the probability that the 43rd flip will be tails?
e. What is the probability that the total number of heads, out of 80 flips, will be an even
number?
SOLUTION: For each of parts (a) through (d), the solution is clearly 1/2. For part (e), imagine
that you’ve done the first 79 flips. At that point, you either will have an even number of heads or
an odd number of heads. If the first 79 flips gave you an even number of heads, the probability is
1/2 that the final flip (by getting tails) will leave you with an even number. If the first 79 flips
gave you an odd number of heads, the probability is 1/2 that the final flip (by getting heads) will
bring you back to an even number. Thus, the probability that 80 flips will result in an even
number of heads is 1/2.
NOTES: The fact that the coin is fair, meaning P(heads) = P(tails) = 1/2, is critical to this
solution. For a fair coin, the probability of getting exactly 40 heads in 80 flips is 0.0889, about
9%.
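Both facts in the note can be checked directly from the binomial distribution; a sketch using `math.comb`:

```python
# Binomial checks for 80 fair coin flips.
from math import comb

# P(exactly 40 heads) = C(80, 40) / 2^80
p40 = comb(80, 40) / 2**80
print(round(p40, 4))  # about 0.0889

# P(even number of heads) = sum of the even binomial terms, which is exactly 1/2
p_even = sum(comb(80, k) for k in range(0, 81, 2)) / 2**80
print(p_even)  # 0.5
```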
8. A secretary has left you four index cards involving phone messages. You will have to call back
Johnson, Ortega, Green, and Baker. If you shuffle these cards into random order,
a. What is the probability that the names will end up in alphabetical order, starting with
Baker?
b. What is the probability that Johnson will precede Green?
This should be interpreted as having Johnson anywhere before Green,
meaning either one turn ahead or two turns ahead or three turns ahead.
c. What is the probability that Johnson and Green will be on consecutive cards?
SOLUTION: There are 4! = 24 possible orders, and all are equally likely.
(a) The probability that you will end up with order Baker-Green-Johnson-Ortega is then 1/24.
(b) The probability is 1/2. Either Johnson precedes Green or Green precedes Johnson. These two
descriptions are symmetric, so they must have equal probabilities.
(c) There are many ways to do this. One reliable method just lists all 24 orders; those marked with * have J and G on consecutive cards:

BGJO*  GBJO   JBGO   OBGJ*
BGOJ   GBOJ   JBOG   OBJG*
BJGO*  GJBO*  JGBO*  OGBJ
BJOG   GJOB*  JGOB*  OGJB*
BOGJ*  GOBJ   JOBG   OJBG
BOJG*  GOJB   JOGB   OJGB*

There are 12 such cases, so the probability is 12/24 = 1/2.
Here’s a method that is less clerically intensive. There are 4! = 24 possible orders. Let’s
count up the number of orders in which J and G are consecutive. Think of three symbols B, X, O,
where the X represents (J, G) together. There are 3! = 6 orderings of these three symbols. The
count must however be doubled to account for both JG and GJ. Then there are 2 × 6 = 12 orders
in which J and G are consecutive. The probability is then 12/24 = 1/2, as above. This approach
makes it easy to generalize to any number of cards. Suppose that you have n cards, randomly shuffled, and you want to know the probability that any particular two of them will be together. This probability is 2(n − 1)!/n! = 2(n − 1)!/[n(n − 1)!] = 2/n. The case n = 4 is illustrated by this problem.
9. Liz Waters, the manager of Food City Supermarket, has all sorts of data on the store's computer. The customers use "frequent shopper" scan tags at each visit, and most of the items are priced by scanning bar codes. Liz is exploring whether coupons can be used to entice consumers to change brands; in particular, will a coupon for Tide detergent cause consumers to purchase Tide? Over the four-week study period, each customer who buys any detergent is given a coupon for $1 off the next purchase of Tide. The display below is limited to those customers who, during the four-week study period,
* purchased detergent during the first week
* visited the store a second time during the four weeks and made a purchase of at least $25
A summary of the transactions from the study (350 customers in all):

                        Bought Tide later   Did not   Total
Bought Tide in week 1            31            71      102
Bought another brand             38           210      248
Total                            69           281      350
There were no instances in which a customer bought two different detergent brands on the same visit. Suppose that one of the customers is selected at random. Find
a. the probability that the customer bought Tide in week 1.
b. the probability that the customer bought Tide in week 1 and also purchased Tide again
during the study period.
c. the conditional probability that the customer bought Tide at a subsequent visit, given
that he or she bought Tide in week 1.
d. the conditional probability that the customer bought Tide at a subsequent visit, given
that he or she bought a non-Tide detergent in week 1.
SOLUTION:
(a) Since 102 of the customers are in this category, the probability is 102/350 ≈ 0.2914.
(b) This probability is 31/350 ≈ 0.0886 ≈ 9%.
(c) This is 31/102 ≈ 0.3039 ≈ 30%.
(d) This probability is 38/248 ≈ 0.1532 ≈ 15%.
This would seem to show that people who bought Tide the first time (without a coupon) were
relatively eager, at 30%, to use the coupon to purchase Tide again. About 15% of the non-Tide
customers used the coupon to buy Tide. This makes an interesting statement about brand loyalty,
but it says absolutely nothing about whether Tide is a good product.
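With the counts used above, the four probabilities reduce to simple ratios; a sketch (the counts are taken from the solution, not recomputed from raw data):

```python
# Probabilities for the Tide coupon study, from the summary counts.
total = 350
tide_week1 = 102       # bought Tide in week 1
tide_repeat = 31       # of those, bought Tide again later
other_week1 = 248      # bought a non-Tide detergent in week 1
other_to_tide = 38     # of those, bought Tide at a later visit

print(round(tide_week1 / total, 4))           # (a) 0.2914
print(round(tide_repeat / total, 4))          # (b) 0.0886
print(round(tide_repeat / tide_week1, 4))     # (c) 0.3039
print(round(other_to_tide / other_week1, 4))  # (d) 0.1532
```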
10. Suppose that, hypothetically, 88% of all people being tried for burglary are in fact guilty of the crime. Suppose that 6% of innocent people are convicted at trial and that 74% of guilty people are convicted at trial.
a. If a person is convicted at trial, what is the probability that the person really is guilty
of the crime?
b. If a person is acquitted at trial, what is the probability that the person really is innocent
of the crime?
SOLUTION: Let G indicate guilty, ~G indicate not guilty (innocent), C = convicted and ~C =
not convicted (acquitted). The “facts of the case” give us directly P(G) = .88 so P(~G) = 1 - .88 =
.12. We are also given P(C|~G) = .06 and P(C|G) = .74. The first question asks for P(G|C). The
direct solution uses Bayes theorem.
P(G|C) = P(G and C)/P(C) = [P(C|G)P(G)]/[P(C|G)P(G) + P(C|~G)P(~G)]
= .74(.88)/[.74(.88) + .06(.12)] = .9891.
The second question asks for P(~G|~C). Once again the solution is easily found using Bayes
Theorem. We are given P(C|~G) = .06, so P(~C|~G) = 1 - .06 = .94. Also, we are given P(C|G) =
.74 so P(~C|G) = 1 - .74 = .26. Then, using the theorem again,
P(~G|~C) = [P(~C|~G)P(~G)] / [P(~C|~G)P(~G) + P(~C|G)P(G)]
= [.94(.12)]/[.94(.12) + .26(.88)] = .3302.
There are other ways to solve this sort of problem. A "hypothetical hundred thousand" method also works easily. Just imagine 100,000 (or any other convenient large number) cases, apportioned in the given proportions: 88,000 guilty and 12,000 innocent.
Of the 12,000 innocent, there will be (in the expected-value sense) 0.06 × 12,000 = 720 who are convicted. Of the 88,000 guilty, there will be 0.74 × 88,000 = 65,120 convicted. Filling in the remaining details gives this table:

             Convicted   Acquitted     Total
Guilty          65,120      22,880    88,000
Innocent           720      11,280    12,000
Total           65,840      34,160   100,000
(a) There are 65,840 convictions here, and 65,120 are in fact guilty of the crime. Thus, the
conditional probability of guilt, given conviction, is 65,120/65,840 ≈ 0.9891. This is just under
99%.
(b) There are 34,160 acquittals. Of these, 11,280 are innocent. The conditional
probability of innocence, given acquittal, is 11,280/34,160 ≈ 0.3302. This is about 33%.
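The hypothetical-hundred-thousand table translates directly into a few lines of Python; a sketch:

```python
# Burglary trial: guilt given conviction, innocence given acquittal.
N = 100_000
guilty = 0.88 * N                # 88,000
innocent = N - guilty            # 12,000
conv_guilty = 0.74 * guilty      # 65,120 guilty and convicted
conv_innocent = 0.06 * innocent  # 720 innocent but convicted

p_guilt_given_conv = conv_guilty / (conv_guilty + conv_innocent)

acq_guilty = guilty - conv_guilty        # 22,880 guilty but acquitted
acq_innocent = innocent - conv_innocent  # 11,280 innocent and acquitted
p_innoc_given_acq = acq_innocent / (acq_guilty + acq_innocent)

print(round(p_guilt_given_conv, 4))  # 0.9891
print(round(p_innoc_given_acq, 4))   # 0.3302
```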
11. The customer service office of Garsett Bank receives complaints regarding transactions at its
two off-site ATMs. We’ll identify these sites as A and B. We know the following:
Site A generates 70% of all the ATM activity, and site B generates 30%.
The proportion of transactions that lead to a complaint is 0.006.
Among the complaints received by the customer service office, 45% are related to site B.
Based on these facts, find the site-specific complaint rates. That is, find P(complaint | A) and P(complaint | B).
SOLUTION: This is again a Bayes' formula activity, but it's a bit convoluted. Let's set up the "hypothetical hundred thousand" layout. Of 100,000 transactions, 70% come from site A (70,000) and 30% from site B (30,000). We are told that 0.006 of the transactions lead to complaints, so there are 0.006 × 100,000 = 600 complaints in total. However, 45% of the complaints are traced to site B; noting that 0.45 × 600 = 270, the table becomes:

         Complaint   No complaint     Total
Site A        330         69,670     70,000
Site B        270         29,730     30,000
Total         600         99,400    100,000

At this point, it becomes clear that P(complaint | A) = 330/70,000 ≈ 0.0047 and that P(complaint | B) = 270/30,000 = 0.0090.
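The same hypothetical-hundred-thousand reasoning can be scripted; a sketch:

```python
# ATM complaints: back out the site-specific complaint rates.
N = 100_000
site_a, site_b = 0.70 * N, 0.30 * N  # 70,000 and 30,000 transactions
complaints = 0.006 * N               # 600 complaints overall
from_b = 0.45 * complaints           # 270 complaints traced to site B
from_a = complaints - from_b         # 330 complaints from site A

print(round(from_a / site_a, 4))  # P(complaint | A) = 0.0047
print(round(from_b / site_b, 4))  # P(complaint | B) = 0.009
```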
Part III. Expected Value, Covariance and Correlation
12. (This is Exercise HOG 4.11, page 148.) An investment syndicate is trying to decide which of
two $2,000,000 apartment houses to buy. An advisor estimates the following probabilities for the
two-year net returns (in thousands of dollars):
Return: -50 0 50 100 150 200 250
Probability for house 1: .02 .03 .20 .50 .20 .03 .02
Probability for house 2: .15 .10 .10 .10 .30 .20 .05
a. Calculate the expected net return for house 1 and for house 2.
b. Calculate the respective variances and standard deviations.
SOLUTION:
(a) The expected net return for house 1 is (-50) × 0.02 + 0 × 0.03 + … + 250 × 0.02 = 100. This
should have been obvious immediately, as the probabilities are symmetric about the center value
of 100. For house 2, this is (-50) × 0.15 + 0 × 0.10 + … + 250 × 0.05 = 105. Thus house 2 has the
better expected value (but not by much).
(b) For house 1, we can calculate the expected squared return as (-50)² × 0.02 + 0² × 0.03 + … + 250² × 0.02 = 12,500. The variance would then be 12,500 − 100² = 2,500, and the standard deviation would then be the square root of 2,500, which is 50. This could also have been done as the expected squared deviation from the average. The arithmetic is this:
(-50 − 100)² × 0.02 + (0 − 100)² × 0.03 + … + (250 − 100)² × 0.02 = 2,500
For house 2, the expected squared return is
(-50)² × 0.15 + 0² × 0.10 + … + 250² × 0.05 = 19,500
This leads to a variance of 19,500 − 105² = 8,475, with the standard deviation coming out as √8475 ≈ 92.0598.
This also could have been done around the average value 105, thus:
(-50 − 105)² × 0.15 + (0 − 105)² × 0.10 + … + (250 − 105)² × 0.05 = 8,475
You can see that house 2 has a better expected value, but its much larger standard deviation indicates that it is considerably more risky.
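The expected values, variances, and standard deviations can be checked with a short helper function; a sketch:

```python
# Expected net return and risk for the two apartment houses.
returns = [-50, 0, 50, 100, 150, 200, 250]
p_house1 = [0.02, 0.03, 0.20, 0.50, 0.20, 0.03, 0.02]
p_house2 = [0.15, 0.10, 0.10, 0.10, 0.30, 0.20, 0.05]

def summarize(probs):
    """Return (mean, variance, standard deviation) of the discrete return."""
    mu = sum(r * p for r, p in zip(returns, probs))
    var = sum(r * r * p for r, p in zip(returns, probs)) - mu ** 2
    return mu, var, var ** 0.5

print(summarize(p_house1))  # mean 100, variance 2500, SD 50
print(summarize(p_house2))  # mean 105, variance 8475, SD about 92.06
```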
13. The following (completely fictitious) table shows the joint probabilities of music CD sales per minute at Tower Records in a given month (back when there was a Tower Records, and back when people actually bought music CDs) and the number of concerts scheduled in that month. It appears that the two random variables may be correlated; the following investigates. (Note that the zeros at certain points in the table are by themselves suggestive.) We note that the values in the table are retrospective – we have simply tabulated observed results over a long history. This is in contrast to the case in which we examined specific months in which there were, say, 0 concerts scheduled, or 1; then the values in the table would be estimates of conditional probabilities, not joint probabilities.
a. What is the expected number of CD sales per minute?
b. What is the variance of the number of concerts per month?
c. Before doing the calculation, what sign do you expect for the covariance of the two
variables?
d. What is the covariance of the CD sales and number of concerts?
e. What is the correlation of the two variables?
f. What is the expected number of CD sales per minute in a month in which there are two major concerts scheduled? Same for three major concerts. Are the two values the same?
                    CDs Sold Per Minute
Concerts
per Month      0    1    2    3    4   Total
    0        .02  .05  .04  .01  .00    .12
    1        .02  .04  .06  .03  .00    .15
    2        .01  .03  .30  .04  .02    .40
    3        .00  .03  .10  .12  .08    .33
  Total      .05  .15  .50  .20  .10   1.00
SOLUTION:
(a) E[CD Sales] = 0(.05) + 1(.15) + 2(.50) + 3(.20) + 4(.10) = 2.15
(b) E[Concerts] = 0(.12) + 1(.15) + 2(.40) + 3(.33) = 1.94. So, the variance is
Var[Concerts] = 0²(.12) + 1²(.15) + 2²(.40) + 3²(.33) − 1.94² = 4.72 − 3.7636 = 0.9564.
(c) It appears that as Concerts increases, higher numbers of CD sales become more likely, so I
would guess that the covariance is positive.
(d) The covariance is Σ_Concerts Σ_CDs (Concerts)(CDs) P(Concerts, CDs) − μ_Concerts μ_CDs
= 4.64 − 1.94(2.15) = 0.469.
(e) We need to divide the covariance by the product of the two standard deviations. We found Var[Concerts] above as 0.9564. The variance of CDs is
0²(.05) + 1²(.15) + 2²(.50) + 3²(.20) + 4²(.10) − 2.15² = 5.55 − 4.6225 = 0.9275. So, the correlation is
ρ = 0.469 / (√0.9564 × √0.9275) ≈ 0.4980.
(f) The conditional probabilities given two concerts are P(CD=0|C=2) = .01/.40 = .025; P(CD=1|C=2) = .03/.40 = .075; P(CD=2|C=2) = .30/.40 = .75; P(CD=3|C=2) = .04/.40 = .10; P(CD=4|C=2) = .02/.40 = .05. With these, the conditional expected value is 0(.025) + 1(.075) + 2(.75) + 3(.10) + 4(.05) = 2.075. For three concerts, the conditional probabilities are .00/.33, .03/.33, .10/.33, .12/.33, and .08/.33, giving an expected value of (.03 + .20 + .36 + .32)/.33 = .91/.33 ≈ 2.758. No: the two conditional expected values are not the same, which is consistent with the positive correlation.
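Every quantity in this problem follows mechanically from the joint table; a sketch that recomputes them:

```python
# Moments, covariance, and correlation from the joint table of Problem 13.
# P[c][d] = P(Concerts = c, CDs sold per minute = d).
P = [[0.02, 0.05, 0.04, 0.01, 0.00],
     [0.02, 0.04, 0.06, 0.03, 0.00],
     [0.01, 0.03, 0.30, 0.04, 0.02],
     [0.00, 0.03, 0.10, 0.12, 0.08]]

p_cd = [sum(row[d] for row in P) for d in range(5)]  # CD marginal (column totals)
p_con = [sum(row) for row in P]                      # concert marginal (row totals)

mu_cd = sum(d * p for d, p in enumerate(p_cd))                       # 2.15
mu_con = sum(c * p for c, p in enumerate(p_con))                     # 1.94
var_cd = sum(d * d * p for d, p in enumerate(p_cd)) - mu_cd ** 2     # 0.9275
var_con = sum(c * c * p for c, p in enumerate(p_con)) - mu_con ** 2  # 0.9564
cov = (sum(c * d * P[c][d] for c in range(4) for d in range(5))
       - mu_con * mu_cd)                                             # 0.469
rho = cov / (var_con ** 0.5 * var_cd ** 0.5)                         # about 0.498

# Conditional expected CD sales given 2 and given 3 concerts.
e_given_2 = sum(d * P[2][d] for d in range(5)) / p_con[2]  # 2.075
e_given_3 = sum(d * P[3][d] for d in range(5)) / p_con[3]  # about 2.758
print(round(cov, 3), round(rho, 3), round(e_given_2, 3), round(e_given_3, 3))
```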
Expected Value and Variance
Discussed in class. See slides at the end of session 5.