Statistics and Data Analysis
Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: http://people.stern.nyu.edu/wgreene Email: [email protected] Course web page: http://people.stern.nyu.edu/wgreene/Statistics/Outline.htm
Assignment 1 Solutions
Notes: (1) The data sets for this problem set (and for the other problem sets for this course) are all stored on the home page for this course. You can find links to all of them on the course outline, at the bottom with the links to the problem sets themselves.
(2) In the exercises below (and in the other problem sets), the initials HOG refer to the textbook Basic Statistical Ideas for Managers, by Hildebrand, Ott and Gray.
Part I. Describing Data
1. Consider the following values:
20 11 14 12 17 14 10 23 15 11 17 10 18 18 13 18
Find the mean, median, and mode for these data.
SOLUTION: There are 16 values, and these 16 values have a total of 241, so the mean (or
average) is 241 ÷ 16 = 15.0625. If you sort the values into ascending order, you get
10 10 11 11 12 13 14 14 15 17 17 18 18 18 20 23
The median of the 16 values falls between the 8th and 9th smallest values (14 and 15), so we report 14.5 as the median. The value 18 occurs three times, more often than any other value, so we report 18 as the mode.
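As a quick check, all three statistics can be computed with Python's standard `statistics` module (a sketch for verification, not part of the original assignment):

```python
# Verify the mean, median, and mode of the 16 values in Problem 1.
from statistics import mean, median, mode

data = [20, 11, 14, 12, 17, 14, 10, 23, 15, 11, 17, 10, 18, 18, 13, 18]

print(mean(data))    # 15.0625 (= 241/16)
print(median(data))  # 14.5 (average of the 8th and 9th sorted values)
print(mode(data))    # 18 (occurs three times)
```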
2. This is Exercise HOG 2.1, page 23. The data are available on the course website as HOG-Ex0201.mpj. An automobile manufacturer routinely keeps records on the number of finished
(passing all inspections) cars produced per eight-hour shift. The data for the last 28 shifts are 366 390 324 385 380 375 384 383 375 339
360 386 387 384 379 386 374 366 377 385
381 359 363 371 379 385 367 364
a. What is the average number of finished cars per shift based on these data?
b. Construct a histogram for these data.
c. The data above are the results from observing 28 eight-hour shifts. You are about to observe a 29th. What would be a good guess of how many finished cars will be observed?
d. Suppose the 29th shift was expected to be a very productive one – with large output. What
would be a good guess of the number of finished cars on a very good day?
a. SOLUTION: The 28 shift totals sum to 10,454, so the mean is 10,454 ÷ 28 ≈ 373.4 finished cars per shift.
b. SOLUTION: There are many ways to make a histogram, including by hand. Here's the default from Minitab:
Minitab selected intervals of width 10 and then centered them on the labels. It might look nicer to
have the bars between cutpoints:
This was achieved by clicking on the horizontal scale, then on the binning tab, selecting
Cutpoint, and choosing Midpoint/cutpoint positions as 320:400/10. (This is an instance of the
Minitab convention start:end/step.)
If you drew this by hand, you might have selected a different strategy, but the general appearance
should be similar. We’re asked for an explanation of this shape. We’re going to have to come up
with an extra-statistical argument, meaning a statement that goes beyond what the data are telling
us directly. One such statement is that the factory regularly produces automobiles in the 350-400
range, but every now and then something goes very wrong, with the production dropping by
about 40 to 50 cars. The data tell us directly that most of the time the production is in the range
350-400, but now and then the production drops to 320-340. The data do not tell us the reason
why this drop occurs; any such explanation goes beyond the statistical information.
c. SOLUTION: The mean of 373.4 would be a good guess. The median of 378 would be, too.
d. SOLUTION: The maximum of 390 might be tempting. But it is hard to expect a randomly chosen day to be as good as the best day in a sample, especially as the number of observations gets large. It would be better to move back from the extreme point; a slightly lower value such as 386 or 387 might be a better guess.
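If you prefer to check parts (a) and (b) without Minitab, here is a Python sketch that computes the mean and tallies the shifts into the same 320:400/10 cutpoint bins (the tallying loop is an illustration, not Minitab's actual algorithm):

```python
# Mean and cutpoint-binned counts for the 28 shifts of finished cars.
cars = [366, 390, 324, 385, 380, 375, 384, 383, 375, 339,
        360, 386, 387, 384, 379, 386, 374, 366, 377, 385,
        381, 359, 363, 371, 379, 385, 367, 364]

mean = sum(cars) / len(cars)
print(round(mean, 1))  # 373.4

# Tally each shift into the half-open intervals [320, 330), ..., [390, 400).
counts = {lo: sum(lo <= x < lo + 10 for x in cars) for lo in range(320, 400, 10)}
print(counts)  # the two low bins (320s and 330s) hold the outlying shifts
```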
3. Which of the two samples in each set has the higher standard deviation? You can tell by looking at the data; it is not necessary to do any computation to answer this question. Explain your reasoning for each answer.
Set 1 Sample A: 16, 16, 16, 16, 16
Sample B: 15, 16, 16, 16, 16
Set 2 Sample A: 20, 25, 25, 25, 30
Sample B: 15, 25, 25, 25, 35
Set 3 Sample A: 20, 20, 30, 40, 40
Sample B: 20, 25, 30, 35, 40
SOLUTION:
Set 1. The S.D. of Sample A is 0.0, since every value equals 16. It is greater than 0 for Sample B, so Sample B has the larger standard deviation.
Set 2. Sample B matches A in the middle, but its leftmost observation is smaller and its rightmost observation is larger than in A, so Sample B has the larger standard deviation.
Set 3. Observations 1, 3, and 5 are the same in both samples, and both means equal 30. Observations 2 and 4 contribute (20 − 30)² + (40 − 30)² = 200 in A, but only (25 − 30)² + (35 − 30)² = 50 in B, so Sample A has the larger standard deviation.
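The reasoning above can be confirmed numerically; a sketch using the sample standard deviation from Python's `statistics` module:

```python
# Confirm which sample in each set has the larger standard deviation.
from statistics import stdev

set1 = ([16, 16, 16, 16, 16], [15, 16, 16, 16, 16])
set2 = ([20, 25, 25, 25, 30], [15, 25, 25, 25, 35])
set3 = ([20, 20, 30, 40, 40], [20, 25, 30, 35, 40])

for a, b in (set1, set2, set3):
    print(round(stdev(a), 3), round(stdev(b), 3))
```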
4. This is Exercise HOG 2.23, page 42. The data file is HOG-Ex0222.mpj on the course outline. Data on 60 telephone operators, in terms of the number of call requests processed in a workday, were analyzed using Minitab.
Descriptive Statistics: Cleared
Variable   N   Mean    SE Mean  StDev   Minimum  Q1      Median  Q3      Maximum
Cleared    60  794.23  4.42     34.25   601.00   789.00  799.00  807.75  844.00
Data Display Cleared
797 794 817 813 817 793 762 719 804 811 837 804 790
796 807 801 805 811 835 787 800 771 794 805 797 724
820 601 817 801 798 797 788 802 792 779 803 807 789
787 794 792 786 808 808 844 790 763 784 739 805 817
804 807 800 785 796 789 842 829
a. Calculate the "mean plus-or-minus 1 standard deviation" interval used in the Empirical Rule discussed in class.
b. Of the 60 scores in “cleared,” 51 fall within the 1 standard deviation interval. How does this
result compare with the theoretical value of the Empirical Rule?
SOLUTION: The interval x̄ ± s is here 794.23 ± 34.25, or 759.98 to 828.48. The fraction falling within this interval is 51/60 = 85%. This is much larger than the expected 68%, so something must have inflated the standard deviation. The lone value 601 is likely the problem. If you set aside this single value, you'll find that the standard deviation of the remaining 59 values is 23.21, a huge reduction from the overall standard deviation of 34.25. In general, we do not recommend that you set aside values just because they are unusual.
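A short Python sketch reproduces the whole calculation from the raw "Cleared" values, including the count of 51 observations inside the one-standard-deviation interval:

```python
# Empirical Rule check for the 60 "Cleared" values.
from statistics import mean, stdev

cleared = [797, 794, 817, 813, 817, 793, 762, 719, 804, 811, 837, 804, 790,
           796, 807, 801, 805, 811, 835, 787, 800, 771, 794, 805, 797, 724,
           820, 601, 817, 801, 798, 797, 788, 802, 792, 779, 803, 807, 789,
           787, 794, 792, 786, 808, 808, 844, 790, 763, 784, 739, 805, 817,
           804, 807, 800, 785, 796, 789, 842, 829]

m, s = mean(cleared), stdev(cleared)
inside = sum(m - s <= x <= m + s for x in cleared)
print(round(m, 2), round(s, 2))       # 794.23 34.25
print(inside, round(inside / 60, 2))  # 51 0.85

# Setting aside the outlier 601 shrinks the standard deviation sharply.
print(round(stdev([x for x in cleared if x != 601]), 2))  # 23.21
```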
5. (Application) The data file WHO-HealthStudy.mpj (a Minitab project file) contains a famous data set. These data were used in the World Health Organization's 2000 comparison of the health care systems in 191 countries – nearly the entire world – that was widely discussed in the popular press (including on the front page of the New York Times). If you've seen Michael Moore's movie Sicko, or seen the trailer, there is a point at which he takes out a study on a clipboard and shows you how the United States ranked 37th in the world in "health care." These are part of the data that were used to do the study. The extract of the data file in the Minitab project contains 12 data columns, as shown below for the first few countries.
The variables in the file are
DALE = disability adjusted life expectancy
EDUC = average years of education
GINI = a measure of income inequality (low numbers are bad)
POPDEN = the population density, people per square kilometer
GDPC = per capita gross domestic product (country income)
GEFF = World Bank measure of the effectiveness of the government
VOICE = World Bank indicator of how democratic the country is
OECD = an indicator of whether the country is in the OECD (the Organization
for Economic Cooperation and Development. Notwithstanding its lofty
title, it is mainly a group of the world's wealthiest countries.)
COMP = an equally weighted average of survey results on five objectives
(Health, Health distribution, Responsiveness, Responsiveness in
Distribution, Fairness in financing).
EFFICIENCY = estimated overall efficiency (WHO/Paper 30, based on COMP)
PCHEXP = per capita health care expenditure (public)
PUBSHARE = proportion of total health expenditure paid for by the government
a. Let’s compare the incomes of the 30 OECD countries with the incomes of the 161 other
countries. A box-plot will be useful. Use Graph -> Boxplot (with groups), then graph
variable GDPC with group variable OECD. Note that OECD = 1 is the OECD and OECD =
0 is the other countries. What do you find?
b. Present a description of the variables DALE and GDPC using the tools discussed in class.
(Descriptive statistics will include means and standard deviations and medians. Graphical
tools include histograms and box plots.)
c. Does higher income buy higher life expectancy? Produce a scatter plot of DALE (on the Y
axis) against GDPC (on the X axis). What do you find?
d. Does education produce higher income? Produce a scatter plot of EDUC (on the X axis
against GDPC. What do you find? What conclusion do you draw?
e. Do higher levels of education appear to be associated with higher life expectancy?
SOLUTION: a. The figure tells the story.
b. The full set of descriptive statistics is given below.
The two variables have different scales, so we cannot reasonably plot them on the same figure. Here is the box plot of the two together, for example:

[Figure: "Boxplot of DALE, GDPC" – both variables on a common vertical axis running from 0 to 30000, which makes the DALE box nearly unreadable.]
Here is a possibly informative histogram of DALE.
The boxplot of DALE provides only five benchmarks (minimum, Q1, median, Q3, maximum). We had the numbers from Descriptive Statistics, and the boxplot turns these into a visual display:

[Figure: "Boxplot of DALE" – vertical axis from 20 to 80.]
Minitab provides a nice summary for a variable, using Stat -> Basic Statistics -> Graphical
Summary
[Graphical Summary for GDPC]
Anderson-Darling Normality Test: A-Squared = 17.28, P-Value < 0.005
Mean 6609.4, StDev 7164.8, Variance 51334756.9
Skewness 1.48158, Kurtosis 1.07107, N 191
Minimum 521.5, 1st Quartile 1607.1, Median 3445.9, 3rd Quartile 8476.8, Maximum 30264.3
95% Confidence Intervals: Mean 5586.8 to 7632.0; Median 2812.0 to 4347.7; StDev 6511.2 to 7965.5
The income variable shows the familiar skewness pattern. The skewness measure that we discussed in class is given above as 1.48, which is fairly large. Later in the semester, we will discuss how to decide if this is statistically "large."
Here is a similar picture for DALE:

[Graphical Summary for DALE]
Anderson-Darling Normality Test: A-Squared = 6.16, P-Value < 0.005
Mean 56.827, StDev 12.295, Variance 151.165
Skewness -0.722708, Kurtosis -0.595411, N 191
Minimum 25.921, 1st Quartile 47.771, Median 60.322, 3rd Quartile 65.353, Maximum 74.827
95% Confidence Intervals: Mean 55.072 to 58.582; Median 59.175 to 61.782; StDev 11.173 to 13.669
c. A scatter plot of DALE against GDPC appears below.
The figure shows two well-known results. First, higher per capita income is very definitely associated with higher life expectancy. Second, there is a plateau: expected life spans peak at about 72 years.
d. We (as educators) certainly hope that more education is associated with higher income. The figure does seem to be consistent with this. A low level of education is certainly bad for income, but a high level of education does not guarantee economic success.
Part II. Unconditional and Conditional Probability
6. An icosahedron is a regular geometric figure that has 20 geometrically identical faces. These can be marked with numbers and used like dice. Suppose that you have one of these, and that it is marked with the integers 1, 2, 3, ..., 20. If you roll this object exactly once, find
a. the probability that the top face has an even number;
b. the probability that the top face has a number greater than 7;
c. the probability that the top face has a number at most 4;
d. the probability that the top face has a number less than 4.
SOLUTION:
(a) This is 10/20 = 1/2 = 0.50.
(b) The description "greater than 7" here refers to the set {8, 9, 10, 11, …, 20}. The probability is 13/20 = 0.65.
(c) The description "at most 4" refers to {1, 2, 3, 4}. The probability is 4/20 = 1/5 = 0.20.
(d) The description "less than 4" refers to {1, 2, 3}, and the probability is 3/20 = 0.15.
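All four answers amount to counting favorable faces out of 20; a one-line-per-part Python sketch:

```python
# Probabilities for one roll of a fair 20-sided die.
faces = range(1, 21)
n = len(faces)

print(sum(f % 2 == 0 for f in faces) / n)  # (a) 0.5
print(sum(f > 7 for f in faces) / n)       # (b) 0.65
print(sum(f <= 4 for f in faces) / n)      # (c) 0.2
print(sum(f < 4 for f in faces) / n)       # (d) 0.15
```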
7. In a fit of boredom, Stanley decides that he will flip a coin 80 times.
a. What is the probability that the first flip will be heads?
b. What is the probability that the first flip will be tails?
c. What is the probability that the second flip will be heads?
d. What is the probability that the 43rd flip will be tails?
e. What is the probability that the total number of heads, out of 80 flips, will be an even
number?
SOLUTION: For each of parts (a) through (d), the solution is clearly 1/2. For part (e), imagine
that you’ve done the first 79 flips. At that point, you either will have an even number of heads or
an odd number of heads. If the first 79 flips gave you an even number of heads, the probability is
1/2 that the final flip (by getting tails) will leave you with an even number. If the first 79 flips
gave you an odd number of heads, the probability is 1/2 that the final flip (by getting heads) will
bring you back to an even number. Thus, the probability that 80 flips will result in an even
number of heads is 1/2.
NOTES: The fact that the coin is fair, meaning P(heads) = P(tails) = 1/2, is critical to this
solution. For a fair coin, the probability of getting exactly 40 heads in 80 flips is 0.0889, about
9%.
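Both facts in the note can be checked directly from the binomial distribution; a sketch using `math.comb`:

```python
# Binomial checks for 80 fair coin flips.
from math import comb

# P(exactly 40 heads) = C(80, 40) / 2^80
p40 = comb(80, 40) / 2**80
print(round(p40, 4))  # about 0.0889

# P(even number of heads) = sum of the even binomial terms, which is exactly 1/2
p_even = sum(comb(80, k) for k in range(0, 81, 2)) / 2**80
print(p_even)  # 0.5
```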
8. A secretary has left you four index cards involving phone messages. You will have to call back
Johnson, Ortega, Green, and Baker. If you shuffle these cards into random order,
a. What is the probability that the names will end up in alphabetical order, starting with
Baker?
b. What is the probability that Johnson will precede Green?
This should be interpreted as having Johnson anywhere before Green,
meaning either one turn ahead or two turns ahead or three turns ahead.
c. What is the probability that Johnson and Green will be on consecutive cards?
SOLUTION: There are 4! = 24 possible orders, and all are equally likely.
(a) The probability that you will end up with order Baker-Green-Johnson-Ortega is then 1/24.
(b) The probability is 1/2. Either Johnson precedes Green or Green precedes Johnson. These two
descriptions are symmetric, so they must have equal probabilities.
(c) There are many ways to do this. One reliable method just lists all 24 orders; those marked with * have J and G on consecutive cards:

BGJO*  GBJO   JBGO   OBGJ*
BGOJ   GBOJ   JBOG   OBJG*
BJGO*  GJBO*  JGBO*  OGBJ
BJOG   GJOB*  JGOB*  OGJB*
BOGJ*  GOBJ   JOBG   OJBG
BOJG*  GOJB   JOGB   OJGB*

There are 12 such cases, so the probability is 12/24 = 1/2.
Here’s a method that is less clerically intensive. There are 4! = 24 possible orders. Let’s
count up the number of orders in which J and G are consecutive. Think of three symbols B, X, O,
where the X represents (J, G) together. There are 3! = 6 orderings of these three symbols. The
count must however be doubled to account for both JG and GJ. Then there are 2 × 6 = 12 orders
in which J and G are consecutive. The probability is then 12/24 = 1/2, as above. This approach
makes it easy to generalize to any number of cards. Suppose that you have n cards, randomly shuffled, and you want to know the probability that any particular two of them will be together. This probability is 2(n − 1)!/n! = 2(n − 1)!/[n(n − 1)!] = 2/n. The case n = 4 is illustrated by this problem.
9. Liz Waters, the manager of Food City Supermarket, has all sorts of data on the store's computer. The customers use "frequent shopper" scan tags at each visit, and most of the items are priced by scanning bar codes. Liz is exploring whether coupons can be used to entice consumers to change brands; in particular, will a coupon for Tide detergent cause consumers to purchase Tide? Over the four-week study period, each customer who buys any detergent is given a coupon for $1 off the next purchase of Tide. The display below is limited to those customers who, during the four-week study period,
* purchased detergent during the first week
* visited the store a second time during the four weeks and made a purchase of at least $25
A summary of the transactions from the study (350 customers in all):

                        Bought Tide later   Did not   Total
Bought Tide in week 1            31            71      102
Bought another brand             38           210      248
Total                            69           281      350
There were no instances in which a customer bought two different detergent brands on the same visit. Suppose that one of the customers is selected at random. Find
a. the probability that the customer bought Tide in week 1.
b. the probability that the customer bought Tide in week 1 and also purchased Tide again
during the study period.
c. the conditional probability that the customer bought Tide at a subsequent visit, given
that he or she bought Tide in week 1.
d. the conditional probability that the customer bought Tide at a subsequent visit, given
that he or she bought a non-Tide detergent in week 1.
SOLUTION:
(a) Since 102 of the customers are in this category, the probability is 102/350 ≈ 0.2914.
(b) This probability is 31/350 ≈ 0.0886 ≈ 9%.
(c) This is 31/102 ≈ 0.3039 ≈ 30%.
(d) This probability is 38/248 ≈ 0.1532 ≈ 15%.
This would seem to show that people who bought Tide the first time (without a coupon) were
relatively eager, at 30%, to use the coupon to purchase Tide again. About 15% of the non-Tide
customers used the coupon to buy Tide. This makes an interesting statement about brand loyalty,
but it says absolutely nothing about whether Tide is a good product.
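With the counts used above, the four probabilities reduce to simple ratios; a sketch (the counts are taken from the solution, not recomputed from raw data):

```python
# Probabilities for the Tide coupon study, from the summary counts.
total = 350
tide_week1 = 102       # bought Tide in week 1
tide_repeat = 31       # of those, bought Tide again later
other_week1 = 248      # bought a non-Tide detergent in week 1
other_to_tide = 38     # of those, bought Tide at a later visit

print(round(tide_week1 / total, 4))           # (a) 0.2914
print(round(tide_repeat / total, 4))          # (b) 0.0886
print(round(tide_repeat / tide_week1, 4))     # (c) 0.3039
print(round(other_to_tide / other_week1, 4))  # (d) 0.1532
```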
10. Suppose that, hypothetically, 88% of all people being tried for burglary are in fact guilty of the crime. Suppose that 6% of innocent people are convicted at trial and that 74% of guilty people are convicted at trial.
a. If a person is convicted at trial, what is the probability that the person really is guilty
of the crime?
b. If a person is acquitted at trial, what is the probability that the person really is innocent
of the crime?
SOLUTION: Let G indicate guilty, ~G indicate not guilty (innocent), C = convicted and ~C =
not convicted (acquitted). The “facts of the case” give us directly P(G) = .88 so P(~G) = 1 - .88 =
.12. We are also given P(C|~G) = .06 and P(C|G) = .74. The first question asks for P(G|C). The
direct solution uses Bayes theorem.
P(G|C) = P(G and C)/P(C) = [P(C|G)P(G)]/[P(C|G)P(G) + P(C|~G)P(~G)]
= .74(.88)/[.74(.88) + .06(.12)] = .9891.
The second question asks for P(~G|~C). Once again the solution is easily found using Bayes
Theorem. We are given P(C|~G) = .06, so P(~C|~G) = 1 - .06 = .94. Also, we are given P(C|G) =
.74 so P(~C|G) = 1 - .74 = .26. Then, using the theorem again,
P(~G|~C) = [P(~C|~G)P(~G)] / [P(~C|~G)P(~G) + P(~C|G)P(G)]
= [.94(.12)]/[.94(.12) + .26(.88)] = .3302.
There are other ways to solve this sort of problem. A "hypothetical hundred thousand" method also works easily. Just imagine 100,000 (or any other convenient large number) cases, apportioned in the given proportions: 88,000 guilty and 12,000 innocent.
Of the 12,000 innocent, there will be (in the expected-value sense) 0.06 × 12,000 = 720 who are convicted. Of the 88,000 guilty, there will be 0.74 × 88,000 = 65,120 convicted. Filling in the remaining details gives this table:

             Convicted   Acquitted     Total
Guilty          65,120      22,880    88,000
Innocent           720      11,280    12,000
Total           65,840      34,160   100,000
(a) There are 65,840 convictions here, and 65,120 are in fact guilty of the crime. Thus, the
conditional probability of guilt, given conviction, is 65,120/65,840 ≈ 0.9891. This is just under
99%.
(b) There are 34,160 acquittals. Of these, 11,280 are innocent. The conditional
probability of innocence, given acquittal, is 11,280/34,160 ≈ 0.3302. This is about 33%.
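The hypothetical-hundred-thousand table translates directly into a few lines of Python; a sketch:

```python
# Burglary trial: guilt given conviction, innocence given acquittal.
N = 100_000
guilty = 0.88 * N                # 88,000
innocent = N - guilty            # 12,000
conv_guilty = 0.74 * guilty      # 65,120 guilty and convicted
conv_innocent = 0.06 * innocent  # 720 innocent but convicted

p_guilt_given_conv = conv_guilty / (conv_guilty + conv_innocent)

acq_guilty = guilty - conv_guilty        # 22,880 guilty but acquitted
acq_innocent = innocent - conv_innocent  # 11,280 innocent and acquitted
p_innoc_given_acq = acq_innocent / (acq_guilty + acq_innocent)

print(round(p_guilt_given_conv, 4))  # 0.9891
print(round(p_innoc_given_acq, 4))   # 0.3302
```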
11. The customer service office of Garsett Bank receives complaints regarding transactions at its
two off-site ATMs. We’ll identify these sites as A and B. We know the following:
Site A generates 70% of all the ATM activity, and site B generates 30%.
The proportion of transactions that lead to a complaint is 0.006.
Among the complaints received by the customer service office, 45% are related to site B.
Based on these facts, find the site-specific complaint rates. That is, find P(complaint | A) and P(complaint | B).
SOLUTION: This is again a Bayes' formula activity, but it's a bit convoluted. Let's set up the "hypothetical hundred thousand" layout. Of 100,000 transactions, 70% come from site A (70,000) and 30% from site B (30,000). We are told that 0.006 of the transactions lead to complaints, so there are 0.006 × 100,000 = 600 complaints in total. However, 45% of the complaints are traced to site B; noting that 0.45 × 600 = 270, the table becomes:

         Complaint   No complaint     Total
Site A        330         69,670     70,000
Site B        270         29,730     30,000
Total         600         99,400    100,000

At this point, it becomes clear that P(complaint | A) = 330/70,000 ≈ 0.0047 and that P(complaint | B) = 270/30,000 = 0.0090.
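The same hypothetical-hundred-thousand reasoning can be scripted; a sketch:

```python
# ATM complaints: back out the site-specific complaint rates.
N = 100_000
site_a, site_b = 0.70 * N, 0.30 * N  # 70,000 and 30,000 transactions
complaints = 0.006 * N               # 600 complaints overall
from_b = 0.45 * complaints           # 270 complaints traced to site B
from_a = complaints - from_b         # 330 complaints from site A

print(round(from_a / site_a, 4))  # P(complaint | A) = 0.0047
print(round(from_b / site_b, 4))  # P(complaint | B) = 0.009
```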
Part III. Expected Value, Covariance and Correlation
12. (This is Exercise HOG 4.11, page 148.) An investment syndicate is trying to decide which of
two $2,000,000 apartment houses to buy. An advisor estimates the following probabilities for the
two-year net returns (in thousands of dollars):
Return: -50 0 50 100 150 200 250
Probability for house 1: .02 .03 .20 .50 .20 .03 .02
Probability for house 2: .15 .10 .10 .10 .30 .20 .05
a. Calculate the expected net return for house 1 and for house 2.
b. Calculate the respective variances and standard deviations.
SOLUTION:
(a) The expected net return for house 1 is (-50) × 0.02 + 0 × 0.03 + … + 250 × 0.02 = 100. This
should have been obvious immediately, as the probabilities are symmetric about the center value
of 100. For house 2, this is (-50) × 0.15 + 0 × 0.10 + … + 250 × 0.05 = 105. Thus house 2 has the
better expected value (but not by much).
(b) For house 1, we can calculate the expected squared return as (-50)² × 0.02 + 0² × 0.03 + … + 250² × 0.02 = 12,500. The variance would then be 12,500 − 100² = 2,500, and the standard deviation would then be the square root of 2,500, which is 50. This could also have been done as the expected squared deviation from the average. The arithmetic is this:
(-50 − 100)² × 0.02 + (0 − 100)² × 0.03 + … + (250 − 100)² × 0.02 = 2,500
For house 2, the expected squared return is
(-50)² × 0.15 + 0² × 0.10 + … + 250² × 0.05 = 19,500
This leads to a variance of 19,500 − 105² = 8,475, with the standard deviation coming out as √8475 ≈ 92.0598.
This also could have been done around the average value 105, thus:
(-50 − 105)² × 0.15 + (0 − 105)² × 0.10 + … + (250 − 105)² × 0.05 = 8,475
You can see that house 2 has a better expected value, but its much larger standard deviation indicates that it is considerably more risky.
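The expected values, variances, and standard deviations can be checked with a short helper function; a sketch:

```python
# Expected net return and risk for the two apartment houses.
returns = [-50, 0, 50, 100, 150, 200, 250]
p_house1 = [0.02, 0.03, 0.20, 0.50, 0.20, 0.03, 0.02]
p_house2 = [0.15, 0.10, 0.10, 0.10, 0.30, 0.20, 0.05]

def summarize(probs):
    """Return (mean, variance, standard deviation) of the discrete return."""
    mu = sum(r * p for r, p in zip(returns, probs))
    var = sum(r * r * p for r, p in zip(returns, probs)) - mu ** 2
    return mu, var, var ** 0.5

print(summarize(p_house1))  # mean 100, variance 2500, SD 50
print(summarize(p_house2))  # mean 105, variance 8475, SD about 92.06
```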
13. The following (completely fictitious) table shows the joint probabilities of music CD sales per minute at Tower Records in a given month (back when there was a Tower Records, and back when people actually bought music CDs) and the number of concerts scheduled in that month. It appears that the two random variables may be correlated; the following investigates. (Note that the zeros at certain points in the table are by themselves suggestive.) We note that the values in the table are retrospective – we have simply tabulated observed results over a long history. This is in contrast to the case in which we examined specific months in which there were, say, 0 concerts scheduled, or 1; then the values in the table would be estimates of conditional probabilities, not joint probabilities.
a. What is the expected number of CD sales per minute?
b. What is the variance of the number of concerts per month?
c. Before doing the calculation, what sign do you expect for the covariance of the two
variables?
d. What is the covariance of the CD sales and number of concerts?
e. What is the correlation of the two variables?
f. What is the expected number of CD sales per minute in a month in which there are two major concerts scheduled? Same for three major concerts. Are the two values the same?
                    CDs Sold Per Minute
Concerts
per Month      0    1    2    3    4   Total
    0        .02  .05  .04  .01  .00    .12
    1        .02  .04  .06  .03  .00    .15
    2        .01  .03  .30  .04  .02    .40
    3        .00  .03  .10  .12  .08    .33
  Total      .05  .15  .50  .20  .10   1.00
SOLUTION:
(a) E[CD Sales] = 0(.05) + 1(.15) + 2(.50) + 3(.20) + 4(.10) = 2.15
(b) E[Concerts] = 0(.12) + 1(.15) + 2(.40) + 3(.33) = 1.94. So, the variance is
Var[Concerts] = 0²(.12) + 1²(.15) + 2²(.40) + 3²(.33) − 1.94² = 4.72 − 3.7636 = 0.9564.
(c) It appears that as Concerts increases, higher numbers of CD sales become more likely, so I
would guess that the covariance is positive.
(d) The covariance is Σ_Concerts Σ_CDs (Concerts)(CDs) P(Concerts, CDs) − μ_Concerts μ_CDs
= 4.64 − 1.94(2.15) = 0.469.
(e) We need to divide the covariance by the product of the two standard deviations. We found Var[Concerts] above as 0.9564. The variance of CDs is
0²(.05) + 1²(.15) + 2²(.50) + 3²(.20) + 4²(.10) − 2.15² = 5.55 − 4.6225 = 0.9275. So, the correlation is
ρ = 0.469 / (√0.9564 × √0.9275) ≈ 0.4980.
(f) The conditional probabilities given two concerts are P(CD=0|C=2) = .01/.40 = .025; P(CD=1|C=2) = .03/.40 = .075; P(CD=2|C=2) = .30/.40 = .75; P(CD=3|C=2) = .04/.40 = .10; P(CD=4|C=2) = .02/.40 = .05. With these, the conditional expected value is 0(.025) + 1(.075) + 2(.75) + 3(.10) + 4(.05) = 2.075. For three concerts, the conditional probabilities are .00/.33, .03/.33, .10/.33, .12/.33, and .08/.33, giving an expected value of (.03 + .20 + .36 + .32)/.33 = .91/.33 ≈ 2.758. No: the two conditional expected values are not the same, which is consistent with the positive correlation.
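Every quantity in this problem follows mechanically from the joint table; a sketch that recomputes them:

```python
# Moments, covariance, and correlation from the joint table of Problem 13.
# P[c][d] = P(Concerts = c, CDs sold per minute = d).
P = [[0.02, 0.05, 0.04, 0.01, 0.00],
     [0.02, 0.04, 0.06, 0.03, 0.00],
     [0.01, 0.03, 0.30, 0.04, 0.02],
     [0.00, 0.03, 0.10, 0.12, 0.08]]

p_cd = [sum(row[d] for row in P) for d in range(5)]  # CD marginal (column totals)
p_con = [sum(row) for row in P]                      # concert marginal (row totals)

mu_cd = sum(d * p for d, p in enumerate(p_cd))                       # 2.15
mu_con = sum(c * p for c, p in enumerate(p_con))                     # 1.94
var_cd = sum(d * d * p for d, p in enumerate(p_cd)) - mu_cd ** 2     # 0.9275
var_con = sum(c * c * p for c, p in enumerate(p_con)) - mu_con ** 2  # 0.9564
cov = (sum(c * d * P[c][d] for c in range(4) for d in range(5))
       - mu_con * mu_cd)                                             # 0.469
rho = cov / (var_con ** 0.5 * var_cd ** 0.5)                         # about 0.498

# Conditional expected CD sales given 2 and given 3 concerts.
e_given_2 = sum(d * P[2][d] for d in range(5)) / p_con[2]  # 2.075
e_given_3 = sum(d * P[3][d] for d in range(5)) / p_con[3]  # about 2.758
print(round(cov, 3), round(rho, 3), round(e_given_2, 3), round(e_given_3, 3))
```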
Expected Value and Variance
Discussed in class. See slides at the end of session 5.