Statistics & Data Analysis

Course Number B01.1305

Course Section 31

Meeting Time Wednesday 6-8:50 pm

CLASS #5

Professor S. D. Balkin -- Feb. 26, 2003

- 2 -

Class #5 Outline

Understand random sampling and systematic bias

Derive theoretical distribution of summary statistics

Understand the Central Limit Theorem

Use a normal probability plot to assess normality


- 3 -

Review of Last Class

Special Distributions
• Counting problems
• Binomial distribution problems
• Normal distribution problems

CHAPTER 6

Random Sampling and Sampling Distributions


- 5 -

Chapter Goals

Explain why in many situations a sample is the only way to learn something about a population

Explain the various methods of selecting a sample

Define and construct sampling distribution of sample means

Understand sources of bias or under-representation in data


- 6 -

A Scenario

It’s 9:00 AM on Wednesday and your boss sent you an email asking how your firm’s customers would react to a new price discounting program
• Your report is due tomorrow
• It takes 10 minutes to interview a single customer in your database of almost 2,000
• What will you do?

Draw a sample of the customers
• How will you draw the sample?
• Need a representative sample
• Does your database hold a representative sample?


- 7 -

Background

Some previous chapters emphasized methods for describing data
• Created frequency distributions, computed averages and measures of dispersion

Started to lay foundation for inference by studying probability
• Counting, Binomial, and Normal Distributions
• Probability distributions encompass all possible outcomes of an experiment and the probability associated with each outcome

So far, we’ve learned how to describe something that has already occurred or evaluate something that might occur


- 8 -

How are these similar…

QC department needs to check the tensile strength of steel wire
• Five small pieces are selected every 5 hours
• Tensile strength of each piece is determined

Marketing needs to determine the sales potential of a new drug named HappyPill
• 452 consumers were asked to try it for a week
• Each consumer completed a questionnaire

Polling agency selects 2,000 voters at random and asks their approval rating of the President

In the study of insider trading, 25 CEOs were identified by the SEC and their trades were monitored for three years


- 9 -

Why Sample???

Destructive nature of some tests

Physical Impossibility of checking all items

Cost of studying all items

Adequacy of sample results

Contacting whole population would be too time-consuming


- 10 -

Types of Samples

Cross-sectional: samples are taken from an underlying population at a particular time

Time-series: samples are taken over time from a random process

Enumerative Studies: sampling from a well-defined population

Analytic Studies: look at the results of a random process to predict future behavior


- 11 -

Why Sample???

We often need to know something about a large population.
• What is the average income of all Stern students?

It’s often too expensive and time-consuming to examine the entire population

Solution: Choose a small random sample and use the methods of statistical inference to draw conclusions about the population

Sampling lets us dramatically cut the costs of gathering information, but requires care. We need to ensure that the sample is representative of the population of interest

But how can any small sample be completely representative?


- 12 -

Why Sample (cont.)

IT IS IMPORTANT TO REALIZE THAT SOME INFORMATION IS LOST IF WE ONLY EXAMINE A SAMPLE OF THE ENTIRE POPULATION

Why not just use the sample mean in place of μ? For example, suppose that the average income of 100 randomly selected Stern students was x̄ = 62,154
• Can we conclude that the average income of ALL Stern students (μ) is 62,154?
• Can we conclude that μ > 60,000?

Fortunately, we can use probability theory to understand how the process of taking a random sample will blur the information in a population

But first, we need to understand why and how the information is blurred


- 13 -

Sampling Variability

Although the average income of all Stern Langone students is a fixed number, the average of a sample of 100 students depends on precisely which sample is taken. In other words, the sample mean is subject to “sampling variability”

The problem is that by reporting sample mean alone, we don’t take account of the variability caused by the sampling procedure. If we had polled different students, we might have gotten a different average income

It would be a serious mistake to ignore this sampling variability, and simply assume that the mean income of all students is the same as the average of the 100 incomes given in the sample


- 14 -

Populations and Samples

You are considering opening an Atomic Wings in Bethlehem, PA
• POPULATION: All residents
• SAMPLE:
  • Every 35th person at the mall
  • Every 2,000th person in the phone book
  • Every person who leaves Burger King
  • Don’t forget to include the college students!!!


- 15 -

Choosing a Representative Sample

REPRESENTATIVE: Each characteristic occurs in the same percentage in the sample as in the population

BIAS: Not representative
• Bias will exist if there is a systematic tendency to over/under represent some part of the population

By deliberately not sampling based on any specific characteristic, a randomly selected sample will typically be free from bias

Randomly selecting subjects lets you make probability statements about the results


- 16 -

Examples of Bias

Selection Bias:
• A telephone survey of households conducted entirely between 9 a.m. and 5 p.m.
• Using a customer complaint database to query on the new discount program

Nonresponse Bias: Sample member refuses to participate
• Every market research program

Operational Definitions: Guiding a response
• Do you agree that taxes are too high in New York?


- 17 -

Simple Random Sampling

Process where each possible sample of a given size has the same probability of being selected

Example: IBM reported sales of $64.792 Billion and a net loss of $2.827 Billion for 1991.
• The number of individual transactions was enormous
• The auditors used statistics to choose a representative sample of transactions to check in detail


- 18 -

Choosing a Random Sample

1. Number every member in the population 1…N

2. Use a random process to select the sample: R, flipping a coin, a random number table…whatever is appropriate

In this class we will use the computer
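The two steps above can be sketched in R as follows. This is only an illustration: the population size N = 2000 and the sample size n = 50 are assumed values, not figures from the slides.

```r
# Step 1: number every member of the population 1..N
# (N = 2000 and n = 50 are hypothetical values chosen for illustration)
N <- 2000
n <- 50

# Step 2: let the computer pick the sample.
# sample() draws without replacement by default, so every possible
# subset of size n has the same chance of being selected.
set.seed(1305)                    # fixed seed only so the draw can be repeated
chosen <- sample(1:N, size = n)

sort(chosen)                      # ID numbers of the members to interview
```

Changing or removing the `set.seed()` line gives a different, equally valid simple random sample.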


- 19 -

Sampling Statistics and Distributions

Once a sample is drawn, we summarize it with sample statistics

The value of any summary statistic will vary from sample to sample (a big problem…no?)

A sample statistic is itself a random variable• Hence, it has a theoretical probability distribution called the sampling

distribution

We can find the mean and standard deviation of many random samples


- 20 -

Definition

If a random sample of size n is drawn from a population with mean μ and standard deviation σ, the expected value and standard error of the sample mean Ȳ are

$$E(\bar{Y}) = \mu, \qquad \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$$


- 21 -

Example

Suppose the long-run average of the number of Medicare claims submitted per week to a regional office is 62,000, and the standard deviation is 7,000.
• If we assume that the weekly claims submissions during a 4-week period constitute a random sample of size 4, what are the expected value and standard error of the average weekly number of claims over a 4-week period?

NOTE: Standard error denotes the theoretically derived standard deviation of the sampling distribution of a statistic.
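A short R sketch of this calculation, using the usual result that the sample mean has expected value μ and standard error σ/√n. The variable names are illustrative.

```r
# Medicare claims example: values taken from the slide
mu    <- 62000   # long-run weekly average
sigma <- 7000    # weekly standard deviation
n     <- 4       # weeks treated as a random sample

expected_mean  <- mu                # E(Ybar) = mu
standard_error <- sigma / sqrt(n)   # sigma / sqrt(n) = 7000 / 2

expected_mean
standard_error
```

Averaging over four weeks halves the variability relative to a single week.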


- 22 -

Standard Error

Standard Deviation of the statistic

Is interpreted just as you would any standard deviation

Indicates approximately how far the observed value of the statistic is from its mean
• Literally: it indicates the standard deviation you would find if you took a very large number of samples, found the sample average for each one, and worked with these sample averages as a data set


- 23 -

Example

Suppose n=200 randomly selected shoppers interviewed in a mall say they plan to spend an average of $19.42 today with a standard deviation of $8.63
• This tells you what shoppers typically plan to spend, and that a typical individual shopper plans to spend about $8.63 more or less than this amount
• So far, this is no more than a description of the individuals interviewed

We can say something about the unknown population mean, which is the mean amount that all shoppers in the mall today plan to spend, including those not interviewed.

What is the standard error of the mean?
• This tells us the variability when we use the sample average of $19.42 as an estimate of the unknown population mean
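As a rough check, the standard error here can be estimated in R by plugging the sample standard deviation s in place of the unknown σ (an approximation that is reasonable for n = 200).

```r
# Mall shoppers example: values taken from the slide
n <- 200
s <- 8.63               # sample standard deviation, used in place of sigma

se <- s / sqrt(n)       # estimated standard error of the sample mean
round(se, 2)
```

So an individual shopper varies by about $8.63, while the sample average of $19.42 varies by only about 61 cents from sample to sample.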


- 24 -

Sampling Distributions for Means and Sums

If a population distribution is Normal, then the sampling distribution of sample means is also Normal

Example: A timber company is planning to harvest 400 trees from a very large stand.
• A tree’s yield is determined by its diameter
• Distribution of diameters is normal with mean 44 inches and standard deviation of 4 inches
• Find the probability that the average diameter of the harvested trees is between 43.5 and 44.5 inches.
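One way to work this in R, treating the 400 trees as a random sample so that the average diameter is Normal with mean 44 and standard error 4/√400:

```r
# Timber example: values taken from the slide
mu    <- 44
sigma <- 4
n     <- 400

se <- sigma / sqrt(n)   # standard error of the average diameter

# P(43.5 < Ybar < 44.5) under Normal(mu, se)
p <- pnorm(44.5, mean = mu, sd = se) - pnorm(43.5, mean = mu, sd = se)
round(p, 4)
```

The limits sit 2.5 standard errors on either side of the mean, so nearly all of the probability falls between them.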


- 25 -

Example

It’s OK if each beer isn’t exactly 12 oz so long as the average volume isn’t too low or too high.
• In your production facility, you know that the volume of each beer follows a Normal distribution with a standard deviation of 0.5 ounces, representing variability about the mean of 12.01 oz.
• Any case (24 beers) that has an average volume per beer less than 11.75 ounces will be rejected.

What fraction of cases will be rejected this way?
• First find the mean and standard deviation of the average of n=24 beers
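A sketch of the calculation in R, assuming the 24 beers in a case behave like a random sample from the filling process:

```r
# Beer case example: values taken from the slide
mu    <- 12.01
sigma <- 0.5
n     <- 24

mean_of_avg <- mu                 # mean of the case average
se_of_avg   <- sigma / sqrt(n)    # standard deviation of the case average

# Fraction of cases whose average falls below the 11.75 oz cutoff
p_reject <- pnorm(11.75, mean = mean_of_avg, sd = se_of_avg)
p_reject
```

The cutoff is roughly 2.5 standard errors below the mean, so only about half a percent of cases would be rejected.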


- 26 -

Central Limit Theorem

For any population, the sampling distribution of the sample mean is approximately normal if the sample size is sufficiently large


- 27 -

Simulation Example

Use R to draw 1000 samples each, with sample sizes 4, 10, 30, and 60 from a highly right-skewed distribution having mean and standard deviation both equal to 1.

Display a histogram of the sample means

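# 1000 sample means, each from a sample of size 4 drawn from rexp()
# (exponential with rate 1, so mean = sd = 1)
data = numeric(1000)
for (i in 1:1000) data[i] = mean(rexp(4))
hist(data)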

What type of process might follow this distribution???
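To compare all four sample sizes side by side, the loop can be wrapped as below. This is a sketch rather than the code used in class; the seed and the 2-by-2 plot layout are arbitrary choices.

```r
set.seed(5)                       # arbitrary seed, for repeatability only
sizes <- c(4, 10, 30, 60)

# For each sample size, draw 1000 samples from Exp(1) and keep the means
means <- lapply(sizes, function(n) replicate(1000, mean(rexp(n))))
names(means) <- paste0("n=", sizes)

# One histogram per sample size
op <- par(mfrow = c(2, 2))
for (k in seq_along(sizes)) {
  hist(means[[k]], main = names(means)[k], xlab = "sample mean")
}
par(op)

# Spread of the sample means; theory says roughly 1/sqrt(n)
sapply(means, sd)
```

As n grows the histograms should look less skewed and more bell-shaped, and their spread should shrink.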


- 28 -

Example of Use

An agency of the Commerce Department in a certain state wishes to check the accuracy of weights in supermarkets

They decide to weigh 9 packages of ground meat labeled as 1 pound packages

They will investigate any supermarket where the average weight of the packages is less than 15.5 oz

Assuming that the standard deviation of package weights is 0.6 oz, what is the probability they will investigate an honest market?
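A possible R calculation, assuming an honest market means the true average weight is 16 oz (1 pound) and that the average of 9 packages is approximately Normal:

```r
# Supermarket example: values taken from the slide
mu    <- 16      # honest market: packages average 1 pound = 16 oz (assumption)
sigma <- 0.6
n     <- 9

se <- sigma / sqrt(n)     # standard error of the average of 9 packages

# Probability the sample average falls below the 15.5 oz trigger
p_investigate <- pnorm(15.5, mean = mu, sd = se)
round(p_investigate, 4)
```

With only 9 packages the Normal approximation leans on the package weights themselves being roughly Normal.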


- 29 -

Normal Probability Plot

Plots actual versus expected values, assuming a normal distribution

• Nearly normal data will plot as a near straight line

• Right-skewed data plot as a curve, with the slope getting steeper as one moves to the right

• Left-skewed data plot as a curve, with the slope getting flatter as one moves to the right

• Symmetric but outlier-prone data plot as an S-shape, with the slope steepest at both sides


- 30 -

R Examples

data = rnorm(1000) ## do not worry about the r*** commands
hist(data)
qqnorm(data)
qqline(data)

data = rexp(1000) ## right-skewed
hist(data)
qqnorm(data)
qqline(data)

data = 1-rlnorm(1000)+30 ## left-skewed
hist(data)
qqnorm(data)
qqline(data)

data = rnorm(1000); data[1]=5; data[2]=7; ## normal with two outliers
hist(data)
qqnorm(data)
qqline(data)

Point and Interval Estimation

Chapter 7


- 32 -

Review

Basic problem of statistical theory is how to infer a population or process value given only sample data

Any sample statistic will vary from sample to sample

Any sample statistic will differ from the true, population value

Must consider random error in sample statistic estimation


- 33 -

Chapter Goals

Summarize sample data• Choosing an estimator

• Unbiased estimator

Constructing confidence intervals for means with known standard deviation

Constructing confidence intervals for proportions

Determining how large a sample is needed

Constructing confidence intervals when standard deviation is not known

Understanding the key assumptions underlying confidence interval methods


- 34 -

Reminder: Statistical Inference

Problem of Inferential Statistics:
• Make inferences about one or more population parameters based on observable sample data

Forms of Inference:
• Point estimation: single best guess regarding a population parameter
• Interval estimation: Specifies a reasonable range for the value of the parameter
• Hypothesis testing: Isolating a particular possible value for the parameter and testing if this value is plausible given the available data


- 35 -

Point Estimators

Computing a single statistic from the sample data to estimate a population parameter

Choosing a point estimator:
• What is the shape of the distribution?
• Do you suspect outliers exist?
• Plausible choices:

• Mean

• Median

• Mode

• Trimmed Mean


- 36 -

Technical Definitions

ESTIMATOR: An estimator θ̂ of a parameter θ is a function of a random sample that yields a point estimate for θ. An estimator is itself a random variable and therefore it has a theoretical sampling distribution.

UNBIASED ESTIMATOR: An estimator θ̂ that is a function of the sample data is called unbiased for the population parameter θ if its expected value equals θ.

EFFICIENT ESTIMATOR: An estimator is called most efficient for a particular problem if it has the smallest standard error of all possible unbiased estimators.


- 37 -

Example

I used R to draw 1,000 samples, each of size 30, from a normally distributed population having mean 50 and standard deviation 10.

For each sample the mean and median are computed.

data.mean = numeric(1000)
data.median = numeric(1000)
for (i in 1:1000) {
  data = rnorm(30, mean=50, sd=10)   # one sample of size 30
  data.mean[i] = mean(data)
  data.median[i] = median(data)
}

Do these statistics appear unbiased?

Which is more efficient?
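One way to explore both questions numerically is sketched below. It repeats the simulation in a self-contained form; the seed and the name `sims` are illustrative choices.

```r
set.seed(37)   # arbitrary seed, for repeatability only

# 1000 samples of size 30 from Normal(50, 10); keep mean and median of each
sims <- replicate(1000, {
  x <- rnorm(30, mean = 50, sd = 10)
  c(mean = mean(x), median = median(x))
})

rowMeans(sims)       # unbiased: both averages should sit near 50
apply(sims, 1, sd)   # efficiency: the smaller standard error wins
```

For normally distributed data the sample mean should show the smaller spread, which is what makes it the more efficient of the two here.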


- 38 -

Expressing Uncertainty

Suppose we are trying to make inferences about a population mean μ based on a sample of size n.

The sample mean X̄ is a point estimator of the parameter μ. Used by itself, X̄ is of limited usefulness because it contains no information about its own reliability.

Furthermore, the reporting of X̄ alone may leave the false impression that X̄ estimates μ with complete accuracy.


- 39 -

Confidence Interval

An interval with random endpoints which contains the parameter of interest (in this case, μ) with a pre-specified probability, denoted by 1 - α.

The confidence interval automatically provides a margin of error to account for the sampling variability of the sample statistic.

Example: A machine is supposed to fill “12 ounce” bottles of Guinness. To see if the machine is working properly, we randomly select 100 bottles recently filled by the machine, and find that the average amount of Guinness is 11.95 ounces. Can we conclude that the machine is not working properly?


- 40 -

No! By simply reporting the sample mean, we are neglecting the fact that the amount of beer varies from bottle to bottle and that the value of the sample mean depends on the luck of the draw

It is possible that a value as low as 11.95 is within the range of natural variability for the sample mean, even if the average amount for all bottles is in fact μ = 12 ounces.

Suppose we know from past experience that the amounts of beer in bottles filled by the machine have a standard deviation of σ = 0.05 ounces.

Since n = 100, we can assume (using the Central Limit Theorem) that the sample mean is normally distributed with mean μ (unknown) and standard error 0.005

What does the Empirical Rule tell us about the sample mean?
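A small R sketch of the Empirical Rule applied here: about 95% of sample means fall within 2 standard errors of μ, so the range within 2 standard errors of the observed mean gives the plausible values of μ.

```r
# Guinness example: values taken from the slides
xbar  <- 11.95
sigma <- 0.05
n     <- 100

se <- sigma / sqrt(n)               # standard error of the sample mean
interval <- xbar + c(-2, 2) * se    # roughly 95% range for mu (Empirical Rule)
interval
```

With these numbers the range runs from about 11.94 to 11.96 ounces, which does not reach 12. It is this interval, not the sample mean alone, that lets us judge whether the machine is off target.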


- 41 -

Why does it work?

[Figure: two marked intervals. One is annotated "X̄ is in here about 95% of the time"; the other, drawn around X̄, is annotated as holding 95% of the time.]


- 42 -

Using the Empirical Rule Assuming Normality


- 43 -

Confidence Intervals

“Statistics is never having to say you're certain.”
• (Tee shirt, American Statistical Association)

Any sample statistic will vary from sample to sample

Point estimates are almost inevitably in error to some degree

Thus, we need to specify a probable range or interval estimate for the parameter


- 44 -

Confidence Interval

100(1 − α)% CONFIDENCE INTERVAL FOR μ, σ KNOWN

Using the sample mean as an estimate of the population mean, allow for sampling error with a plus-or-minus term equal to a z-table value times the standard error of the sample mean:

$$\bar{y} - z_{\alpha/2}\,\sigma_{\bar{Y}} \le \mu \le \bar{y} + z_{\alpha/2}\,\sigma_{\bar{Y}}$$


- 45 -

Example

An airline needs an estimate of the average number of passengers on a newly scheduled flight

Its experience is that data for the first month of flights are unreliable, but thereafter the passenger load settles down

The mean passenger load is calculated for the first 20 weekdays of the second month after initiation of this particular flight

If the sample mean is 112 and the population standard deviation is assumed to be 25, find a 90% confidence interval for the true, long-run average number of passengers on this flight
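The interval can be computed in R along these lines, using the known-σ formula (sample mean plus or minus a z-value times σ/√n). `qnorm(0.95)` supplies the z-value for 90% confidence.

```r
# Airline example: values taken from the slide
ybar  <- 112
sigma <- 25
n     <- 20
conf  <- 0.90

z  <- qnorm(1 - (1 - conf) / 2)   # about 1.645 for 90% confidence
se <- sigma / sqrt(n)             # standard error of the sample mean

ci <- ybar + c(-1, 1) * z * se
round(ci, 1)
```

The plus-or-minus term is a little over 9 passengers, so the interval runs from roughly 103 to 121.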


- 46 -

Interpretation

The confidence level of the confidence interval refers to the process of constructing confidence intervals

Each particular confidence interval either does or does not include the true value of the parameter being estimated

We can’t say that this particular estimate is correct to within the error

So, we say that we have a XX% confidence that the population parameter is contained in the interval

Or…the interval is the result of a process that in the long run has a XX% probability of being correct


- 47 -

Imagine Many Samples

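[Figure: many confidence intervals drawn against an axis running from 22 to 24. The population mean = 23.29. One interval is labeled "The interval you computed"; the rest are "Other intervals you might have computed". Two intervals that fail to cover the population mean are marked "Missed!".]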
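The idea of imagining many samples can be simulated in R. In this sketch only the population mean 23.29 comes from the slide; σ = 1, n = 25, and the number of repetitions are assumed values chosen for illustration.

```r
set.seed(47)      # arbitrary seed, for repeatability only

mu    <- 23.29    # population mean from the slide
sigma <- 1        # assumed population standard deviation
n     <- 25       # assumed sample size
reps  <- 1000     # number of imagined samples

z <- qnorm(0.975) # z-value for 95% confidence

# For each sample, does ybar +/- z * sigma/sqrt(n) cover mu?
covered <- replicate(reps, {
  ybar <- mean(rnorm(n, mean = mu, sd = sigma))
  abs(ybar - mu) <= z * sigma / sqrt(n)
})

mean(covered)     # long-run share of intervals that contain mu
```

The share should land near 0.95; the remaining intervals are the ones that would be marked as misses.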


- 48 -

Getting Realistic

The population standard deviation is rarely known

Usually both the mean and standard deviation must be estimated from the sample

Estimate σ with s

However…with this added source of random error, we need to handle this problem using the t-distribution (later on)


- 49 -

Confidence Intervals for Proportions

We can also construct confidence intervals for proportions of successes

Recall that the expected value and standard error for the sample proportion of successes are:

$$E(\hat{\pi}) = \pi; \qquad \sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$$

How can we construct a confidence interval for a proportion?


- 50 -

Example

Suppose that in a sample of 2,200 households with one or more television sets, 471 watch a particular network’s show at a given time.

Find a 95% confidence interval for the population proportion of households watching this show.
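A sketch of the usual large-sample interval in R: the sample proportion plus or minus a z-value times an estimated standard error, with the sample proportion plugged in for the unknown π.

```r
# Television example: values taken from the slide
x <- 471     # households watching the show
n <- 2200    # households sampled

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
z     <- qnorm(0.975)                    # z-value for 95% confidence

ci <- p_hat + c(-1, 1) * z * se
round(ci, 3)
```

This gives an interval from roughly 19.7% to 23.1% of households.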


- 51 -

Example

The 1992 presidential election looked like a very close three-way race at the time when news polls reported that of 1,105 registered voters surveyed:
• Perot: 33%
• Bush: 31%
• Clinton: 28%

Construct a 95% confidence interval for Perot’s share. What is the margin of error? What happened here?
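The margins of error for all three candidates can be laid out together in R. This sketch treats the poll as a simple random sample, which is an assumption; real polls often use more complex designs.

```r
# 1992 poll: values taken from the slide
n     <- 1105
share <- c(Perot = 0.33, Bush = 0.31, Clinton = 0.28)

se  <- sqrt(share * (1 - share) / n)   # estimated standard error of each share
moe <- qnorm(0.975) * se               # 95% margin of error for each share

round(cbind(lower = share - moe, share = share, upper = share + moe), 3)
```

Each margin of error is a bit under 3 percentage points, so the intervals for neighboring candidates overlap.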


- 52 -

Example

A survey conducted found that out of 800 people, 46% thought that Clinton’s first approved budget represented a major change in the direction of the country.

Another 45% thought it did not represent a major change.

Compute a 95% confidence interval for the percent of people who had a positive response.

What is the margin of error?

Interpret…


- 53 -

Choosing a Sample Size

Gathering information for a statistical study can be expensive, time consuming, etc.

So…the question of how much information to gather is very important

When considering a confidence interval for a population mean μ, there are three quantities to consider:

$$\bar{y} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$


- 54 -

Choosing a Sample Size (cont)

Tolerability Width: The margin of acceptable error
• ±3%
• ±$10,000

Derive the required sample size using:
• Margin of error (tolerability width)
• Confidence level (z-value)
• Standard deviation (given, assumed, or calculated)


- 55 -

Example

Union officials are concerned about reports of inferior wages being paid to employees of a company under its jurisdiction

How large a sample is needed to obtain a 90% confidence interval for the population mean hourly wage with width equal to $1.00? Assume that σ = 4.
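A hedged R sketch of the sample-size calculation, assuming the standard deviation is σ = 4 and reading "width" as the full width of the interval, so the margin of error is half of $1.00. Solving z·σ/√n = E for n gives n = (z·σ/E)², rounded up.

```r
# Union wages example
sigma <- 4        # assumed population standard deviation
width <- 1.00     # desired full width of the interval
conf  <- 0.90

E <- width / 2                    # margin of error (half-width), by assumption
z <- qnorm(1 - (1 - conf) / 2)    # z-value for 90% confidence

n_required <- ceiling((z * sigma / E)^2)   # round up to a whole employee
n_required
```

If the $1.00 were instead meant as the plus-or-minus term itself, E would be 1.00 and the required sample would be about a quarter as large.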


- 56 -

Example

A direct-mail company must determine its credit policies very carefully.

The firm suspects that advertisements in a certain magazine have led to an excessively high rate of write-offs.

The firm wants to establish a 90% confidence interval for this magazine’s write-off proportion that is accurate to 2.0%
• How many accounts must be sampled to guarantee this goal?
• If this many accounts are sampled and 10% of the sampled accounts are determined to be write-offs, what is the resulting 90% confidence interval?
• What kind of difference do we see by using an observed proportion over a conservative guess?
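The three questions can be sketched in R as below, reading "accurate to 2.0%" as a margin of error of 0.02 (an assumption). The conservative guess uses π = 0.5, which makes π(1 − π) as large as it can be.

```r
# Direct-mail example
E <- 0.02                  # margin of error, assuming "accurate to 2.0%" means +/- 0.02
z <- qnorm(0.95)           # z-value for 90% confidence

# Sample size that guarantees the goal: worst case pi = 0.5
n_conservative <- ceiling(z^2 * 0.5 * 0.5 / E^2)

# Resulting interval if 10% of those sampled accounts are write-offs
p_hat <- 0.10
moe   <- z * sqrt(p_hat * (1 - p_hat) / n_conservative)
ci    <- p_hat + c(-1, 1) * moe

# Sample size if the observed 10% had been used for planning instead
n_observed <- ceiling(z^2 * p_hat * (1 - p_hat) / E^2)

n_conservative
round(ci, 3)
n_observed
```

The interval built from the observed 10% comes out narrower than the 2% target, and planning with 10% instead of 0.5 would have called for far fewer accounts.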


- 57 -

Homework #5

Hildebrand/Ott
• 6.4
• 6.5
• 6.8
• 6.16
• 6.17
• 6.46
  • In (a) create a normal probability plot also and interpret
• 7.1
• 7.2
• 7.14
• 7.17
• 7.18
• 7.20
• 7.21
• 7.30
• Read Chapter 11

Verzani
