12. inference about two populations

7/28/2019 12. Inference About Two Populations

Inference about Two Populations


Introduction

A variety of techniques are presented whose objective is to compare two populations.

We are interested in:

The difference between two means.
The difference between two proportions.


INFERENCE ABOUT THE DIFFERENCE BETWEEN TWO SAMPLES: INDEPENDENT SAMPLES

POPULATION 1
Parameters: $\mu_1$, $\sigma_1^2$
Statistics: $\bar{x}_1$, $s_1^2$
Sample size: $n_1$

POPULATION 2
Parameters: $\mu_2$, $\sigma_2^2$
Statistics: $\bar{x}_2$, $s_2^2$
Sample size: $n_2$


Inference about the Difference between Two Means: Independent Samples

Two random samples are drawn from the two populations of interest.

Because we compare two population means, we use the statistic $\bar{X}_1 - \bar{X}_2$.


The Sampling Distribution of $\bar{X}_1 - \bar{X}_2$

1. $\bar{X}_1 - \bar{X}_2$ is normally distributed if the (original) population distributions are normal.

2. $\bar{X}_1 - \bar{X}_2$ is approximately normally distributed if the (original) populations are not normal, but the sample sizes are sufficiently large (greater than 30).

3. The expected value of $\bar{X}_1 - \bar{X}_2$ is $\mu_1 - \mu_2$.

4. The variance of $\bar{X}_1 - \bar{X}_2$ is $\sigma_1^2/n_1 + \sigma_2^2/n_2$.
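Properties 3 and 4 can be checked by simulation. The following is a minimal sketch, assuming normal populations; the population means, standard deviations, sample sizes and the helper name `simulate_mean_differences` are hypothetical values chosen only for illustration.

```python
import random
import statistics


def simulate_mean_differences(mu1, sigma1, n1, mu2, sigma2, n2, reps, seed=0):
    """Draw `reps` pairs of independent normal samples and return the
    observed differences between the two sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x1_bar = statistics.fmean([rng.gauss(mu1, sigma1) for _ in range(n1)])
        x2_bar = statistics.fmean([rng.gauss(mu2, sigma2) for _ in range(n2)])
        diffs.append(x1_bar - x2_bar)
    return diffs


# Hypothetical population values, chosen only for illustration.
mu1, sigma1, n1 = 50.0, 4.0, 10
mu2, sigma2, n2 = 45.0, 6.0, 15

diffs = simulate_mean_differences(mu1, sigma1, n1, mu2, sigma2, n2, reps=20000)

theoretical_mean = mu1 - mu2                       # property 3
theoretical_var = sigma1**2 / n1 + sigma2**2 / n2  # property 4

print("simulated mean:", statistics.fmean(diffs), "theory:", theoretical_mean)
print("simulated variance:", statistics.variance(diffs), "theory:", theoretical_var)
```

With many repetitions the simulated mean and variance should land close to the theoretical values, up to simulation noise.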


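Making an inference about $\mu_1 - \mu_2$

If the sampling distribution of $\bar{X}_1 - \bar{X}_2$ is normal or approximately normal we can write:

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Z can be used to build a test statistic or a confidence interval for $\mu_1 - \mu_2$.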


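Making an inference about $\mu_1 - \mu_2$

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Practically, the Z statistic is hardly used, because the population variances $\sigma_1^2$ and $\sigma_2^2$ are not known.

Instead, we construct a t statistic using the sample variances ($S_1^2$ and $S_2^2$) in place of $\sigma_1^2$ and $\sigma_2^2$.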


Making an inference about $\mu_1 - \mu_2$: $\sigma_1^2$ and $\sigma_2^2$ unknown case

Two cases are considered when producing the t-statistic:

The two unknown population variances are equal.
The two unknown population variances are not equal.


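Inference about $\mu_1 - \mu_2$: Equal variances

Calculate the pooled variance estimate by:

$$S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Example: $s_1^2 = 25$; $s_2^2 = 30$; $n_1 = 10$; $n_2 = 15$. Then,

$$S_p^2 = \frac{(10 - 1)(25) + (15 - 1)(30)}{10 + 15 - 2} = 28.04347$$

[Figure: the pooled variance estimator $S_p^2$ shown relative to $S_1^2$ ($n_1 = 10$) and $S_2^2$ ($n_2 = 15$)]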
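The pooled variance calculation can be sketched in a few lines of Python. The function name `pooled_variance` is an assumption made for this illustration; the inputs are the example values $s_1^2 = 25$, $s_2^2 = 30$, $n_1 = 10$, $n_2 = 15$.

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimate of a common variance from two sample variances,
    weighting each by its degrees of freedom."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


sp_sq = pooled_variance(s1_sq=25, n1=10, s2_sq=30, n2=15)
print(round(sp_sq, 4))
```

Because the estimate is a weighted average, it always falls between the two sample variances, closer to the one from the larger sample.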




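Construct the t-statistic as follows:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \qquad \text{d.f.} = n_1 + n_2 - 2$$

Perform a hypothesis test
H0: $\mu_1 - \mu_2 = 0$
H1: $\mu_1 - \mu_2 > 0$, or $< 0$, or $\neq 0$

Build a confidence interval

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $1 - \alpha$ is the confidence level.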


EXAMPLE

The statistics obtained from random sampling are given as

$n_1 = 8$, $\bar{x}_1 = 93$, $s_1 = 20$
$n_2 = 9$, $\bar{x}_2 = 129$, $s_2 = 24$

It is thought that $\mu_1 < \mu_2$. Test the appropriate hypothesis assuming normality with $\alpha = 0.01$.


SOLUTION

$\sigma_1$ and $\sigma_2$ are unknown, so use a t-test.

Because $s_1$ and $s_2$ are not much different from each other, use the equal-variance t-test.

H0: $\mu_1 = \mu_2$
HA: $\mu_1 < \mu_2$ (or $\mu_1 - \mu_2 < 0$)


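$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(7)20^2 + (8)24^2}{8 + 9 - 2} \approx 494$$

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(93 - 129) - 0}{\sqrt{494\left(\dfrac{1}{8} + \dfrac{1}{9}\right)}} = -3.33$$

Decision Rule: Reject H0 if $t < -t_{0.01,\, 8+9-2} = -2.602$.

Conclusion: Since $t = -3.33 < -t_{0.01,\, 8+9-2} = -2.602$, reject H0 at $\alpha = 0.01$.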
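The equal-variances test for this example can be reproduced from the summary statistics. This is a minimal sketch: the function name `pooled_t` is an assumption, and the critical value 2.602 is copied from the decision rule rather than computed.

```python
import math


def pooled_t(x1_bar, s1, n1, x2_bar, s2, n2, hypothesized_diff=0.0):
    """Equal-variances t statistic from summary statistics.
    Returns (pooled variance, t statistic, degrees of freedom)."""
    sp_sq = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    std_err = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    t = ((x1_bar - x2_bar) - hypothesized_diff) / std_err
    return sp_sq, t, n1 + n2 - 2


sp_sq, t, df = pooled_t(x1_bar=93, s1=20, n1=8, x2_bar=129, s2=24, n2=9)

T_CRIT = 2.602           # t_{0.01,15}, taken from the slide, not computed here
reject_h0 = t < -T_CRIT  # left-tailed test because H_A is mu1 < mu2

print(f"sp^2 = {sp_sq:.2f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```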


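Test Statistic for $\mu_1 - \mu_2$ when $\sigma_1 \neq \sigma_2$ and unknown

Test Statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

with the degrees of freedom

$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$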


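Inference about $\mu_1 - \mu_2$: Unequal variances

Conduct a hypothesis test as needed, or build a confidence interval.

Confidence interval:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\, \nu}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where $1 - \alpha$ is the confidence level and $\nu$ is the number of degrees of freedom for the unequal variances case.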


Which case to use: Equal variance or unequal variance?

Whenever there is insufficient evidence that the variances are unequal, it is preferable to perform the equal variances t-test.

This is so because, for any two given samples:

The number of degrees of freedom for the equal variances case $\geq$ the number of degrees of freedom for the unequal variances case
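The degrees-of-freedom comparison can be illustrated numerically. The sketch below assumes the usual unequal-variances degrees-of-freedom formula and applies it to sample variances and sizes from two earlier examples in these slides; the function name `welch_df` is hypothetical.

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Degrees of freedom for the unequal-variances t statistic."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


# (s1^2, n1, s2^2, n2) taken from two worked examples in these slides.
cases = {
    "pooled-variance example": (25, 10, 30, 15),
    "two-sample example (n = 8, 9)": (20**2, 8, 24**2, 9),
}

for name, (s1_sq, n1, s2_sq, n2) in cases.items():
    equal_df = n1 + n2 - 2
    unequal_df = welch_df(s1_sq, n1, s2_sq, n2)
    print(f"{name}: equal-variances df = {equal_df}, "
          f"unequal-variances df = {unequal_df:.2f}")
```

More degrees of freedom give a smaller critical value, which is why the equal-variances test is preferred when its assumption is tenable.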


Example: Making an inference about $\mu_1 - \mu_2$

Do people who eat high-fiber cereal for breakfast consume, on average, fewer calories for lunch than people who do not eat high-fiber cereal for breakfast?

A sample of 30 people was randomly drawn. Each person was identified as a consumer or a non-consumer of high-fiber cereal.

For each person the number of calories consumed at lunch was recorded.


Consumers  Non-cmrs
568        705
498        819
589        706
681        509
540        613
646        582
636        601
739        608
539        787
596        573
607        428
529        754
637        741
617        628
633        537
555        748
...        ...

Solution:

The data are interval.

The parameter to be tested is the difference between two means.

The claim to be tested is: The mean caloric intake of consumers ($\mu_1$) is less than that of non-consumers ($\mu_2$).



The hypotheses are:

H0: $(\mu_1 - \mu_2) = 0$
H1: $(\mu_1 - \mu_2) < 0$

To check whether the population variances are equal, we use computer output to find the sample variances.

We have $s_1^2 = 1274.49$ and $s_2^2 = 13{,}386.49$.

It appears that the variances are unequal.



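Compute: Manually

From the data we have:

$\bar{x}_1 = 595.8$; $\bar{x}_2 = 661.1$
$s_1 = 35.7$; $s_2 = 115.7$

$$\text{df} = \frac{\left(35.7^2/10 + 115.7^2/20\right)^2}{\dfrac{\left(35.7^2/10\right)^2}{10 - 1} + \dfrac{\left(115.7^2/20\right)^2}{20 - 1}} = 25.01$$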


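Compute: Manually

The rejection region is $t < -t_{\alpha,\nu} = -t_{.05,25} \approx -1.708$.

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(595.8 - 661.1) - 0}{\sqrt{\dfrac{35.7^2}{10} + \dfrac{115.7^2}{20}}} = -2.31$$

Since $t = -2.31 < -1.708$, the test statistic falls in the rejection region.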
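The unequal-variances computation for the cereal example can be sketched as follows, using the summary statistics given in the slides ($n_1 = 10$, $n_2 = 20$). The function name `welch_t` is an assumption, and the critical value 1.708 is copied from the rejection region rather than computed.

```python
import math


def welch_t(x1_bar, s1, n1, x2_bar, s2, n2, hypothesized_diff=0.0):
    """Unequal-variances t statistic and its degrees of freedom,
    computed from summary statistics."""
    a, b = s1**2 / n1, s2**2 / n2
    t = ((x1_bar - x2_bar) - hypothesized_diff) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t, df


t, df = welch_t(x1_bar=595.8, s1=35.7, n1=10, x2_bar=661.1, s2=115.7, n2=20)

T_CRIT = 1.708           # t_{.05,25}, taken from the slide, not computed here
reject_h0 = t < -T_CRIT  # left-tailed test: H1 is mu1 - mu2 < 0

print(f"t = {t:.2f}, df = {df:.2f}, reject H0: {reject_h0}")
```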


MINITAB OUTPUT

Two Sample T-Test and Confidence Interval

Twosample T for Consumers vs Non-cmrs

            N    Mean   StDev  SE Mean
Consumers  10   595.8    35.7       11
Non-cmrs   20     661     116       26

95% C.I. for mu Consumers - mu Non-cmrs: ( -123, -7)
T-Test mu Consmers = mu Non-cmrs (vs


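Compute: Manually

The confidence interval estimator for the difference between two means is

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,\nu}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (604.02 - 633.239) \pm 1.9796\sqrt{\frac{4103}{43} + \frac{10670}{107}} = -29.21 \pm 27.65 = [-56.86,\ -1.56]$$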



Example

An ergonomic chair can be assembled using two different sets of operations (Method A and Method B).

The operations manager would like to know whether the assembly times under the two methods differ.


Example

Two samples are randomly and independently selected:

A sample of 25 workers assembled the chair using method A.
A sample of 25 workers assembled the chair using method B.

The assembly times were recorded.

Do the assembly times of the two methods differ?


Example: Making an inference about $\mu_1 - \mu_2$

Assembly times in minutes

Method A  Method B
6.8       5.2
5.0       6.7
7.9       5.7
5.2       6.6
7.6       8.5
5.0       6.5
5.9       5.9
5.2       6.7
6.5       6.6
...       ...

Solution

The data are interval.

The parameter of interest is the difference between two population means.

The claim to be tested is whether a difference between the two methods exists.


Solution: Making an inference about $\mu_1 - \mu_2$

Compute: Manually

The hypotheses are:

H0: $(\mu_1 - \mu_2) = 0$
H1: $(\mu_1 - \mu_2) \neq 0$

To check whether the two unknown population variances are equal we calculate $S_1^2$ and $S_2^2$.

We have $s_1^2 = 0.8478$ and $s_2^2 = 1.3031$.

The two population variances appear to be equal.


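Making an inference about $\mu_1 - \mu_2$

Compute: Manually

To calculate the t-statistic we have:

$\bar{x}_1 = 6.288$, $\bar{x}_2 = 6.016$, $s_1^2 = 0.8478$, $s_2^2 = 1.3031$

$$S_p^2 = \frac{(25 - 1)(0.848) + (25 - 1)(1.303)}{25 + 25 - 2} = 1.076$$

$$t = \frac{(6.288 - 6.016) - 0}{\sqrt{1.076\left(\dfrac{1}{25} + \dfrac{1}{25}\right)}} = 0.93, \qquad \text{d.f.} = 25 + 25 - 2 = 48$$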


Solution

For $\alpha = 0.05$:

The rejection region is $t < -t_{\alpha/2,\nu} = -t_{.025,48} = -2.009$ or $t > t_{\alpha/2,\nu} = t_{.025,48} = 2.009$.

[Figure: t distribution with rejection regions below -2.009 and above 2.009; the test statistic 0.93 lies between them]

CONCLUSION: Since $-2.009 < t = 0.93 < 2.009$, there is insufficient evidence to reject the null hypothesis.


Making an inference about $\mu_1 - \mu_2$

t-Test: Two-Sample Assuming Equal Variances

                              Method A  Method B
Mean                              6.29      6.02
Variance                        0.8478    1.3031
Observations                        25        25
Pooled Variance                   1.08
Hypothesized Mean Difference         0
df                                  48
t Stat                            0.93
P(T

p-value (two-tail): .3584 > .05
t critical (two-tail): -2.0106 < .93 < +2.0106


Making an inference about $\mu_1 - \mu_2$

Conclusion: There is no evidence to infer at the 5% significance level that the two assembly methods are different in terms of assembly time.


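Making an inference about $\mu_1 - \mu_2$

A 95% confidence interval for $\mu_1 - \mu_2$ is calculated as follows:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = (6.288 - 6.016) \pm 2.0106\sqrt{1.075\left(\frac{1}{25} + \frac{1}{25}\right)} = 0.272 \pm 0.5896 = [-0.3176,\ 0.8616]$$

Thus, at the 95% confidence level, $-0.3176 < \mu_1 - \mu_2 < 0.8616$.

Notice: Zero is included in the confidence interval.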
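The test statistic and confidence interval for the assembly-time example can be reproduced from the summary statistics. This is a minimal sketch; the critical value 2.0106 is copied from the calculation above rather than computed.

```python
import math

# Summary statistics for Method A (1) and Method B (2) from the slides.
x1_bar, s1_sq, n1 = 6.288, 0.8478, 25
x2_bar, s2_sq, n2 = 6.016, 1.3031, 25

# Pooled variance and the standard error of the difference in means.
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
std_err = math.sqrt(sp_sq * (1 / n1 + 1 / n2))

t_stat = ((x1_bar - x2_bar) - 0) / std_err

T_CRIT = 2.0106  # t_{.025,48}, taken from the slides, not computed here
margin = T_CRIT * std_err
lower = (x1_bar - x2_bar) - margin
upper = (x1_bar - x2_bar) + margin

print(f"t = {t_stat:.2f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
print("zero inside interval:", lower < 0 < upper)
```

A two-tailed test at level $\alpha$ fails to reject H0 exactly when zero lies inside the $1 - \alpha$ confidence interval, so the two results agree.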


Checking the required conditions for the equal variances case

The data appear to be approximately normal.

[Figure: histograms of assembly times for Design A and Design B]


ANALYSIS OF PAIRED DATA

What is a matched pair experiment?

Why are matched pairs experiments needed?

How do we deal with data produced in this way?

The following example demonstrates a situation where a matched pair experiment is the correct approach to test the difference between two population means.


Solution

Compare two populations of interval data.

The parameter tested is $\mu_1 - \mu_2$.

$\mu_1$: The mean of the highest salary offered to Finance MBAs
$\mu_2$: The mean of the highest salary offered to Marketing MBAs

H0: $(\mu_1 - \mu_2) = 0$
H1: $(\mu_1 - \mu_2) > 0$

Finance  Marketing
61,228   73,361
51,836   36,956
20,620   63,627
73,356   71,069
84,186   40,203
...      ...



Solution continued

From the data we have:

$\bar{x}_1 = 65{,}624$, $\bar{x}_2 = 60{,}423$
$s_1^2 = 360{,}433{,}294$, $s_2^2 = 262{,}228{,}559$

Let us assume equal variances.

Equal Variances

                                Finance  Marketing
Mean                              65624      60423
Variance                      360433294  262228559
Observations                         25         25
Pooled Variance               311330926
Hypothesized Mean Difference          0
df                                   48
t Stat                             1.04
P(T


The effect of a large sample variability

Question

The difference between the sample means is 65624 - 60423 = 5,201. So, why could we not reject H0 and favor H1, where $(\mu_1 - \mu_2 > 0)$?


Reducing the variability

The values each sample consists of might markedly vary...

[Figure: the range of observations in sample A and the range of observations in sample B]


Reducing the variability

...but the differences between pairs of observations might be quite close to one another, resulting in a small variability of the differences.

[Figure: the range of the differences, plotted around 0]


Analysis of Paired Data

Since the difference of the means is equal to the mean of the differences, we can rewrite the hypotheses in terms of $\mu_D$ (the mean of the differences) rather than in terms of $\mu_1 - \mu_2$.

This formulation has the benefit of a smaller variability.

Group 1  Group 2  Difference
10       12       -2
15       11       +4

Mean1 = 12.5, Mean2 = 11.5
Mean1 - Mean2 = 1
Mean of Differences = 1


Analysis of Paired Data

Data are generated from matched pairs, not independent samples.

Let $X_i$ and $Y_i$ denote the measurements for the i-th subject. Thus, $(X_i, Y_i)$ is a matched pair of observations.

Denote $D_i = Y_i - X_i$ or $X_i - Y_i$.

If there are n subjects studied, we have $D_1, D_2, \ldots, D_n$. Then,

$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} \quad \text{and} \quad s_D^2 = \frac{\sum_{i=1}^{n} D_i^2 - n\bar{D}^2}{n - 1}, \qquad s_{\bar{D}}^2 = \frac{s_D^2}{n}$$


CONFIDENCE INTERVAL FOR $\mu_D = \mu_1 - \mu_2$

A $100(1 - \alpha)\%$ C.I. for $\mu_D = \mu_1 - \mu_2$ is given by:

$$\bar{x}_D \pm t_{\alpha/2,\, n-1}\frac{s_D}{\sqrt{n}}$$

For $n \geq 30$, we can use z instead of t.


HYPOTHESIS TESTS FOR $\mu_D = \mu_1 - \mu_2$

The test statistic for testing a hypothesis about $\mu_D$ is given by

$$t = \frac{\bar{x}_D - \mu_D}{s_D/\sqrt{n}}$$

with $n - 1$ degrees of freedom.


EXAMPLE

Sample data on attitudes before and after viewing an informational film.

Subject  Before  After  Difference
i        X_i     Y_i    D_i = Y_i - X_i
1        41      46.9    5.9
2        60.3    64.5    4.2
3        23.9    33.3    9.4
4        36.2    36     -0.2
5        52.7    43.5   -9.2
6        22.5    56.8   34.3
7        67.5    60.7   -6.8
8        50.3    57.3    7
9        50.9    65.4   14.5
10       24.6    41.9   17.3


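90% CI for $\mu_D = \mu_1 - \mu_2$:

$\bar{D} = 7.64$, $s_D = 12.57$, $t_{\alpha/2,\, n-1} = t_{0.05,\, 9} = 1.833$

$$\bar{D} \pm t_{\alpha/2,\, n-1}\frac{s_D}{\sqrt{n}} = 7.64 \pm 1.833\,\frac{12.57}{\sqrt{10}}$$

$$0.36 \leq \mu_D = \mu_1 - \mu_2 \leq 14.92$$

With 90% confidence, the mean attitude measurement after viewing the film exceeds the mean attitude measurement before viewing by between 0.36 and 14.92 units.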
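The paired-data interval can be recomputed from the raw before/after measurements. This is a minimal sketch using Python's standard library; the critical value 1.833 is copied from the slide rather than computed.

```python
import math
import statistics

# Attitude measurements from the table, in subject order.
before = [41, 60.3, 23.9, 36.2, 52.7, 22.5, 67.5, 50.3, 50.9, 24.6]
after = [46.9, 64.5, 33.3, 36, 43.5, 56.8, 60.7, 57.3, 65.4, 41.9]

# D_i = Y_i - X_i (after minus before) for each subject.
diffs = [y - x for x, y in zip(before, after)]
n = len(diffs)

d_bar = statistics.fmean(diffs)  # mean of the differences
s_d = statistics.stdev(diffs)    # sample standard deviation (n - 1 divisor)

T_CRIT = 1.833  # t_{0.05,9}, taken from the slide, not computed here
margin = T_CRIT * s_d / math.sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(f"D-bar = {d_bar:.2f}, s_D = {s_d:.2f}")
print(f"90% CI = [{lower:.2f}, {upper:.2f}]")

# The mean of the differences equals the difference of the means.
print(statistics.fmean(after) - statistics.fmean(before))
```

The endpoints agree with the slide's interval up to rounding of the intermediate values.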


EXAMPLE

How can we design an experiment to show which of two types of tires is better? Install one type of tire on one (front) wheel and the other type on the other (front) wheel. The average difference in tire lifetime distance (in 1000s of miles) is

$\bar{X}_D = 4.55$

with a sample standard deviation of the differences of

$s_D = 7.22$

There are a total of n = 20 observations.


SOLUTION

H0: $\mu_D = 0$
HA: $\mu_D > 0$

Test Statistic:

$$t = \frac{\bar{x}_D - \mu_D}{s_D/\sqrt{n}} = \frac{4.55 - 0}{7.22/\sqrt{20}} = 2.82$$

Reject H0 if $t > t_{.05,19} = 1.729$.

Conclusion: Reject H0 at $\alpha = 0.05$.


EXAMPLE

It is claimed that an industrial safety program is effective in reducing the loss of working hours due to factory accidents. The following data are collected concerning the weekly loss of working hours due to accidents in six plants both before and after the safety program is instituted.


Loss of working hours

Plant    1   2   3   4   5   6
Before  12  30  15  37  29  15
After   10  29  16  35  26  16

Do the data substantiate the claim? Use $\alpha = 0.05$.


ANSWER

This is a matched pair experiment because the samples from the two populations are not independent.

Loss of working hours

Plant        1   2   3   4   5   6
Difference   2   1  -1   2   3  -1

$\bar{x}_D = 1$, $s_D = 1.67$, $n = 6$


Let $\mu_1$ denote the average loss of working hours due to factory accidents before the safety program.

Let $\mu_2$ denote the average loss of working hours due to factory accidents after the safety program.

Also let $\mu_D = \mu_1 - \mu_2$. Then,

H0: $\mu_D = 0$
HA: $\mu_D > 0$


Test statistic:

$$t = \frac{\bar{x}_D - \mu_D}{s_D/\sqrt{n}} = \frac{1 - 0}{1.67/\sqrt{6}} = 1.47$$

Rejection region: $t > t_{\alpha,\, n-1} = t_{0.05,5} = 2.015$

Conclusion: Do not reject H0 at $\alpha = 0.05$ because $t = 1.47 < t_{0.05,5} = 2.015$. There is not sufficient evidence to conclude that the mean loss of working hours due to factory accidents is reduced after the safety program.
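The paired t test for the safety-program data can be sketched directly from the before/after table. The critical value 2.015 is copied from the rejection region above rather than computed.

```python
import math
import statistics

# Weekly loss of working hours in the six plants.
before = [12, 30, 15, 37, 29, 15]
after = [10, 29, 16, 35, 26, 16]

# D = before - after, so a positive mean difference indicates a reduction.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

d_bar = statistics.fmean(diffs)
s_d = statistics.stdev(diffs)
t_stat = (d_bar - 0) / (s_d / math.sqrt(n))

T_CRIT = 2.015               # t_{0.05,5}, taken from the slide
reject_h0 = t_stat > T_CRIT  # right-tailed test: H_A is mu_D > 0

print(f"differences = {diffs}")
print(f"D-bar = {d_bar:.2f}, s_D = {s_d:.2f}, t = {t_stat:.2f}, "
      f"reject H0: {reject_h0}")
```

Working from the unrounded standard deviation gives a t value slightly below the slide's 1.47, which uses $s_D$ rounded to 1.67; the conclusion is the same.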


PAIRED DATA AND TWO SAMPLE t PROCEDURE

The two-sample t test is based on the assumption of independence.

In many paired experiments, there is a strong dependence between variables.


Inference About the Difference of Two Population Proportions

Population 1
Parameter: $p_1$
Statistic: $\hat{p}_1$
Sample size: $n_1$

Population 2
Parameter: $p_2$
Statistic: $\hat{p}_2$
Sample size: $n_2$


Inference about the difference between two population proportions

In this section we deal with two populations whose data are nominal.

For nominal data we compare the population proportions of the occurrence of a certain event.

Examples:
Comparing the effectiveness of a new drug versus an older one
Comparing market share before and after an advertising campaign
Comparing defective rates between two machines


Parameter and Statistic

Parameter
When the data are nominal, we can only count the occurrences of a certain event in the two populations, and calculate proportions.
The parameter is therefore $p_1 - p_2$.

Statistic
An unbiased estimator of $p_1 - p_2$ is $\hat{p}_1 - \hat{p}_2$ (the difference between the sample proportions).


Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

Two random samples are drawn from two populations.
The number of successes in each sample is recorded.
The sample proportions are computed.

Sample 1
Sample size $n_1$
Number of successes $x_1$
Sample proportion $\hat{p}_1 = \dfrac{x_1}{n_1}$

Sample 2
Sample size $n_2$
Number of successes $x_2$
Sample proportion $\hat{p}_2 = \dfrac{x_2}{n_2}$


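SAMPLING DISTRIBUTION OF $\hat{p}_1 - \hat{p}_2$

A point estimator of $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 = \frac{x_1}{n_1} - \frac{x_2}{n_2}$$

The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is

$$\hat{p}_1 - \hat{p}_2 \sim N\!\left(p_1 - p_2,\ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right)$$

if $n_i p_i \geq 5$ and $n_i(1 - p_i) \geq 5$, $i = 1, 2$.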


The z-statistic

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}}$$

Because $p_1$ and $p_2$ are unknown, the standard error must be estimated using the sample proportions. The method depends on the null hypothesis.


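Testing $p_1 - p_2$

There are two cases to consider:

Case 1: H0: $p_1 - p_2 = 0$
Calculate the pooled proportion

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Then

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Case 2: H0: $p_1 - p_2 = D$ (D is not equal to 0)
Do not pool the data

$$\hat{p}_1 = \frac{x_1}{n_1}, \qquad \hat{p}_2 = \frac{x_2}{n_2}$$

Then

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$$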


EXAMPLE (CASE 1)

A manufacturer claims that, compared with his closest competitor, fewer of his employees are union members. Of 318 of his employees, 117 are unionists. From a sample of 255 of the competitor's labor force, 109 are union members. Perform a test at $\alpha = 0.05$.

$p_1$: the proportion of the manufacturer's employees that are union members.
$p_2$: the proportion of his closest competitor's employees that are union members.


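SOLUTION

H0: $p_1 - p_2 = 0$
HA: $p_1 - p_2 < 0$

$\hat{p}_1 = \dfrac{x_1}{n_1} = \dfrac{117}{318}$ and $\hat{p}_2 = \dfrac{x_2}{n_2} = \dfrac{109}{255}$, so the pooled sample proportion is

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{117 + 109}{318 + 255} = 0.39$$

Test Statistic:

$$z = \frac{(117/318 - 109/255) - 0}{\sqrt{(0.39)(1 - 0.39)\left(\dfrac{1}{318} + \dfrac{1}{255}\right)}} = -1.4518$$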


Decision Rule: Reject H0 if $z < -z_{0.05} = -1.645$.

Conclusion: Because $z = -1.4518 > -z_{0.05} = -1.645$, do not reject H0 at $\alpha = 0.05$. There is insufficient evidence to support the manufacturer's claim.
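The Case 1 (pooled) test for the union-membership example can be sketched as below. The function name `pooled_z` is an assumption, and the critical value 1.645 is copied from the decision rule rather than computed.

```python
import math


def pooled_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0, using the pooled proportion
    to estimate the standard error (Case 1)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_err


z = pooled_z(x1=117, n1=318, x2=109, n2=255)

Z_CRIT = 1.645           # z_{0.05}, taken from the slide
reject_h0 = z < -Z_CRIT  # left-tailed test: H_A is p1 - p2 < 0

print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

The sketch keeps the pooled proportion unrounded, so its z value is about -1.45 and differs slightly in later decimals from the slide's -1.4518, which rounds the pooled proportion to 0.39.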


Testing $p_1 - p_2$ (Case 1)

The marketing manager needs to decide which of two new packaging designs to adopt, to help improve sales of his company's soap.

A study is performed in two supermarkets:

Brightly-colored packaging is distributed in supermarket 1.
Simple packaging is distributed in supermarket 2.

The first design is more expensive; therefore, to be financially viable it has to outsell the second design.


Summary of the experiment results

Supermarket 1 - 180 purchasers of Johnson Brothers soap out of a total of 904
Supermarket 2 - 155 purchasers of Johnson Brothers soap out of a total of 1,038

Use a 5% significance level and perform a test to find which type of packaging to use.



Solution

The problem objective is to compare the population of sales of the two packaging designs.

The data are nominal (Johnson Brothers or other soap).

Population 1: purchases at supermarket 1
Population 2: purchases at supermarket 2

The hypotheses are

H0: $p_1 - p_2 = 0$
H1: $p_1 - p_2 > 0$

We identify this application as case 1.



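Compute: Manually

For a 5% significance level the rejection region is $z > z_\alpha = z_{.05} = 1.645$.

The sample proportions are

$$\hat{p}_1 = \frac{180}{904} = .1991 \quad \text{and} \quad \hat{p}_2 = \frac{155}{1{,}038} = .1493$$

The pooled proportion is

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{180 + 155}{904 + 1{,}038} = .1725$$

The z statistic becomes

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{.1991 - .1493}{\sqrt{.1725(1 - .1725)\left(\dfrac{1}{904} + \dfrac{1}{1{,}038}\right)}} = 2.90$$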


Testing $p_1 - p_2$ (Case 1)

Excel (Data Analysis Plus)

z-Test: Two Proportions

                         Supermarket 1  Supermarket 2
Sample Proportions              0.1991         0.1493
Observations                       904           1038
Hypothesized Difference              0
z Stat                            2.90
P(Z

Conclusion: There is sufficient evidence to conclude at the 5% significance level that the brightly-colored design will outsell the simple design.


The bath soap of Johnson Brother Company is not selling well. Hoping to improve sales, the company's advertising agency developed two new designs. The first design features several bright colors and the second design is light green in color with the company's logo on it. Management needs to decide which of the two new packaging designs to adopt to help improve sales of the soap.

A study is performed in two supermarkets.

For the brightly-colored design to be financially viable it has to outsell the simple design by at least 3%.



Summary of the experiment results

Supermarket 1 - 180 purchasers of Johnson Brothers soap out of a total of 904
Supermarket 2 - 155 purchasers of Johnson Brothers soap out of a total of 1,038

Use a 5% significance level and perform a test to find which type of packaging to use.



Solution

The hypotheses to test are

H0: $p_1 - p_2 = .03$
H1: $p_1 - p_2 > .03$

We identify this application as case 2 (the hypothesized difference is not equal to zero).



Testing $p_1 - p_2$ (Case 2)

Compute: Manually

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}} = \frac{\left(\dfrac{180}{904} - \dfrac{155}{1{,}038}\right) - .03}{\sqrt{\dfrac{.1991(1 - .1991)}{904} + \dfrac{.1493(1 - .1493)}{1{,}038}}} = 1.15$$

The rejection region is $z > z_\alpha = z_{.05} = 1.645$.

Conclusion: Since 1.15 < 1.645, do not reject the null hypothesis. There is insufficient evidence to infer that the brightly-colored design will outsell the simple design by 3% or more.
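The two cases can be compared side by side on the supermarket data. This is a minimal sketch: Case 1 pools the proportions because the hypothesized difference is zero, while Case 2 uses the separate sample proportions with D = 0.03. The critical value 1.645 is copied from the slides rather than computed.

```python
import math

# Purchasers of the soap and total customers in each supermarket.
x1, n1 = 180, 904   # supermarket 1 (brightly-colored design)
x2, n2 = 155, 1038  # supermarket 2 (simple design)
p1_hat, p2_hat = x1 / n1, x2 / n2

# Case 1: H0: p1 - p2 = 0, standard error from the pooled proportion.
p_hat = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z_case1 = ((p1_hat - p2_hat) - 0) / se_pooled

# Case 2: H0: p1 - p2 = D with D != 0, standard error without pooling.
D = 0.03
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_case2 = ((p1_hat - p2_hat) - D) / se_unpooled

Z_CRIT = 1.645  # z_{.05}, taken from the slides

print(f"Case 1: z = {z_case1:.2f}, reject H0: {z_case1 > Z_CRIT}")
print(f"Case 2: z = {z_case2:.2f}, reject H0: {z_case2 > Z_CRIT}")
```

With unrounded proportions the Case 2 statistic falls between the 1.15 of the manual calculation and the 1.14 of the Excel output; both lead to the same decision.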


Testing $p_1 - p_2$ (Case 2)

Using Excel (Data Analysis Plus)

z-Test: Two Proportions

                         Supermarket 1  Supermarket 2
Sample Proportions              0.1991         0.1493
Observations                       904           1038
Hypothesized Difference           0.03
z Stat                            1.14
P(Z


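ESTIMATING $p_1 - p_2$

$100(1 - \alpha)\%$ Confidence Interval for $p_1 - p_2$:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$

where $\hat{q}_i = 1 - \hat{p}_i$.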


EXAMPLE

An antibiotic for pneumonia was injected into 100 patients with kidney malfunctions (called uremic patients) and 100 patients with no kidney malfunctions (called normal patients). Some allergic reaction developed in 38 of the uremic patients and 21 of the normal patients.


a) Do the data provide strong evidence that the rate of incidence of allergic reaction to the antibiotic is higher in uremic patients than in normal patients?

Let $p_1$: the rate of incidence of allergic reaction to the antibiotic in uremic patients and
$p_2$: the rate of incidence of allergic reaction to the antibiotic in normal patients.

b) Construct a 95% confidence interval for the difference between the population proportions and interpret the result.
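One way to approach both parts is sketched below, assuming part a) is treated as a Case 1 test (H0: $p_1 - p_2 = 0$ against H1: $p_1 - p_2 > 0$) and part b) uses the unpooled standard error. The value 1.96 is the standard normal $z_{.025}$; the slides do not state a significance level for part a), so the sketch only reports the statistic.

```python
import math

# Allergic reactions observed in each group of 100 patients.
x1, n1 = 38, 100  # uremic patients
x2, n2 = 21, 100  # normal patients
p1_hat, p2_hat = x1 / n1, x2 / n2

# a) Pooled z statistic for H0: p1 - p2 = 0 vs H1: p1 - p2 > 0.
p_hat = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se_pooled

# b) 95% confidence interval using the unpooled standard error.
Z_HALF_ALPHA = 1.96  # z_{.025} from the standard normal table
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
margin = Z_HALF_ALPHA * se_unpooled
lower = (p1_hat - p2_hat) - margin
upper = (p1_hat - p2_hat) + margin

print(f"z = {z:.3f}")
print(f"95% CI for p1 - p2 = [{lower:.4f}, {upper:.4f}]")
```

The z value can then be compared with the critical value for whichever significance level is chosen, and the interval read as the plausible range for the difference in reaction rates.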
