result and analysis (part 2) research methodology

Upload: leo-lewis

Post on 16-Dec-2015

RESULT AND ANALYSIS (part 2)

RESEARCH METHODOLOGY

HYPOTHESIS TESTING

A hypothesis is a conjecture about a population parameter. This conjecture may or may not be true.

An educated guess based on theory and background information.

A proposed explanation for a phenomenon.

Hypothesis testing is a process of using sample data and statistical procedures to decide whether to reject or not reject a hypothesis (statement) about a population parameter value.

Examples

Whether seat belts will reduce the severity of injuries caused by accidents

Whether the public prefers a certain colour in the fabric lining

Whether adding a chemical will improve water quality

Whether the average life expectancy for men in the next decade will be more than 100 years

Example hypothesis: "Education increases income."

This hypothesis states a positive relationship between the concepts "education" and "income."

This abstract or conceptual hypothesis cannot be tested. First, it must be operationalized or situated in the real world by rules of interpretation. Consider again the simple hypothesis "Education increases Income."

To test the hypothesis the abstract meaning of education and income must be derived or operationalized. The concepts should be measured. Education could be measured by "years of school completed" or "highest degree completed" etc. Income could be measured by "hourly rate of pay" or "yearly salary" etc.

Two types of statistical hypothesis

The Null Hypothesis: symbolised by Ho, states that there is no difference between a parameter and a specific value OR that there is no difference between two parameters. NULL means NO CHANGE. It is a statement of equality.

The Alternative Hypothesis: symbolised by Ha (or H1), states a specific difference between a parameter and a specific value OR states that there is a difference between two parameters. It is the TEST or research hypothesis.

Situation A: A researcher is interested in finding out whether a new medicine will have any undesirable side effects on the pulse rate of the patient. Will the pulse rate increase, decrease or remain unchanged? Since the researcher knows that the mean pulse rate of the population under study is 82 beats per minute, the hypotheses will be

Ho : µ = 82 (remains unchanged)
H1 : µ ≠ 82 (will be different)

This is a two-tailed test since the possible effect could be to raise or lower the pulse rate.

Situation B: A chemist invents an additive to increase the life of an automobile battery. The mean lifetime of an ordinary battery is 36 months. The hypotheses will be:

Ho : µ ≤ 36
Ha : µ > 36

The chemist is interested only in increasing the lifespan of the battery, so the alternative hypothesis is that the mean is larger than 36. The test is therefore called right-tailed.

Situation C: A contractor wishes to lower heating bills by using a special type of insulation in houses. If the average monthly bill is RM100, the hypotheses will be

Ho : µ ≥ RM 100
H1 : µ < RM 100

This is a left-tailed test since the contractor is only interested in reducing the bill.

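General procedure for testing a hypothesis (can be done statistically)

Step 1: State the hypotheses.

Step 2: Formulate an analysis plan: select a level of significance (e.g. 0.1, 0.05, 0.01), decide whether the test is one-tailed or two-tailed, and find the corresponding critical value.

Step 3: Analyze sample data to compute the test value.

Step 4: Interpret results, i.e. make the decision to reject or not to reject the null hypothesis. For a right-tailed test: if test value < critical value, do not reject Ho; if test value > critical value, reject Ho. For a left-tailed test the comparison is reversed, and for a two-tailed test the absolute test value is compared with the critical value.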
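The decision rule in Step 4 can be sketched in a few lines. This is a minimal illustration that assumes the test value is a z statistic (the procedure above does not fix the distribution); the function names `critical_value` and `decide` are hypothetical and not part of the original slides.

```python
from statistics import NormalDist


def critical_value(alpha, tail="two"):
    """Positive z critical value for a significance level (assumes a z-test)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # Two-tailed: alpha is split equally between both tails.
        return std_normal.inv_cdf(1 - alpha / 2)
    # One-tailed ("left" or "right"): all of alpha sits in one tail.
    return std_normal.inv_cdf(1 - alpha)


def decide(test_value, alpha, tail="two"):
    """Apply Step 4; tail is 'two', 'left' or 'right'."""
    crit = critical_value(alpha, tail)
    if tail == "two":
        reject = abs(test_value) > crit
    elif tail == "right":
        reject = test_value > crit
    else:  # left-tailed
        reject = test_value < -crit
    return "reject Ho" if reject else "do not reject Ho"


# Print one-tailed and two-tailed critical values for common levels.
for alpha in (0.10, 0.05, 0.01):
    one = critical_value(alpha, "right")
    two = critical_value(alpha, "two")
    print(f"alpha={alpha}: one-tailed {one:.3f}, two-tailed {two:.3f}")
```

For a t-test or F test the same logic applies, but the critical value would come from the corresponding distribution table instead of the standard normal.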

Significant difference

A significant difference occurs if the difference between the hypothesized (null) value and the sample statistic value is too large to be attributed to chance. A significant difference strongly suggests that the null hypothesis is not true.

A significant difference at p < 0.05 means that, if the null hypothesis were true, a difference at least as large as the one observed would occur by chance less than 5% of the time.

TESTING THE DIFFERENCE AMONG MEANS AND VARIANCE

Situations:

To compare the average lifetime of two different brands of tires

Two different brands of fertilizer, to test whether one is better than the other for growing plants

Two brands of cough syrup, to test whether one brand is more effective than the other

Problem 1: Two-Tailed Test

Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is equally effective for men and women. To test this claim, they choose a simple random sample of 100 women and 200 men from a population of 100,000 volunteers.

At the end of the study, 38% of the women caught a cold; and 51% of the men caught a cold. Based on these findings, can we reject the company's claim that the drug is equally effective for men and women? Use a 0.05 level of significance.

Solution:

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.

Null hypothesis: P1 = P2
Alternative hypothesis: P1 ≠ P2

Note that these hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the proportion from population 1 is too big or if it is too small.

Formulate an analysis plan. For this analysis, the significance level is 0.05. The test method is a two-proportion z-test.

Analyze sample data. Using sample data, we calculate the pooled sample proportion (p) and the standard error (SE). Using those measures, we compute the z-score test statistic (z).

p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 * 200)] / (100 + 200) = 140/300 = 0.467

SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] } = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt [0.003733] = 0.061

z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13

where p1 is the sample proportion in sample 1 (women), p2 is the sample proportion in sample 2 (men), n1 is the size of sample 1, and n2 is the size of sample 2.

Since we have a two-tailed test, the P-value is the probability that the z-score is less than -2.13 or greater than 2.13.

We use the Normal Distribution Calculator to find P(z < -2.13) = 0.017, and P(z > 2.13) = 0.017. Thus, the P-value = 0.017 + 0.017 = 0.034.

Interpret results. Since the P-value (0.034) is less than the significance level (0.05), we reject the null hypothesis.

Problem 2: One-Tailed Test

Suppose the previous example is stated a little bit differently. Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is more effective for women than for men. To test this claim, they choose a simple random sample of 100 women and 200 men from a population of 100,000 volunteers.

At the end of the study, 38% of the women caught a cold; and 51% of the men caught a cold. Based on these findings, can we conclude that the drug is more effective for women than for men? Use a 0.01 level of significance.

Solution:

State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.

Null hypothesis: P1 >= P2
Alternative hypothesis: P1 < P2

Note that these hypotheses constitute a one-tailed test. The null hypothesis will be rejected if the proportion of women catching cold (p1) is sufficiently smaller than the proportion of men catching cold (p2).

Formulate an analysis plan. For this analysis, the significance level is 0.01. The test method is a two-proportion z-test.

Analyze sample data. Using sample data, we calculate the pooled sample proportion (p) and the standard error (SE). Using those measures, we compute the z-score test statistic (z).

p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) + (0.51 * 200)] / (100 + 200) = 140/300 = 0.467

SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] } = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt [0.003733] = 0.061

z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13

where p1 is the sample proportion in sample 1 (women), p2 is the sample proportion in sample 2 (men), n1 is the size of sample 1, and n2 is the size of sample 2.

Since we have a one-tailed test, the P-value is the probability that the z-score is less than -2.13. We use the Normal Distribution Calculator to find P(z < -2.13) = 0.017. Thus, the P-value = 0.017.

Interpret results. Since the P-value (0.017) is greater than the significance level (0.01), we cannot reject the null hypothesis.
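Both Acme Drug Company problems can be checked numerically. The sketch below follows the pooled two-proportion z-test described above; the function name `two_proportion_z` is a hypothetical helper, and the P-values come from Python's standard normal distribution instead of an online calculator.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference of two sample proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Sample 1: women (38% of 100 caught a cold); sample 2: men (51% of 200).
z = two_proportion_z(0.38, 100, 0.51, 200)

std_normal = NormalDist()
p_two_tailed = 2 * std_normal.cdf(-abs(z))  # Problem 1: P1 != P2
p_left_tailed = std_normal.cdf(z)           # Problem 2: P1 < P2

print(f"z = {z:.2f}")
print(f"two-tailed P-value  = {p_two_tailed:.3f} (compare with 0.05)")
print(f"left-tailed P-value = {p_left_tailed:.3f} (compare with 0.01)")
```

Because the code does not round the intermediate values, the P-values may differ from the hand calculation in the third decimal place; the decisions (reject at 0.05 in Problem 1, do not reject at 0.01 in Problem 2) are unaffected.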

Commonly used Methods

1. z-test

For detecting a difference between two means for large samples (two samples)

Assumptions required

The samples must be independent, that is, there is no relationship between the subjects in each sample

The populations from which the samples are drawn must be normally distributed

Example problem

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low? Calculate the z-score.

Solution

We begin by calculating the standard error (SE) of the mean:

SE = σ / sqrt(n) = 12 / sqrt(55) = 1.62

Next we calculate the z-score, which is the distance from the sample mean to the population mean in units of the standard error:

z = (M - µ) / SE = (96 - 100) / 1.62 = -2.47
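The standard error and z-score for the reading-test example can be computed as follows. This is a small sketch; `z_for_sample_mean` is a hypothetical helper name, and the lower-tail probability is added only to show how the z-score would be turned into a P-value.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """Distance of a sample mean from the population mean, in standard errors."""
    se = pop_sd / sqrt(n)  # standard error of the mean
    return (sample_mean - pop_mean) / se


# Regional mean 100, standard deviation 12; 55 students with mean 96.
z_reading = z_for_sample_mean(96, 100, 12, 55)

# Probability of a sample mean this low or lower if the school were
# a simple random sample from the region (left tail).
p_lower = NormalDist().cdf(z_reading)

print(f"z = {z_reading:.2f}, lower-tail P-value = {p_lower:.4f}")
```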

Problem

1. The mean and standard deviation of scores on a calculating test are 120 points and 18 points, respectively. Our interest is in the scores of 81 students in a particular school who received a mean score of 92. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 81 students from the region as a whole, or are their scores surprisingly low? Calculate the z-score.

2. Every year, 50,000 runners compete in the Peachtree Road Race. They run 10 kilometers (a little over 6 miles). The average finishing time is 55 minutes, with a standard deviation of 10 minutes. Fred and Wilma completed the race in 61 and 51 minutes, respectively. Barney and Betty had finishing times with z-scores of -0.3 and 0.7, respectively.

List the runners in order, starting with the fastest runner and ending with the slowest runner.
(A) Wilma, Barney, Fred, Betty
(B) Barney, Wilma, Fred, Betty
(C) Wilma, Barney, Betty, Fred
(D) Betty, Fred, Barney, Wilma
(E) None of the above

Solution

1. Calculate the standard error (SE) of the mean:

SE = σ / sqrt(n) = 18 / sqrt(81) = 18 / 9 = 2

Next we calculate the z-score:

z = (M - µ) / SE = (92 - 120) / 2 = -28 / 2 = -14

Solution

2. The answer is A. This problem can be solved by converting Fred and Wilma's raw scores into z-scores. To do this, we use the z-score equation:

z = (x - M) / sd

where z is the z-score, x is the runner's raw score (finishing time), M is the mean finishing time, and sd is the standard deviation of finishing times. Solving first for Fred's z-score, we get

z = (x - M) / sd = (61 - 55) / 10 = 0.6

Using the same approach to compute Wilma's z-score, we get

z = (x - M) / sd = (51 - 55) / 10 = -0.4

A lower z-score means a shorter finishing time, so we can order the runners from fastest to slowest as follows: Wilma (z = -0.4), Barney (z = -0.3), Fred (z = 0.6), and Betty (z = 0.7).
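The ordering of the four runners can be reproduced with a short script. It converts the two raw finishing times to z-scores and sorts all four runners by z-score, since a lower z-score means a faster time; the variable names are illustrative only.

```python
mean_time = 55  # average finishing time in minutes
sd_time = 10    # standard deviation in minutes

# Fred and Wilma are given as raw times; Barney and Betty as z-scores.
z_scores = {
    "Fred": (61 - mean_time) / sd_time,
    "Wilma": (51 - mean_time) / sd_time,
    "Barney": -0.3,
    "Betty": 0.7,
}

# Sort names from the lowest z-score (fastest) to the highest (slowest).
fastest_to_slowest = sorted(z_scores, key=z_scores.get)
print(fastest_to_slowest)
```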

Problem

Each year, a national achievement test is administered to 3rd graders. The test has a mean score of 100 and a standard deviation of 15. If Jane's z-score is 1.20, what was her score on the test?
(A) 82 (B) 88 (C) 100 (D) 112 (E) 118

Solution

The correct answer is (E). From the z-score equation, we know

z = (x - M) / sd

where z is the z-score, x is the value of Jane's test score, M is the mean test score, and sd is the standard deviation of test scores. Solving for Jane's test score (x), we get

x = (z * sd) + M = (1.20 * 15) + 100 = 18 + 100 = 118

2. F test

For the comparison of two variances or standard deviations, e.g. variation in cholesterol level in men and women

Assumptions

The populations from which the samples were obtained must be normally distributed

Samples must be independent of each other

Example problem

Consider an experiment to study the effect of three different levels of a factor on a response (e.g. three levels of a fertilizer on plant growth). If we had 6 observations for each level, we could write the outcome of the experiment in a table like this, where a1, a2, and a3 are the three levels of the factor being studied.

a1    a2    a3
6     8     13
8     12    9
4     9     11
5     11    8
3     6     7
4     8     12

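Solution

Step 1: Calculate the mean within each group:

Y1 = (6 + 8 + 4 + 5 + 3 + 4) / 6 = 5
Y2 = (8 + 12 + 9 + 11 + 6 + 8) / 6 = 9
Y3 = (13 + 9 + 11 + 8 + 7 + 12) / 6 = 10

Step 2: Calculate the overall mean:

Y = (Y1 + Y2 + Y3) / a = (5 + 9 + 10) / 3 = 8

where a is the number of groups.

Step 3: Calculate the "between-group" sum of squares:

SB = n(Y1 - Y)^2 + n(Y2 - Y)^2 + n(Y3 - Y)^2 = 6(5 - 8)^2 + 6(9 - 8)^2 + 6(10 - 8)^2 = 84

where n is the number of data values per group. The between-group degrees of freedom is one less than the number of groups

fb = 3 − 1 = 2

so the between-group mean square value is

MSB = 84 / 2 = 42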

Step 4: Calculate the "within-group" sum of squares. Begin by centering the data in each group:

a1            a2             a3
6 − 5 = 1     8 − 9 = -1     13 − 10 = 3
8 − 5 = 3     12 − 9 = 3     9 − 10 = -1
4 − 5 = -1    9 − 9 = 0      11 − 10 = 1
5 − 5 = 0     11 − 9 = 2     8 − 10 = -2
3 − 5 = -2    6 − 9 = -3     7 − 10 = -3
4 − 5 = -1    8 − 9 = -1     12 − 10 = 2

The within-group sum of squares is the sum of squares of all 18 values in this table

SW = 1 + 9 + 1 + 0 + 4 + 1 + 1 + 9 + 0 + 4 + 9 + 1 + 9 + 1 + 1 + 4 + 9 + 4 = 68

The within-group degrees of freedom is

fW = a(n − 1) = 3(6 − 1) = 15

Thus the within-group mean square value is

MSW = SW / fW = 68 / 15 ≈ 4.53

Step 5: The F-ratio is

F = MSB / MSW = 42 / 4.53 ≈ 9.26
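The five steps can be traced in code using the table of observations for a1, a2 and a3. This is a sketch that assumes equal group sizes, as in the example; `one_way_f` is a hypothetical function name.

```python
groups = {
    "a1": [6, 8, 4, 5, 3, 4],
    "a2": [8, 12, 9, 11, 6, 8],
    "a3": [13, 9, 11, 8, 7, 12],
}


def one_way_f(groups):
    """Return (SB, SW, F) for groups of equal size."""
    samples = list(groups.values())
    a = len(samples)     # number of groups
    n = len(samples[0])  # observations per group (assumed equal)

    # Step 1 and Step 2: group means and overall mean.
    means = [sum(s) / n for s in samples]
    overall = sum(means) / a

    # Step 3: between-group sum of squares and mean square.
    ss_between = n * sum((m - overall) ** 2 for m in means)
    ms_between = ss_between / (a - 1)

    # Step 4: within-group sum of squares and mean square.
    ss_within = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
    ms_within = ss_within / (a * (n - 1))

    # Step 5: the F-ratio.
    return ss_between, ss_within, ms_between / ms_within


ss_b, ss_w, f_ratio = one_way_f(groups)
print(f"SB = {ss_b}, SW = {ss_w}, F = {f_ratio:.2f}")
```

The resulting F-ratio would then be compared with the critical F value for 2 and 15 degrees of freedom at the chosen significance level.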

3. t-test

To test the difference between two means for small independent samples (n < 30)

Assumptions

Samples must be independent

The populations are normally distributed

CORRELATION AND REGRESSION

Correlation is a statistical method used to determine whether a relationship between variables exists. Correlation attempts to study the strength of the mutual relationship between two variables. In correlation we assume that the variables are random and dependence of any nature is not involved.

Regression describes the nature of the relationship between variables. Regression studies the relationship where dependence is necessarily involved: one variable depends on a certain number of other variables. Regression can be used for predicting the values of the variable which depends upon other variables.

Linear and Non Linear Correlation

Linear Correlation: Correlation is said to be linear if the ratio of change is constant. For example, if the amount of output in a factory is doubled by doubling the number of workers, the correlation is linear. In other words, if all the points on the scatter diagram tend to lie near a straight line, the correlation is said to be linear, as shown in the figure.

Non Linear (Curvilinear) Correlation: Correlation is said to be non linear if the ratio of change is not constant. In other words, if all the points on the scatter diagram tend to lie near a smooth curve, the correlation is said to be non linear (curvilinear), as shown in the figure.

Positive and Negative Correlation

Positive Correlation: Correlation in the same direction is called positive correlation. If one variable increases, the other also increases, and if one variable decreases, the other also decreases. For example, the length of an iron bar will increase as the temperature increases.

Negative Correlation: Correlation in opposite directions is called negative correlation. If one variable increases, the other decreases, and vice versa. For example, the volume of a gas will decrease as the pressure increases, or the demand for a particular commodity increases as its price decreases.

No Correlation or Zero Correlation: If there is no relationship between the two variables, such that the value of one variable changes while the other variable remains constant, it is called no correlation or zero correlation.

Perfect Correlation

If any change in the value of one variable changes the value of the other variable in a fixed proportion, the correlation between them is said to be perfect. It is indicated numerically as +1 or -1.

Perfect Positive Correlation: If the values of both variables move in the same direction in a fixed proportion, it is called perfect positive correlation. It is indicated numerically as +1.

Perfect Negative Correlation: If the values of both variables move in opposite directions in a fixed proportion, it is called perfect negative correlation. It is indicated numerically as -1.

Coefficient of Correlation

For sample data the correlation coefficient denoted by “r” is a measure of strength of the linear relation between X and Y variables, where “r” is a pure number and lies between -1 and +1.

Examples of Correlation

Calculate and analyze the correlation coefficient between the number of study hours and the number of sleeping hours of different students.

Solution: The necessary calculation is given below:

There is perfect negative correlation between the number of study hours and the number of sleeping hours.
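A correlation coefficient of this kind can be computed from the deviations of X and Y from their means. The sketch below uses hypothetical study and sleeping hours, chosen only so that the result is a perfect negative correlation; they are not the values from the original table, and `pearson_r` is an illustrative function name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data (assumption): each extra 2 hours of study
# goes with exactly 1 hour less sleep.
study_hours = [2, 4, 6, 8, 10]
sleep_hours = [10, 9, 8, 7, 6]

r = pearson_r(study_hours, sleep_hours)
print(f"r = {r:.2f}")
```

Because the hypothetical points lie exactly on a falling straight line, r comes out at the lower limit of its range.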

Problem

From the following data, compute the coefficient of correlation between X and Y:

Summation of products of deviations of X and Y series from their arithmetic means = 122.

Solution:

LINEAR REGRESSION

If the plot of n pairs of data (x , y) for an experiment appears to indicate a "linear relationship" between y and x, then the method of least squares may be used to write a linear relationship between x and y.

The least squares regression line for the set of n data points is given by

y = ax + b

where a and b are given by

a = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2)

b = (1/n)(Σy - a Σx)

Example

Consider the following set of points: {(-2 , -1) , (1 , 1) , (3 , 2)}

a) Find the least squares regression line for the given data points.

b) Plot the given points and the regression line in the same rectangular system of axes.

Solutions

a) Let us organize the data in a table.

x     y     xy    x^2
-2    -1    2     4
1     1     1     1
3     2     6     9
Σx = 2    Σy = 2    Σxy = 9    Σx^2 = 14

We now use the above formula to calculate a and b as follows

a = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2) = (3*9 - 2*2) / (3*14 - 2^2) = 23/38

b = (1/n)(Σy - a Σx) = (1/3)(2 - (23/38)*2) = 5/19

b) We now graph the regression line given by y = ax + b and the given points.
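The calculation of a and b for the three example points can be verified with exact fractions. This is a minimal sketch of the least squares formulas used in the example; `least_squares_line` is a hypothetical function name, and it assumes integer coordinates so that `Fraction` stays exact.

```python
from fractions import Fraction


def least_squares_line(points):
    """Return slope a and intercept b of y = ax + b as exact fractions."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)

    # a = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2), b = (Sy - a*Sx) / n
    a = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
print(f"y = ({a})x + ({b})")
```

Using `Fraction` keeps the result in the same form as the hand calculation; converting with `float(a)` and `float(b)` gives decimal values suitable for plotting.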

Problems

2 a) Find the least squares regression line for the following set of data {(-1 , 0),(0 , 2),(1 , 4),(2 , 5)}
b) Plot the given points and the regression line in the same rectangular system of axes.

3 The values of x and their corresponding values of y are shown in the table below

a) Find the least squares regression line y = ax + b.
b) Estimate the value of y when x = 10.

4 The sales of a company (in million dollars) for each year are shown in the table below.

a) Find the least square regression line y = ax + b. b) Use the least squares regression line as a model to estimate the sales of the company in 2012.


Multiple Regression

Several independent variables and one dependent variable

Y' = a + b1x1 + b2x2 + ... + bkxk

Assumptions for multiple regression

For any specific value of the independent variable, the values of the y variable are normally distributed (normality assumption)

The variances or standard deviations for the y variable are the same for each value of the independent variable (equal variance assumption)

There is a linear relationship between the dependent variable and the independent variables (linearity assumption)

The independent variables are not correlated

The values for the y variable are independent

NON-PARAMETRIC TEST

Z, F and t-tests are parametric: they apply when data are normally distributed

When data are not normally distributed, a non-parametric test is more appropriate

Also called distribution-free statistics

Non-parametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences; in terms of level of measurement, for data on an ordinal scale.

Advantages & Disadvantages

Advantages of Non Parametric Test

Can be used when the variable is not normally distributed
Can be used when the sample is small
Can be used to test hypotheses
The computation is easier
Easier to understand

Disadvantages

Less sensitive
Less information
Less efficient

USING MODELS

Be sure of the data requirements and the needs of the study

Consists of 4 main steps: model formulation, model optimization, model calibration/verification, model application

Model Formulation
Involves empirical and theoretical evidence
Make assumptions to reduce the problem to a manageable form (simplification of the process)

Model Optimization
Regression analysis: the analytical way
Subjective optimization: based on the experience of the modelers

Model Calibration
Changing the coefficients
Reduce the error between observed and predicted values

Model Application
After the model has been calibrated and validated