
    Fundamentals of Statistical

    Inference

    compiled by

Srilakshminarayana, G., M.Sc., Ph.D.

    Shri Dharmasthala Manjunatheswara Institute

    for Management Development

    #1 Chamundi Hill Road, Siddhartha Nagar, Mysore-570011

    (Private Circulation Only-September 2012)


Table of Contents

Important note about the material

1 Estimation
    1.1 Importance of estimation in management
    1.2 Key terms in estimation
    1.3 Determination of sample size
    1.4 Point estimator for population mean
        1.4.1 Steps in obtaining an estimate of population mean
    1.5 Point estimator for population variance
        1.5.1 Steps in calculating an estimate of population variance
    1.6 Role of sampling
    1.7 Sampling distribution of a Statistic
    1.8 Sampling error
    1.9 Point estimator for a population proportion
    1.10 Finding the best estimator
    1.11 Drawback of point estimate
    1.12 Interval estimation
    1.13 Probability of the true population parameter falling within the interval estimate
    1.14 Interval estimates and confidence intervals
    1.15 Relationship between confidence level and confidence interval
    1.16 Using sampling and confidence interval estimation
    1.17 Interval estimation of population mean (σ known)
    1.18 Using the Z statistic for estimating population mean
    1.19 Using finite correction factor for the finite population
    1.20 Interval estimation for difference of two means
    1.21 Confidence interval estimation of the population mean (σ unknown)
    1.22 Checking the assumptions
    1.23 Concept of degrees of freedom
    1.24 Confidence interval estimation for population proportion
    1.25 Estimation of the sample size
    1.26 Sample size for estimating population mean
    1.27 Sample size for estimating population proportion
    1.28 Sample size for an interval estimate of a population proportion
    1.29 Further discussion of sample size determination for a proportion

2 Testing of Hypothesis-Fundamentals
    2.1 Introduction
    2.2 Formats of Hypothesis
    2.3 The rationale for hypothesis testing
    2.4 Steps in hypothesis testing
    2.5 One tail and two tail tests
        2.5.1 One tailed test
        2.5.2 Two tailed test
    2.6 Critical region and non-critical region
    2.7 Errors in hypothesis testing
    2.8 Test for single mean
        2.8.1 Z-test for single mean (σ known case)
        2.8.2 Testing Using Excel
        2.8.3 t-test for single mean (σ unknown case)
        2.8.4 Testing Using Excel
    2.9 Test for single proportion
        2.9.1 Testing Using Excel
    2.10 Comparison and conclusion

3 Testing of hypothesis-Two sample problem
    3.1 Introduction
    3.2 Assumptions
    3.3 Test for difference of means: Z-test
        3.3.1 Testing Using Excel: σ₁² = σ₂² = σ² (known)
        3.3.2 Testing Using Excel: Unequal Variances (Known)
    3.4 Test for difference of means: t-test
        3.4.1 Testing Using Excel: σ₁² = σ₂² = σ² (Unknown)
        3.4.2 Testing Using Excel: Unequal Variances (Unknown)
    3.5 Test for difference of two proportions
        3.5.1 Testing Using Excel: Test for Difference of Proportions
    3.6 Test for dependent samples
        3.6.1 Testing Using Excel
    3.7 Test for difference of variances: F-Test
    3.8 Comparison and conclusion

4 Chi-Square tests
    4.1 Introduction
        4.1.1 Chi-square test for significance of a population variance
        4.1.2 Chi-square test for goodness of fit
        4.1.3 Chi-square test for independence of attributes
    4.2 Comparison and conclusion

5 Analysis of Variance (ANOVA)
    5.1 Introduction
    5.2 One way ANOVA
        5.2.1 Assumptions
        5.2.2 Steps for computing the F test value for ANOVA
    5.3 Two-Way Analysis of Variance
        5.3.1 Assumptions for the Two-Way ANOVA
    5.4 The Scheffe Test and the Tukey Test
        5.4.1 Scheffe Test
    5.5 Tukey Test

6 Correlation and Regression
    6.1 Testing significance of Correlation ρ = 0
    6.2 Testing significance of Correlation ρ ≠ 0
    6.3 Testing significance of correlation ρ₁ = ρ₂
    6.4 Testing significance of regression model

References


    Important note about the material

    This material is for internal circulation only and not a substitute for a text book.

    It contains only fundamental steps to be followed when inferential tools are used to

analyze the data. It is restricted to the needs of the present batch and does not contain the entire information about the topic. The complete information can be found in the prescribed text book and in other references.



    Chapter 1

    Estimation

    1.1 Importance of estimation in management

    Everyone makes estimates. When you are ready to cross a street, you estimate the

    speed of any car that is approaching, the distance between you and that car, and

    your own speed. Having made these quick estimates, you decide whether to wait,

    walk, or run. All managers must make quick estimates too. The outcome of these

    estimates can affect their organizations as seriously as the outcome of your decision as

    to whether to cross the street. University department heads make estimates of next

    sessions enrollment in Statistics. Credit managers estimate whether a purchaser will

    eventually pay his bills. Prospective home buyers make estimates concerning the

    behavior of interest rates in the mortgage market. All these people make estimates

without worrying about whether they are scientific, but with the hope that the estimates

    bear a reasonable resemblance to the outcome. Managers use estimates because in all

    but the most trivial decisions, they must make rational decisions without complete

    information and with a great deal of uncertainty about what the future will bring.

    How do managers use sample statistics to estimate population parameters? The

    department head attempts to estimate enrollments next fall from current enrollments


    in the same courses. The credit manager attempts to estimate the creditworthiness of

    prospective customers from a sample of their past payment habits. The home buyer

    attempts to estimate the future course of interest rates by observing the current

    behavior of those rates. In each case, somebody is trying to infer something about a

    population from information taken from a sample. This chapter introduces methods

    that enable us to estimate with reasonable accuracy the population proportion (the

    proportion of the population that possesses a given characteristic) and population

mean. To calculate the exact proportion or the exact mean would be an impossible goal.

    Even so, we will be able to make an estimate, make a statement about the error that

will probably accompany this estimate, and implement some controls to avoid as much

    of the error as possible. As decision makers, we will be forced at times to rely on

    blind hunches. Yet in other situations, in which information is available and we apply

    statistical concepts, we can do better than that.

    Let us start with a small discussion on why a management student should study

statistical methods to estimate unknown quantities. Estimates are made at the low, middle, and high levels of management. At any level, we first understand the present carefully, then look into the past to see what has happened, list all the options the past suggests, and choose the best from the available options. The option that best suits the present is taken as the solution.

For example, a manager of a production unit wishes to estimate the items to be

    produced for the current year and depending on his estimate he wishes to place an

order for the raw materials. He takes his records and looks into the items produced and the raw materials used to produce them. Finally, he uses his experience to take a decision on the items to be produced for the current year and, depending on the estimate, prepares an order for the raw materials. But what is the guarantee that

    the value he estimated is free of error? How can he justify that the actual requirement

    is close to the value he estimated? There is a chance that the value he estimated using


his experience may be an overestimate or an underestimate. How can he convince his

    boss that the value he chose will yield the organization better profits? If everything

goes fine, then no one will blame him. It would be better if life were free of uncertainty, but it is not. The manager should take care of the uncertainty

    associated with the estimate obtained using his experience. At this stage one can

argue that he is experienced and can specify a range instead of a single

value. The statement could be "Maybe this time the requirement lies between 10000 and 15000." Even now there is an amount of uncertainty associated with this, because the word "maybe" signals it. One can continue the argument. Finally, what is it that we want? We want a statement that the requirement for the current year may lie between 10000 and 15000 and that the chance of it falling outside this range is 0.05. How could we get this chance of 0.05? It comes from the systematic procedures available in statistics. Why does the manager need it? Because he has to report to his boss saying that the requirement

    for the current year lies between limits A and B. The boss is much concerned about

    satisfying the needs of the customers and if anything goes wrong it is he who will be

    targeted first. To avoid this the manager can use the statistical techniques available

    and provide the range along with a chance. This is again done taking the past data

into consideration. Systematic construction requires understanding the past carefully

    and choosing an appropriate tool for the given situation. One has to choose the

    technique based on the study variable under consideration. This is because the tools

    used for a quantitative variable cannot be used for a qualitative variable without

    proper adjustment.

    The example discussed above is from a production unit. Similarly, let us consider

    marketing. The sales executive has to report to his boss the number of packets of

    oil he will sell this month. What will he do when his boss asks him about this? He

    immediately says that he will sell 150 packets of oil this month. How did he say this?


He didn't use statistics to do this; he just used his experience. There is

no point in taking an Excel sheet and using a statistical procedure to give this number.

    He used his common sense, past experience and market conditions. He is sure that

the market needs at least 150 packets this month, and he already sold 145 packets last month. He also has complete knowledge of his competitors' sales in the market.

Taking all these factors into consideration, he could easily estimate the current month's

    sales. Let us consider another example. Suppose that this time the sales executive

has been promoted to sales manager. Now he has to estimate the sales of the entire

region. Now the problem is that he is the manager, not a sales executive. He has to

    take the data from the sales executives of the entire region and then he has to estimate

    the sales for the current year. Depending on this estimate he has to build a strategy

to increase the sales. In the previous case he could get by on his experience and common sense. But now he is a manager and cannot take any risk. He can still use his experience, but this time only to develop a proper strategy. He should take the help of statistical methods in order to give a proper estimate and to construct a better

strategy. What will he do? He will take the data from the sales executives, average all the values, adjust that average according to market conditions, and finally give an estimate. Is it a good estimate? What adjustment should he make to the average to give an estimate that convinces his boss? The answer is very simple.

    According to the statistical theory, the sample average best estimates the population

    average. Here the population average is the sales for the current year and the sample

    average is the value he calculated after obtaining the data from his executives. What

    about the adjustment? The adjustment is to construct an interval associated with a

    probability value, take a value within the interval and consider it as an estimate.

    Let us consider the case in Human resource management (HRM). Suppose that

the HR manager wishes to know about the performance of the new appraisal system


developed to appraise the employees. Since the organization has thousands of employees, it is apparent that she can't take the opinion of all the employees. She has

to take a sample of employees and consider their opinion. Here the estimator is the sample proportion of employees who are against the system, which estimates the

    population proportion. The variable under consideration is a qualitative variable and

    the appropriate estimator is the sample proportion.

    Management is a discipline which uses statistical tools to support the decisions

relating to various business situations. Most of the time the decision maker will be left with some amount of data relating to the given situation, on which he is supposed to base a decision. It is always desirable to use the data obtained and take an appropriate decision. One important aspect in decision making is estimation. This is a part of

    statistical inference. Estimation is a systematic way of understanding the behavior of

    unknown population characteristics based on a sample. These characteristics include

    all the descriptive statistics related to a properly defined population. But most of the

    time we are interested in population mean and variance. These are the characteristics

which play an important role in making decisions. It is very important to note that the mean should always be accompanied by the variance. The mean measures the central tendency

    and variance measures the dispersion. To estimate these characteristics, we use the

    sample data gathered from the defined population. The sample is selected as the

    true representative of the population selected for the study. Note that care has to be

    taken while selecting the sample. Coming back to the estimation, we use the sample

    characteristics to estimate the population characteristics. Sample mean and variance

are used to estimate the population mean and variance. Two types of estimation have been formally studied by researchers: point estimates and interval

    estimates. A point estimate is the value of the statistic for a given sample. We

    use sample statistics as estimators to estimate the population parameters. These

    estimators are functions of the sample i.e., they produce different values for different


    samples. Each value is considered as the estimate of the parameter. Point estimates

obtained for different samples, put together, constitute the sampling distribution of the statistic. The usual understanding in estimation is that for sufficiently large samples

    these sample means, when plotted, produce a normal curve. This basic assumption

    is very important to construct an interval estimate. Another important aspect in

    point estimation is the associated sampling error of the statistic. When we obtain

the point estimate from a sample, it is equally important to obtain the sampling

    error or standard deviation of the statistic. This sampling error gives the amount of

    fluctuation that can be allowed below and above the estimate.

The purpose of any random sample is to estimate properties of a population from the data observed in the sample. The mathematical procedures appropriate

    for performing this estimation depend on which properties are of interest and which

    type of random sampling scheme is used. Note that the sampling scheme has to be

    selected appropriately for a given situation. The decision maker has to take care of

the assumptions made at the time of selecting the sampling scheme. This is very important because the assumptions of the mathematical model that will be used in the

    later stages should coincide with the assumptions made at the time of selecting the

sample. If this is not taken care of, then the results obtained may not be reliable. Along with this, another aspect that plays an important role is sampling error. Sampling error is the inevitable result of basing an inference on a random sample rather than on the entire population.

1.2 Key terms in estimation

1. Population: Group of objects or individuals that possess the assumed characteristics under study. This group can be finite or infinite.

2. Sample: Group of objects or individuals that possess the same characteristics

    as that of population, taken for enumeration and further analysis. This group

    is considered as the true representative of the entire population under study.


    3. Parameter: Unknown characteristics of the population under study such as

    population mean, median, mode, standard deviation etc.

    4. Statistic: Characteristics of the sample such as sample mean, median, mode,

    standard deviation etc.

    5. Estimator: Any statistic, which is a function of sample values, used to estimate

    a population parameter.

    6. Estimate: An estimate is a specific value of the estimator for a given sample.

    7. Point estimate: A point estimate is a numerical value, a best guess of a

    population parameter, based on the data in a sample.

    8. Estimation error: The estimation error is the difference between the point

    estimate and the true value of the population parameter being estimated.

9. Interval estimate: An interval estimate is an interval around the point estimate, calculated from the sample data, within which we strongly believe the true value of the population parameter lies.

10. Unbiased estimate: An unbiased estimate is a point estimate such that the mean of its sampling distribution is equal to the true value of the population

    parameter being estimated.

11. Efficiency: Another desirable property of a good estimator is that it be efficient. Efficiency refers to the size of the standard error of the statistic. If we compare two statistics computed from samples of the same size and try to decide which one is the more efficient estimator, we would pick the statistic that has the smaller standard error, or standard deviation of the sampling distribution.

12. Sufficiency: An estimator is sufficient if it makes so much use of the information in the sample that no other estimator could extract additional information from the sample about the population parameter being estimated.

    13. Consistency: A point estimator is said to be consistent if its value tends to

    become closer to the population parameter as the sample size increases.
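The efficiency property can be made concrete with a small simulation comparing the sample mean and the sample median as competing estimators of the centre of a normal population. This is an illustrative sketch, not from the text: the population values (mean 50, standard deviation 10), the sample size, and the repetition count are assumptions chosen for the demonstration.

```python
import random
import statistics

random.seed(42)

# Assumed illustration: a normal population with mean 50 and standard
# deviation 10. Draw many samples of size 30 and record two competing
# estimators of the population mean for each sample.
mu, sigma, n, reps = 50, 10, 30, 2000
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# The standard error of each estimator is the standard deviation of its
# sampling distribution, approximated here by simulation.
se_mean = statistics.stdev(means)
se_median = statistics.stdev(medians)

print(f"standard error of sample mean:   {se_mean:.3f}")
print(f"standard error of sample median: {se_median:.3f}")
```

For normal data the median's standard error comes out roughly 25% larger than the mean's, so by the definition of efficiency above the sample mean is the more efficient of the two estimators.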


    1.3 Determination of sample size

There are several ways to estimate an unknown characteristic of the population. In this compiled work we discuss only parametric estimation. Interested readers can look into standard books for other methods like non-parametric estimation, robust estimation, etc. In parametric estimation we mainly talk about population characteristics like the mean, variance/standard deviation, and proportion. We first discuss determination of sample size in detail and then proceed to estimation procedures. At an intermediate stage, i.e. after collecting the sample from the population under study, we look forward to understanding the behavior of the population through the characteristics estimated from the sample. Hence one has to note at this point that the sample taken plays an important role in studying the population. Now the question is what the sample size should be. This is an interesting question, which does not have a ready-made answer. It is an important step before the survey. Note that sampling error decreases as the sample size increases. So we use the fact that the larger the variance, the larger the sample size needed to achieve a given degree of accuracy.

    Determining the best sample size is not just a statistical decision. Statisticians

    can tell you how the standard error behaves as you increase or decrease the sample

    size, and the market researchers can tell you what the cost of taking more or larger

samples will be. But it is the decision maker who must use judgement to combine

    these two inputs to make a sound managerial decision.
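Sample size determination for a mean is developed in detail later in this chapter; as a preview, a standard formula chooses n so that the estimate falls within a margin of error E of the true mean at a stated confidence level: n = (zσ/E)². A minimal sketch, where the planning values σ = 10 and E = 2 are assumptions for illustration:

```python
import math

def sample_size_for_mean(sigma, margin_of_error, z=1.96):
    """Smallest n for which a z-based interval for the population mean
    has half-width at most margin_of_error (z = 1.96 gives 95% confidence)."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Planning values assumed for illustration: sigma = 10, margin of error 2.
print(sample_size_for_mean(10, 2))   # 97 observations
```

Note how the required n grows with the variance and shrinks as a larger margin of error is tolerated, which is exactly the trade-off the decision maker must weigh against survey cost.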

    1.4 Point estimator for population mean

Definition 1. Point estimator: A sample statistic that is calculated using sample data to estimate the most likely value of the corresponding unknown population parameter is termed a point estimator, and the numerical value of the estimator is termed a point estimate. A point estimate consists of a single sample statistic that is used to estimate the true value of a population parameter.


For example, the sample mean X̄ is a point estimator of the population mean μ, and the sample variance S² is a point estimator of the population variance σ². On

    many occasions estimating the population mean is useful in business research. For

example,

1. The manager of human resources in a company might want to estimate the

    average number of days of work an employee misses per year because of illness.

    If the firm has thousands of employees, direct calculation of a population mean

such as this may be practically impossible. Instead, a random sample of employees can be taken, and the sample mean number of sick days can be used to estimate the population mean.

    2. Suppose that another company developed a new process for prolonging the

    shelf life of a loaf of bread. The company wants to be able to date each loaf for

    freshness, but company officials do not know exactly how long the bread will

    stay fresh. By taking a random sample and determining the sample mean shelf

life, they can estimate the average shelf life for the population of bread.

3. As the cellular telephone industry matures, a cellular telephone company is rethinking its pricing structure. Users appear to be spending more time on

    the phone and are shopping around for the best deals. To do better planning,

    the cellular company wants to ascertain the average number of minutes of time

    used per month by each of its residential users but does not have the resources

    available to examine all monthly bills and extract the information. The company

    decides to take a random sample of customer bills and estimate the population

mean from sample data. A researcher for the company takes a random sample of 85 bills for a recent month and from these bills computes a sample mean

    of 510 min. This sample mean, which is a statistic, is used to estimate the

    population mean, which is a parameter. If the company uses the sample mean

    of 510 min as an estimate for the population mean, then the sample mean is

    used as a point estimate.


    4. A tire manufacturer developed a new tire designed to provide an increase in

mileage over the firm's current line of tires. To estimate the mean number of

    miles provided by the new tires, the manufacturer selected a sample of 120 new

    tires and observed a sample mean of 36,500 miles.

In all the above examples, note that the statistic (sample mean) is a function of the

    sample drawn from the population under study and the numerical value assumed by

    this statistic is an estimate of the population mean. (Observe the difference between

    an estimator and an estimate).

    1.4.1 Steps in obtaining an estimate of population mean

    1. Draw a sample from the population under study.

    2. Find the total of all the observations in the sample.

3. Divide the total by the number of observations.

    4. The resultant value is the sample mean, which is taken as the estimate of the

    population mean.
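The four steps can be sketched in a few lines of Python (the sick-day figures below are invented for illustration, in the spirit of the HR example above):

```python
def estimate_population_mean(sample):
    # Steps 2-4: total the observations, divide by their count,
    # and take the result as the estimate of the population mean.
    return sum(sample) / len(sample)

# Step 1: a (hypothetical) sample of sick days taken per year by 7 employees.
sick_days = [4, 7, 2, 9, 5, 3, 6]
print(estimate_population_mean(sick_days))  # about 5.14 days
```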

1.5 Point estimator for population variance

The estimation of the population variance is an important step in analyzing the sample drawn from the population under study. We use the sample variance to estimate the population variance. But the sample variance is not an unbiased estimator of the population variance, so we modify the formula used to calculate it. The formula to calculate the sample variance is given by

    σ̂² = s² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

In order to get an unbiased estimator, one has to change 1/n to 1/(n − 1). The resultant is called the mean square error, which gives an unbiased estimator of the population variance.

    1.5.1 Steps in calculating an estimate of population variance

    1. Calculate the mean of the sample drawn.


    2. Compute the deviation of all the observations from the mean.

    3. Square the deviations and obtain the total.

4. Divide the total obtained in step 3 by n − 1.
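These steps can be sketched as follows (the data values are invented for illustration):

```python
def estimate_population_variance(sample):
    n = len(sample)
    mean = sum(sample) / n                     # step 1: sample mean
    deviations = [x - mean for x in sample]    # step 2: deviations from the mean
    total = sum(d ** 2 for d in deviations)    # step 3: squared and totalled
    return total / (n - 1)                     # step 4: divide by n - 1

data = [12, 15, 11, 18, 14]
print(estimate_population_variance(data))  # 7.5
```

Dividing by n − 1 rather than n is what makes the result an unbiased estimate of the population variance.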

    Note 1. Note that, the above formulae to calculate mean and variance are used when

individual observations are taken. If one is using a frequency distribution, then the frequencies have to be included when calculating the mean and variance.

    1.6 Role of sampling

    In order to understand the population characteristics like mean, variance etc. it is

    very important to draw a sample which is a true representative of the population.

A proper sampling design has to be adopted before drawing a sample. A sampling frame should be constructed and then checked against the population. Care should be taken

    to decrease the non-response rate. It should be noted that a random sample will

    better estimate the population parameters than a non-random sample. In order to

get a better estimate, it is also important to ensure that the sample is free of any sort of

    bias. The questionnaire framed to collect the responses should be tested using a pilot

survey before the actual survey. One has to note that the pilot survey has to be framed in such a way that it resembles the actual survey and gives insights about the resources needed to conduct the actual survey. An interesting point is

that a smaller sample that is a true representative of the population gives more satisfactory results than a larger sample that is not. Another interesting aspect of sampling is that the belief that larger populations need larger samples is not always valid. The sample should be taken depending on the situation and the objectives.

    1.7 Sampling distribution of a Statistic

    Sampling distribution is the underlying probability distribution of the statistic used

    for the study. This is constructed by taking several samples from the population.

  • 7/27/2019 Fundamentals of Statistical Inference

    17/101

    Estimation 13

    For example, a sampling distribution of the sample mean is constructed by taking
    as many samples as possible from the population and calculating the sample mean
    for each of them. The set of all these values constitutes the sampling
    distribution of the sample mean. Theoretically, it has been shown that the
    sampling distribution of the mean is either normal (by the central limit
    theorem, for finite known variance and larger sample sizes) or a t-distribution
    (when the assumption of normality is satisfied, for small sample sizes). When
    the assumption of normality is not satisfied, the sampling distribution of the
    sample mean can still be approximated by the normal law, using the central limit
    theorem, for sufficiently large sample sizes. The sampling distribution of the
    sample variance, or mean square error, is the chi-square distribution (discussed
    in detail in chapter 4).
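The idea of a sampling distribution can be illustrated by simulation. The population below is hypothetical (a skewed, exponential-like variable), and the sample size and number of replications are arbitrary choices for the sketch.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 100,000 skewed (exponential-like) values, mean about 20.
population = [random.expovariate(1 / 20) for _ in range(100_000)]

# Draw many samples of size 50 and record each sample mean.
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(2_000)]

# The mean of the sample means sits near the population mean, and their
# spread is near sigma / sqrt(n), as the central limit theorem predicts.
print(statistics.mean(population))
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```

Even though the population is skewed, a histogram of `sample_means` would look approximately normal, which is the central limit theorem at work.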

    1.8 Sampling error

    After drawing a sample, it is important to study the sampling error. For this, the

    decision maker has to find the standard error of the estimator used to estimate the

    population parameter. The question is what is the relation between the standard error

    and sampling error? Note that, the sample is drawn to understand the behaviour of

    the population characteristics (like mean, median etc.) and are studied using their

    estimators from a sample. Obviously if the sampling error is more, then it will be

    reflected in the standard error of the estimator. Also note that the reciprocal
    of the standard error gives the precision of the estimator. This is because it
    is expected that the absolute difference between the true population
    characteristic and the sample estimator is less than ε, where ε depends on the
    standard error. Refer to the section on determination of sample size to
    understand this better.

    Sampling variation is the price we pay for working with a sample rather than the

    population.
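As a small illustration of the relation between the standard error and precision described above, the following sketch computes both from a hypothetical sample; the data values are invented.

```python
import math
import statistics

sample = [34, 36, 38, 35, 37, 36, 33, 39, 36, 35]  # hypothetical battery lives (months)

n = len(sample)
s = statistics.stdev(sample)        # sample standard deviation
standard_error = s / math.sqrt(n)   # standard error of the sample mean
precision = 1 / standard_error      # reciprocal of the standard error, as in the text

print(round(standard_error, 4), round(precision, 4))
```

A smaller standard error means a larger precision value, i.e. a tighter estimate of the population mean.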

  • 7/27/2019 Fundamentals of Statistical Inference

    18/101

    Estimation 14

    1.9 Point estimator for a population proportion

    When the underlying variable is a qualitative variable, one is interested in studying

    the proportion of individuals who satisfy a particular attribute. For example, the

    sales manager may be interested in studying the proportion of individuals who give

    more importance to quality than to cost. Here, he may confine himself to the
    customers who are regular in purchasing from his store. For this properly
    defined population, the parameter is the proportion (denoted by P), and the
    sample proportion (denoted by p) is its unbiased estimator. To calculate the
    sample proportion, one has to define the random variable under study properly.
    Then, count the individuals who satisfy the attribute (denote this count by X)
    and take the ratio of X to n, the sample size, to get the estimate. Note that
    the sampling distribution of the sample proportion can be approximated by the
    normal distribution, but the exact probability distribution used to model the
    number of individuals who fall under a particular category is the binomial
    distribution.
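The estimate p = X/n can be computed as follows; the survey responses below are hypothetical.

```python
# Hypothetical survey responses: 1 = customer values quality over cost, 0 = otherwise.
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]

x = sum(responses)   # X, the number of individuals with the attribute
n = len(responses)   # n, the sample size
p_hat = x / n        # point estimate of the population proportion P

print(x, n, p_hat)   # 14 20 0.7
```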

    1.10 Finding the best estimator

    A given sample statistic is not always the best estimator of its analogous population

    parameter. Consider a symmetrically distributed population in which the values of

    the median and the mean coincide. In this instance, the sample mean would be

    an unbiased estimator of the population median. Also, the sample mean would be a

    consistent estimator of the population median because, as the sample size increases,

    the value of the sample mean would tend to come very close to the population median.

    And the sample mean would be a more efficient estimator of the population median
    than the sample median itself because, in large samples, the sample mean has a smaller

    standard error than the sample median. At the same time, the sample median in a

    symmetrically distributed population would be an unbiased and consistent estimator

    of the population mean but not the most efficient estimator because in large samples,

    its standard error is larger than that of the sample mean.

  • 7/27/2019 Fundamentals of Statistical Inference

    19/101

    Estimation 15

    1.11 Drawback of point estimate

    The drawback of a point estimate is that no information is available regarding its re-

    liability, i.e., how close it is to its true population parameter. In fact, the probability

    that a single sample statistic actually equals the population parameter is extremely

    small. For this reason, point estimates are rarely used alone to estimate population

    parameters. It is better to offer a range of values within which the population
    parameters are expected to fall so that the reliability (probability) of the
    estimate can be measured. This is the purpose of interval estimation.

    1.12 Interval estimation

    In most cases, a point estimate does not provide information about how close
    the estimate is to the population parameter unless accompanied by a statement of

    possible sampling error involved based on the sampling distribution of the statistic.

    It is therefore important to know the precision of an estimate before depending on

    it to make a decision. Thus, decision-makers prefer to use an interval estimate
    (i.e. a range of values defined around a sample statistic) that is likely to
    contain the population parameter value.

    Interval estimation is a rule for calculating two numerical values, say L and U,
    that create an interval [L, U] intended to contain the population parameter of
    interest; the probability attached to this interval is called the confidence
    coefficient and is denoted by (1 − α). However,

    it is also important to state how confident one should be that the interval estimate

    contains the parameter value. Hence an interval estimate of a population parameter

    is a confidence interval with a statement of confidence (probability) that the inter-

    val contains the parameter value. In other words, confidence interval estimation is

    an interval of values computed from sample data that is likely to contain the true

    population parameter value.

    Suppose the marketing research director needs an estimate of the average life in


    months of car batteries his company manufactures. We select a random sample of

    200 batteries, record the car owners' names and addresses as listed in store records,

    and interview these owners about the battery life they have experienced. Our sample

    of 200 users has a mean battery life of 36 months. If we use the point estimate of the

    sample mean as the best estimator of the population mean μ, we would report that
    the mean life of the company's batteries is 36 months. But the director also asks for a

    statement about the uncertainty that will be likely to accompany this estimate, that

    is, a statement about the range within which the unknown population mean is likely

    to lie. To provide such a statement, we need to find the standard error of the mean.

    The general form of an interval estimate is as follows:

    Point estimate ± Margin of error

    The purpose of an interval estimate is to provide information about how close the
    point estimate is to the value of the population parameter. The general form of an
    interval estimate of a population mean is

    X̄ ± Margin of error

    The general form of an interval estimate of a population proportion is

    p ± Margin of error

    The sampling distributions of X̄ and p play key roles in computing these interval
    estimates.

    1.13 Probability of the true population parameter

    falling within the interval estimate

    To begin to solve this problem, we should review the relevant concepts of the
    normal probability distribution. We learned that specific portions of the area

    under the normal curve are located between plus and minus any given number of

    standard deviations from the mean. Fortunately, we can apply these properties to

    the standard error of the mean and make the statement about range of values used to


    make an interval estimate. Note that if we select and plot a large number of sample

    means from a population, the distribution of these means will approximate a normal

    curve. Furthermore, the mean of the sample means will be the same as the population

    mean. Our sample size of 200 (in battery example) is large enough that we can apply

    the central limit theorem. To measure the spread, or dispersion, in our distribution

    of sample means, we can use the following formula and calculate the standard error

    of the mean:

    Standard error of the mean for an infinite population:

    σ_X̄ = σ/√n

    where σ is the standard deviation of the population.

    Suppose we have already estimated the standard deviation of the population of the

    batteries and reported that it is 10 months. Using this standard deviation, we can

    calculate the standard error of the mean:

    σ_X̄ = σ/√n = 10/√200 = 0.707 month

    We could now report to the director that our estimate of the life of the company's

    batteries is 36 months, and the standard error that accompanies this estimate is

    0.707. In other words, the actual mean life for all the batteries may lie somewhere

    in the interval estimate of 35.293 to 36.707 months. This is helpful but insufficient

    information for the director. Next we need to calculate the chance that the actual life

    will lie in this interval or in other intervals of different widths that we might choose,

    such as ±2σ_X̄ (2 × 0.707), ±3σ_X̄ (3 × 0.707), and so on.

    The probability is 0.955 that the mean of a sample of size 200 will be within
    ±2 standard errors of the population mean. Stated differently, 95.5 percent of
    all the sample means are within ±2 standard errors of μ, and hence μ is within
    ±2 standard errors of 95.5 percent of all the sample means. Theoretically, if we select

    1,000 samples at random from a given population and then construct an interval of


    ±2 standard errors around the mean of each of these samples, about 955 of these
    intervals will include the population mean. Similarly, the probability is 0.683
    that the mean of the sample will be within ±1 standard error of the population
    mean, and so forth. This theoretical concept is basic to the study of interval
    construction and statistical inference. Applying this to the battery example, we
    can now report to the director that our best estimate of the life of the
    company's batteries is 36 months, and we are 68.3 percent confident that the
    life lies in the interval from 35.293 to 36.707 months (36 ± 1σ_X̄). Similarly,
    we are 95.5 percent confident that the life falls within the interval of 34.586
    to 37.414 months (36 ± 2σ_X̄), and we are 99.7 percent confident that battery
    life falls within the interval of 33.879 to 38.121 months (36 ± 3σ_X̄).
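The three intervals quoted above (68.3, 95.5, and 99.7 percent) can be reproduced directly from the standard error σ_X̄ = σ/√n, using the battery example's figures:

```python
import math

x_bar = 36     # sample mean battery life (months)
sigma = 10     # population standard deviation (months)
n = 200        # sample size

se = sigma / math.sqrt(n)   # standard error of the mean, about 0.707

# Intervals at 1, 2, and 3 standard errors around the sample mean.
for k, confidence in [(1, "68.3%"), (2, "95.5%"), (3, "99.7%")]:
    lower, upper = x_bar - k * se, x_bar + k * se
    print(f"{confidence}: {lower:.3f} to {upper:.3f} months")
```

Running this reproduces the 35.293–36.707, 34.586–37.414, and 33.879–38.121 month intervals given in the text.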

    1.14 Interval estimates and confidence intervals

    In using interval estimates, we are not confined to ±1, ±2, and ±3 standard
    errors. For example, ±1.64 standard errors include about 90 percent of the area
    under the curve; this is 0.4495 of the area on either side of the mean in a
    normal distribution. Similarly, ±2.58 standard errors include 99 percent of the
    area, or 49.51 percent on each side of the mean.

    In statistics, the probability that we associated with an interval estimate is called

    the confidence level. This probability indicates how confident we are that the interval

    estimate will include the population parameter. A higher probability means more

    confidence. In estimation, the most commonly used confidence levels are 90 percent,

    95 percent, and 99 percent, but we are free to apply any confidence level.

    The confidence interval is the range of the estimate we are making. If we report

    that we are 90 percent confident that the mean of the population of incomes of people

    in a certain community will lie between Rs. 8,000 and Rs. 24,000, then the range
    Rs. 8,000 to Rs. 24,000 is our confidence interval. Often, however, we will
    express the confidence interval in


    standard errors rather than in numerical values. Thus, we will often express
    confidence intervals like this: X̄ ± 1.64σ_X̄, where

    X̄ + 1.64σ_X̄ = Upper limit of the confidence interval

    X̄ − 1.64σ_X̄ = Lower limit of the confidence interval

    Thus, confidence limits are the upper and lower limits of the confidence
    interval. In this case, X̄ + 1.64σ_X̄ is called the upper confidence limit (UCL)
    and X̄ − 1.64σ_X̄ is the lower confidence limit (LCL).
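The areas quoted above, and the UCL/LCL computation, can be checked with Python's standard-library `statistics.NormalDist`; the income figures below are hypothetical, chosen only to illustrate the arithmetic.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Areas quoted in the text: about 90% within ±1.64 and 99% within ±2.58.
print(round(z.cdf(1.64) - z.cdf(-1.64), 4))   # about 0.90
print(round(z.cdf(2.58) - z.cdf(-2.58), 4))   # about 0.99

# 90% confidence limits X̄ ± 1.64 σ_X̄ for a hypothetical income sample.
x_bar, se = 16_000, 2_000
ucl = x_bar + 1.64 * se   # upper confidence limit (UCL)
lcl = x_bar - 1.64 * se   # lower confidence limit (LCL)
print(lcl, ucl)
```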

    1.15 Relationship between confidence level and

    confidence interval

    You may think that we should use a high confidence level, such as 99%, in all

    estimation problems. After all, a high confidence level seems to signify a high degree

    of accuracy in the estimate. In practice, however, high confidence levels will produce

    large confidence intervals and such large intervals are not precise; they give very fuzzy

    estimates.

    1.16 Using sampling and confidence interval

    estimation

    We described samples being drawn repeatedly from a given population in order

    to estimate a population parameter. We also mentioned selecting a large number of

    sample means from a population. In practice, however, it is often difficult or expensive

    to take more than one sample from a population. Based on just one sample, we

    estimate the population parameter. We must be careful, then, about interpreting the

    results of such a process.


    Suppose we calculate from one sample in our battery example the following con-

    fidence interval and confidence level: We are 95 percent confident that the mean

    battery life of the population lies between 30 and 42 months. This statement does not

    mean that the chance is 0.95 that the mean life of all our batteries falls within the

    interval established from this one sample. Instead, it means that if we select many

    random samples of the same size and calculate a confidence interval for each of these

    samples, then in about 95 percent of these cases, the population mean will lie within

    that interval.

    1.17 Interval estimation of population mean

    (σ known)

    In order to develop an interval estimate of a population mean, either the population

    standard deviation or the sample standard deviation must be used to compute the

    margin of error. Although rarely known exactly, historical data or other information

    available in some applications permit us to obtain a good estimate of the population

    standard deviation prior to sampling. In such cases, population standard deviation

    can, for all practical purposes, be considered known. We refer to such cases as
    the σ known case.

    1.18 Using the Z statistic for estimating popula-

    tion mean

    Note that a complete census is neither feasible nor practical. In order to

    draw an inference about the population, a researcher has to take a sample and has

    to apply statistical techniques to estimate population parameter on the basis of the

    sample statistics. For example, a researcher can use two methods to find out the rate


    of absenteeism in a manufacturing company with 500,000 employees. The first method

    is to go in for a census and calculate the rate of absenteeism based on information from

    all the 500,000 employees. This would be extremely difficult in terms of execution

    and would be time-consuming and costly. Instead of this, a researcher can take a

    sample of any size (keeping in mind the definition of small- and large-sized samples)

    and can make an estimate based on the information obtained from the sample. The

    possibility of committing non-sampling errors will also be minimized if this method

    is used. We need to develop a statistical tool that provides a good estimate of the

    population parameter on the basis of the sample statistic. The Z statistic can be

    used for estimating the population parameter on the basis of the sample statistic.

    According to the central limit theorem, the sample means for sufficiently large
    samples (n ≥ 30) are approximately normally distributed, regardless of the shape
    of the population distribution. For a normally distributed population, sample means

    are normally distributed for any size of the sample.

    Suppose the population mean μ is unknown and the true population standard
    deviation σ is known. Then for a large sample size (n ≥ 30), the sample mean X̄
    is the best point estimator for the population mean μ. Since the sampling
    distribution is approximately normal, it can be used to compute a confidence
    interval for the population mean as follows:

    X̄ ± Z_{α/2} σ/√n, or

    X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n,

    where Z_{α/2} is the Z-value representing an area α/2 in the right tail of the
    standard normal probability distribution, and (1 − α) is the level of confidence.

    Alternative approach:

    A (1 − α)100% large sample confidence interval for a population mean can also be


    found by using the statistic

    Z = (X̄ − μ) / (σ/√n)

    which has a standard normal distribution (i.e., Z ~ N(0, 1)). This formula can
    be rearranged algebraically for the population mean:

    μ = X̄ ± Z σ/√n

    The sample mean can be greater than or less than the population mean; hence, the
    formula takes the ± form. Here α is the area under the normal curve which is
    outside the confidence interval and is located in the tails of the normal curve.
    Confidence

    interval is the range within which we can say with some confidence that the population

    mean is located. We can say with some confidence, however, we are not absolutely

    sure that the population mean is within the confidence interval. In order to be 100%

    sure that the population mean is within the confidence interval, the confidence level

    should be 100%, that is, indefinitely wide, which would be meaningless. We use

    the concept of probability in order to define some certainty. We can assign some

    probability that the population mean is located within the confidence interval.
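A sketch of the (1 − α)100% interval X̄ ± Z_{α/2} σ/√n, using the standard library's `statistics.NormalDist.inv_cdf` for the Z-value; it is applied here to the battery example's figures at 95% confidence.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """(1 - alpha)100% interval X-bar +/- Z_{alpha/2} * sigma / sqrt(n), sigma known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z-value with alpha/2 in the right tail
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Battery example at 95% confidence: X-bar = 36, sigma = 10, n = 200.
lo, hi = mean_confidence_interval(36, 10, 200, 0.95)
print(round(lo, 3), round(hi, 3))   # 34.614 37.386
```

Raising the confidence level widens the interval, which is the trade-off discussed in the section on confidence level versus confidence interval.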

    If Z_{α/2} is the Z-value with an area α/2 in the right tail of the normal
    curve, then we can write

    P(Z > Z_{α/2}) = α/2.

    It is the logical opposite of the null hypothesis. In other words, when the null

    hypothesis is found to be true, the alternative hypothesis must be false or when

    null hypothesis is found to be false, the alternative hypothesis must be true. The

    alternative hypothesis represents the conclusion reached by rejecting the null

    hypothesis if there is sufficient evidence from the sample information to decide

    that the null hypothesis is unlikely to be true. Hypothesis-testing methodology

    is designed so that the rejection of the null hypothesis is based on evidence from

    the sample that the alternative hypothesis is far more likely to be true. However,

    failure to reject the null hypothesis is not proof that it is true. One can never

    prove that the null hypothesis is correct because the decision is based only on the

    sample information, not on the entire population. Therefore, if you fail to reject

    the null hypothesis, you can only conclude that there is insufficient evidence to

    warrant its rejection. A summary of the null and alternative hypotheses is

    presented below:

    The Null and alternative hypothesis:

    (a) The null hypothesis H0 represents the status quo or the current belief in

    a situation.

    (b) The alternative hypothesis H1 is the opposite of the null hypothesis and

    represents a research claim or specific inference you would like to prove.

    (c) If you reject the null hypothesis, you have statistical proof that the alter-

    native hypothesis is correct.

    (d) If you do not reject the null hypothesis, then you have failed to prove the

    alternative hypothesis. The failure to prove the alternative, however, does

    not mean that you have proven null hypothesis.

  • 7/27/2019 Fundamentals of Statistical Inference

    56/101

    Testing of Hypothesis 52

    (e) The null hypothesis H0 always refers to a specified value of the population

    parameter (such as μ), not a sample statistic (such as X̄).

    (f) The statement of the null hypothesis always contains an equal sign
    regarding the specified value of the population parameter (e.g. H0 : μ =
    368 grams).

    (g) The statement of the alternative hypothesis never contains an equal sign

    regarding the specified value of the population parameter (e.g. H1 : μ ≠ 368 grams).

    Each of the following statements is an example of a null hypothesis and alter-

    native hypothesis:

    H0 : μ = μ0    H1 : μ ≠ μ0

    H0 : μ ≤ μ0    H1 : μ > μ0

    H0 : μ ≥ μ0    H1 : μ < μ0

    (I) Directional hypothesis

    (a) H0: There is no difference between the average pulse rates of men and

    women.

    H1 : Men have lower average pulse rates than women do.

    (b) H0 : There is no relationship between exercise intensity and the re-

    sulting aerobic benefit.

    H1 : Increasing exercise intensity increases the resulting aerobic benefit.

    (c) H0 : The defendant is innocent.

    H1 : The defendant is guilty.

    (II) Non-directional hypothesis

    (a) H0 : Men and women have same verbal abilities.

    H1 : Men and women have different verbal abilities.


    (b) H0: The average monthly salary for management graduates with 4 years'
    experience is Rs. 75,000.

    H1 : The average monthly salary is not Rs. 75,000.

    (c) H0 : Older workers are more loyal to a company.

    H1 : Older workers may not be loyal to a company.

    3. Determine the appropriate statistical test:

    After setting the hypothesis, the researcher has to decide on an appropriate sta-

    tistical test that will be used for statistical analysis. The tests of significance or

    test statistic are classified into two categories: parametric and non-parametric

    tests. Parametric tests are more powerful because their data are derived from

    interval and ratio measurements. Nonparametric tests are used to test hypothe-

    ses with nominal and ordinal data. Parametric techniques are the tests of choice

    provided certain assumptions are met. Assumptions for parametric tests are as

    follows:

    i. The selection of any element (or member) from the population should not affect the chance for any other to be included in the sample to be drawn

    from the population.

    ii. The samples should be drawn from normally distributed population.

    iii. Populations under study should have equal variances.

    Non-parametric tests have few assumptions and do not specify normally dis-

    tributed populations or homogeneity of variance.

    Selection of a test:

    For choosing a particular test of significance following three factors are consid-

    ered:

    a. Whether the test involves one sample, two samples or k samples?

    b. Whether samples used are independent or related?

    c. Is the measurement scale nominal, ordinal, interval, or ratio?


    Further, it is also important to know: (i) the sample size, (ii) the number of
    samples and their sizes, and (iii) whether the data have been weighted. Such
    questions help

    in selecting an appropriate test statistic. One sample tests are used for single

    sample and to test the hypothesis that it comes from a specified population.

    The following questions need to be answered before using one sample tests:

    a. Is there a difference between observed frequencies and the expected fre-

    quencies based on a statistical theory?

    b. Is there difference between observed and expected proportions?

    c. Is it reasonable to conclude that a sample is drawn from a population with

    some specific distribution (normal, Poisson, and so on)?

    d. Is there significant difference between some measures of central tendency

    and its population parameter?

    The value of the test statistic is calculated from the distribution of the
    sample statistic by using the following formula:

    Test Statistic = (Value of sample statistic − Value of hypothesized population
    parameter) / (Standard error of the sample statistic)

    The choice of a probability distribution for the sample statistic is guided by
    the sample size n and the population standard deviation σ, as shown below:

    Sample size    σ known                σ unknown
    n > 30         Normal distribution    Normal distribution
    n ≤ 30         Normal distribution    t-distribution

    4. Level of significance: This is admissible level of error at which we test the null

    hypothesis. The level of significance, generally denoted by α, is the probability,


    which is attached to a null hypothesis, which may be rejected even when it is

    true. The level of significance is also known as the size of the rejection region

    or the size of the critical region. It is very important to note that the level

    of significance must be determined before drawing the samples, so that the

    obtained result is free from the choice bias of the decision maker. The levels of

    significance which are generally applied by researchers are 0.01, 0.05, 0.10. It

    is specified in terms of the probability of null hypothesis H0 being wrong. In

    other words, the level of significance defines the likelihood of rejecting a null

    hypothesis when it is true, i.e. it is the risk a decision maker takes of rejecting

    the null hypothesis when it is really true. The guide provided by the statistical

    theory is that this probability must be small.

    5. Test statistic: This is constructed using the statistic used to estimate the
    population parameter on which the hypothesis is being tested. The value of the
    test statistic decides whether or not to reject the null hypothesis.

    6. Critical value: After constructing the test statistic, we need to obtain the critical

    value. This critical value divides the entire region into critical and non-critical

    region.

    7. Conclusion: At this stage, the calculated value of the test statistic is
    compared with the critical value and a conclusion is drawn accordingly. In
    recent times, the p-value approach has become prominent; these two methods will
    be discussed in detail in the next section.

    8. Power of the test: This decides the strength of the test in correctly
    rejecting the null hypothesis. Its calculation will be discussed for each test
    separately using

    an example.

    2.5 One tail and two tail tests

    The form of the alternative hypothesis can be either one-tailed or two-tailed, depend-

    ing on what the analyst is trying to prove.


    2.5.1 One tailed test

    One tailed tests are further classified as right tailed and left tailed tests. Alternative

    hypothesis decides whether a test is right tailed or a left tailed. If the alternative

    hypothesis is of the type '>', the test is classified as a right tailed test,
    and if the alternative hypothesis is of the type '<', the test is classified as
    a left tailed test. Note that the '=' sign should always be in the null
    hypothesis (let us accept this). This is

    because, the test statistic is calculated under the assumption that the null hypothesis

    is true.

    2.5.2 Two tailed test

    When the alternative hypothesis is of the type '≠', the test is classified as a
    two tailed test.

    2.6 Critical region and non-critical region

    The sampling distribution of the test statistic is divided into two regions, a
    region of rejection (sometimes called the critical region) and a region of
    non-rejection. If the test statistic falls into the region of non-rejection, you
    do not reject the null hypothesis. If the test statistic falls into the
    rejection region, you reject the null hypothesis.

    The region of rejection consists of the values of the test statistic that are
    unlikely to occur if the null hypothesis is true. These values are much more
    likely to occur if the null hypothesis is false. Therefore, if a value of the
    test statistic falls into this rejection region, you reject the null hypothesis
    because that value is unlikely if the null hypothesis is true. To make a
    decision concerning the null hypothesis, you first determine the critical value
    of the test statistic. The critical value divides the non-rejection region from
    the rejection region. Determining the critical value depends on the size of the
    rejection region. The size of the rejection region is directly related to

    the risks involved in using only sample evidence to make decisions about a population

    parameter.


    2.7 Errors in hypothesis testing

    A Type I error occurs if you reject the null hypothesis, H0, when it is true and should

    not be rejected. A Type I error is a false alarm. The probability of a Type I
    error occurring is α.

    A Type II error occurs if you do not reject the null hypothesis, H0, when it is

    false and should be rejected. A Type II error represents a missed opportunity to take

    some corrective action. The probability of a Type II error occurring is β.

    Whenever we reject a null hypothesis, there is a chance that we have made a

    mistake i.e., that we have rejected a true statement. Rejecting a true null hypothesis

    is referred to as a Type I error, and our probability of making such an error is

    represented by the Greek letter alpha (α). This probability, which is referred to as

    the significance level of the test, is of primary concern in hypothesis testing.

    On the other hand, we can also make the mistake of failing to reject a false null

    hypothesis; this is a Type II error. Our probability of making it is represented
    by the Greek letter beta (β). Naturally, if we either fail to reject a true null
    hypothesis or

    reject a false null hypothesis, we have acted correctly. The probability of rejecting

    a false null hypothesis is called the power of the test. The four possibilities are shown

    in Table.

                                 Actual Situation
    Statistical decision   H0 true                                  H0 false
    Do not reject H0       Correct decision, Confidence = (1 − α)   Type II error, P(Type II error) = β
    Reject H0              Type I error, P(Type I error) = α        Correct decision, Power = (1 − β)

    In hypothesis testing, there is a necessary trade-off between Type I and Type II

    errors: For a given sample size, reducing the probability of a Type I error increases the


    probability of a Type II error, and vice versa. The only sure way to avoid accepting

    false claims is to never accept any claims. Likewise, the only sure way to avoid

    rejecting true claims is to never reject any claims. Of course, each of these extreme

    approaches is impractical, and we must usually compromise by accepting a reasonable

    risk of committing either type of error.

    Complements of Type-I and Type-II Errors

The confidence coefficient, 1 − α, is the probability that you will not reject the null hypothesis when it is true and should not be rejected.

The power of a statistical test, 1 − β, is the probability that you will reject the null hypothesis when it is false and should be rejected.


    2.8 Test for single mean

In this section, we discuss the two tests that are most common in testing a hypothesis built on the population mean μ. The first is the Z test and the second the t test. We discuss these two tests in detail using appropriate examples. The selection of the test depends on the sample size of the study and on whether the population standard deviation is known or unknown.

    Assumptions

1. The variable under study is measured on an interval or ratio scale.

    2. The population follows a normal distribution.

    3. Population variance σ²: known (Z-test), unknown (t-test).

    4. Responses are independent within the samples.

2.8.1 Z-test for single mean: σ known case

    The procedure to use a Z-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≥ μ0, μ ≤ μ0).

    2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ < μ0, μ > μ0).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    Two tailed test:

    Z = |X̄ − μ0| / (σ/√n) ~ N(0, 1)

    One tailed test:

    Z = (X̄ − μ0) / (σ/√n) ~ N(0, 1)

    5. Comparison and Conclusion.
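As a sketch, the Z-test steps above can be carried out with the Python standard library (statistics.NormalDist supplies the standard normal CDF). The sample figures below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_test_single_mean(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z test when the population s.d. sigma is known."""
    z = (xbar - mu0) / (sigma / sqrt(n))          # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-tailed p-value
    decision = "Reject H0" if p < alpha else "Do not reject H0"
    return z, p, decision

# Hypothetical example: sample mean 52 from n = 25, H0: mu = 50, sigma = 5
z, p, decision = z_test_single_mean(52, 50, 5, 25)
```

Here z = 2.0, so the p-value is about 0.046 and H0 is rejected at α = 0.05.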


    2.8.2 Testing Using Excel

A1  Null Hypothesis: μ = μ0

    A2  Level of Significance (α): 0.05

    A3  Population Standard Deviation: σ

    A4  Sample Size: n

    A5  Sample Mean: X̄

    A6  Intermediate Calculations

    A7  Standard Error of the Mean: σ/√n = A3/SQRT(A4)

    A8  Z test Statistic: Z = (X̄ − μ0)/(σ/√n) = (A5 − A1)/A7

    A9  Two Tailed Test: Alternative Hypothesis H1: μ ≠ μ0

    A10 Lower Critical Value: =NORM.S.INV(A2/2)

    A11 Upper Critical Value: =NORM.S.INV(1-A2/2)

    A12 p-Value: =2*(1-NORM.S.DIST(ABS(A8), TRUE))

    A13 Left Tailed Test: Alternative Hypothesis H1: μ < μ0

    A14 Lower Critical Value: =NORM.S.INV(A2)

    A15 p-Value: =NORM.S.DIST(A8, TRUE)

    A16 Right Tailed Test: Alternative Hypothesis H1: μ > μ0

    A17 Upper Critical Value: =NORM.S.INV(1-A2)

    A18 p-Value: =1-NORM.S.DIST(A8, TRUE)

    A19 Conclusion

    A20 Reject or Do not reject H0: =IF(A12 < A2, "Reject H0", "Do not reject H0")


2.8.3 t-test for single mean: σ unknown case

The procedure to use a t-test is as follows:

1. Null hypothesis: H0 : μ = μ0 (or μ ≥ μ0, μ ≤ μ0).

    2. Alternative hypothesis: H1 : μ ≠ μ0 (or μ < μ0, μ > μ0).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    Two tailed test:

    t = |X̄ − μ0| / (S/√n) ~ t with (n − 1) d.f.

    One tailed test:

    t = (X̄ − μ0) / (S/√n) ~ t with (n − 1) d.f.

    where

    S = √( Σᵢ(Xi − X̄)² / (n − 1) )

    5. Comparison and Conclusion.
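A minimal sketch of the t statistic computation using the Python standard library (statistics.stdev already divides by n − 1, matching S above). The data and the table value 2.365 (two-tailed t for α = 0.05 with 7 d.f.) are illustrative:

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic_single_mean(sample, mu0):
    """One-sample t statistic with (n - 1) degrees of freedom."""
    n = len(sample)
    xbar = mean(sample)              # X-bar
    s = stdev(sample)                # sample s.d. with (n - 1) in the denominator
    return (xbar - mu0) / (s / sqrt(n)), n - 1

data = [48, 50, 53, 51, 49, 52, 50, 51]   # hypothetical sample
t, df = t_statistic_single_mean(data, 50)
# Compare |t| with the t-table value, e.g. 2.365 for alpha = 0.05, df = 7
```

Here t is about 0.88, well below 2.365, so H0 would not be rejected.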


    2.8.4 Testing Using Excel

A1  Null Hypothesis: μ = μ0

    A2  Level of Significance (α): 0.05

    A3  Sample Standard Deviation: S

    A4  Sample Size: n

    A5  Degrees of Freedom (d.f.): n − 1

    A6  Sample Mean: X̄

    A7  Intermediate Calculations

    A8  Standard Error of the Mean: S/√n = A3/SQRT(A4)

    A9  t test Statistic: t = (X̄ − μ0)/(S/√n) = (A6 − A1)/A8

    A10 Two Tailed Test: Alternative Hypothesis H1: μ ≠ μ0

    A11 Lower Critical Value: =T.INV(A2/2, A5)

    A12 Upper Critical Value: =T.INV(1-A2/2, A5)

    A13 p-Value: =2*(1-T.DIST(ABS(A9), A5, TRUE))

    A14 Left Tailed Test: Alternative Hypothesis H1: μ < μ0

    A15 Lower Critical Value: =T.INV(A2, A5)

    A16 p-Value: =T.DIST(A9, A5, TRUE)

    A17 Right Tailed Test: Alternative Hypothesis H1: μ > μ0

    A18 Upper Critical Value: =T.INV(1-A2, A5)

    A19 p-Value: =1-T.DIST(A9, A5, TRUE)

    A20 Conclusion

    A21 Reject or Do not reject H0: =IF(A13 < A2, "Reject H0", "Do not reject H0")


    2.9 Test for single proportion

In this section, we discuss the procedure used to test the significance of a single proportion.

    Assumptions

1. The population follows a normal distribution.

    2. The conditions np ≥ 5 and n(1 − p) ≥ 5 are satisfied. These conditions are necessary to approximate the sampling distribution of the statistic by the normal law.

    Steps in using the test

1. Null hypothesis: H0 : P = P0 (or P ≥ P0, P ≤ P0).

    2. Alternative hypothesis: H1 : P ≠ P0 (or P < P0, P > P0).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    Two tailed test:

    Z = |p − P0| / √(P0(1 − P0)/n) ~ N(0, 1)

    One tailed test:

    Z = (p − P0) / √(P0(1 − P0)/n) ~ N(0, 1)

    where p = X/n is the sample proportion.

    5. Comparison and Conclusion.
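The single-proportion test above can be sketched with the standard library; the counts used are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x, n, p0, alpha=0.05):
    """Two-tailed Z test for H0: P = P0 (requires n*p0 >= 5 and n*(1-p0) >= 5)."""
    p_hat = x / n                                        # sample proportion
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)           # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "Reject H0" if p_value < alpha else "Do not reject H0"
    return z, p_value, decision

# Hypothetical example: 58 successes in n = 100 trials, H0: P = 0.5
z, p_value, decision = z_test_proportion(58, 100, 0.5)
```

Here z = 1.6 and the p-value is about 0.11, so H0 is not rejected at α = 0.05.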


    2.9.1 Testing Using Excel

A1  Null Hypothesis: P = P0

    A2  Level of Significance (α): 0.05

    A3  Number of Items of Interest: X

    A4  Sample Size: n

    A5  Intermediate Calculations

    A6  Sample Proportion: p = X/n = A3/A4

    A7  Standard Error: √(P0(1 − P0)/n) = SQRT((A1*(1-A1))/A4)

    A8  Z test Statistic: Z = (p − P0)/√(P0(1 − P0)/n) = (A6 − A1)/A7

    A9  Two Tailed Test: Alternative Hypothesis H1: P ≠ P0

    A10 Lower Critical Value: =NORM.S.INV(A2/2)

    A11 Upper Critical Value: =NORM.S.INV(1-A2/2)

    A12 p-Value: =2*(1-NORM.S.DIST(ABS(A8), TRUE))

    A13 Left Tailed Test: Alternative Hypothesis H1: P < P0

    A14 Lower Critical Value: =NORM.S.INV(A2)

    A15 p-Value: =NORM.S.DIST(A8, TRUE)

    A16 Right Tailed Test: Alternative Hypothesis H1: P > P0

    A17 Upper Critical Value: =NORM.S.INV(1-A2)

    A18 p-Value: =1-NORM.S.DIST(A8, TRUE)

    A19 Conclusion

    A20 Reject or Do not reject H0: =IF(A12 < A2, "Reject H0", "Do not reject H0")


    2.10 Comparison and conclusion

Two tailed test:

    1. Critical value approach:

    Find the table value at the chosen level of significance α. Compare this value with the calculated value.

    (a) If |cal| ≤ tab, do not reject the null hypothesis.

    (b) If |cal| > tab, reject the null hypothesis.

    2. p-value approach:

    Compute the p-value.

    (a) If p ≥ α, do not reject the null hypothesis.

    (b) If p < α, reject the null hypothesis.

    One tailed test:

    1. Right tailed test:

    (a) Critical value approach:

    Find the table value at the chosen level of significance α. Compare this value with the calculated value.

    i. If cal ≤ tab, do not reject the null hypothesis.

    ii. If cal > tab, reject the null hypothesis.

    (b) p-value approach:

    Compute the p-value.

    i. If p ≥ α, do not reject the null hypothesis.

    ii. If p < α, reject the null hypothesis.


2. Left tailed test:

    (a) Critical value approach:

    Find the table value at the chosen level of significance α. Compare this value with the calculated value.

    i. If cal > tab, do not reject the null hypothesis.

    ii. If cal ≤ tab, reject the null hypothesis.

    (b) p-value approach: Compute the p-value.

    i. If p ≥ α, do not reject the null hypothesis.

    ii. If p < α, reject the null hypothesis.


    Chapter 3

    Testing of hypothesis-Two sample

    problem

    3.1 Introduction

In this chapter, we discuss the procedures used to test for a significant difference between parameters of two independent populations. There are several cases, which are discussed in the following sections.

    3.2 Assumptions

    In this section, we give some important points regarding the testing procedures used

    in two sample problem.

1. The variable under study is measured on an interval or ratio scale.

    2. The populations follow normal distributions.

    3. Population variances are equal, i.e., σ1² = σ2².

    4. Samples are independent.

    5. Responses are independent within the samples.


    3.3 Test for difference of means: Z-test

1. Null hypothesis: H0 : μ1 = μ2 (or μ1 ≥ μ2, μ1 ≤ μ2).

    2. Alternative hypothesis: H1 : μ1 ≠ μ2 (or μ1 > μ2, μ1 < μ2).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    Two tailed test:

    Z = |X̄1 − X̄2| / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

    One tailed test:

    Z = (X̄1 − X̄2) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

    When σ1 = σ2 = σ, the standard error reduces to σ√(1/n1 + 1/n2).

    5. Comparison and Conclusion.
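The two-sample Z statistic for known population standard deviations can be sketched as follows; the summary figures are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_two_means(xbar1, xbar2, sigma1, sigma2, n1, n2):
    """Z statistic and two-tailed p-value for H0: mu1 = mu2 (sigmas known)."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)    # standard error of the difference
    z = (xbar1 - xbar2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))        # two-tailed p-value
    return z, p

# Hypothetical summary data: means 105 and 100, both sigmas 10, both n = 50
z, p = z_two_means(105, 100, 10, 10, 50, 50)
```

Here z = 2.5 and the p-value is about 0.012, so H0 would be rejected at α = 0.05.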


3.3.1 Testing Using Excel: σ1² = σ2² = σ² (known)

A1  Null Hypothesis: μ1 = μ2

    A2  Level of Significance (α): 0.05

    A3  Sample Mean 1: X̄1

    A4  Sample Mean 2: X̄2

    A5  Sample Size 1: n1

    A6  Sample Size 2: n2

    A7  Population Standard Deviation: σ

    A8  Intermediate Calculations

    A9  S.E. of Difference of Means: σ√(1/n1 + 1/n2) = A7*SQRT((1/A5)+(1/A6))

    A10 Z test Statistic: Z = (X̄1 − X̄2)/(σ√(1/n1 + 1/n2)) = (A3 − A4)/A9

    A11 Two Tailed Test: Alternative Hypothesis H1: μ1 ≠ μ2

    A12 Lower Critical Value: =NORM.S.INV(A2/2)

    A13 Upper Critical Value: =NORM.S.INV(1-A2/2)

    A14 p-Value: =2*(1-NORM.S.DIST(ABS(A10), TRUE))

    A15 Left Tailed Test: Alternative Hypothesis H1: μ1 < μ2

    A16 Lower Critical Value: =NORM.S.INV(A2)

    A17 p-Value: =NORM.S.DIST(A10, TRUE)

    A18 Right Tailed Test: Alternative Hypothesis H1: μ1 > μ2

    A19 Upper Critical Value: =NORM.S.INV(1-A2)

    A20 p-Value: =1-NORM.S.DIST(A10, TRUE)

    A21 Conclusion: Reject or Do not reject H0


    3.3.2 Testing Using Excel: Unequal Variances (Known)

A1  Null Hypothesis: μ1 = μ2

    A2  Level of Significance (α): 0.05

    A3  Sample Mean 1: X̄1

    A4  Sample Mean 2: X̄2

    A5  Sample Size 1: n1

    A6  Sample Size 2: n2

    A7  Population Standard Deviation 1: σ1

    A8  Population Standard Deviation 2: σ2

    A9  Intermediate Calculations

    A10 S.E. of Difference of Means: √(σ1²/n1 + σ2²/n2) = SQRT((A7^2/A5)+(A8^2/A6))

    A11 Z test Statistic: Z = (X̄1 − X̄2)/√(σ1²/n1 + σ2²/n2) = (A3 − A4)/A10

    A12 Two Tailed Test: Alternative Hypothesis H1: μ1 ≠ μ2

    A13 Lower Critical Value: =NORM.S.INV(A2/2)

    A14 Upper Critical Value: =NORM.S.INV(1-A2/2)

    A15 p-Value: =2*(1-NORM.S.DIST(ABS(A11), TRUE))

    A16 Left Tailed Test: Alternative Hypothesis H1: μ1 < μ2

    A17 Lower Critical Value: =NORM.S.INV(A2)

    A18 p-Value: =NORM.S.DIST(A11, TRUE)

    A19 Right Tailed Test: Alternative Hypothesis H1: μ1 > μ2

    A20 Upper Critical Value: =NORM.S.INV(1-A2)

    A21 p-Value: =1-NORM.S.DIST(A11, TRUE)

    A22 Conclusion: Reject or Do not reject H0


3.4 Test for difference of means: t-test

1. Null hypothesis: H0 : μ1 = μ2 (or μ1 ≥ μ2, μ1 ≤ μ2).

    2. Alternative hypothesis: H1 : μ1 ≠ μ2 (or μ1 > μ2, μ1 < μ2).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    (a) When the assumption of equality of variances is satisfied (σ1² = σ2² = σ²) and σ² is unknown:

    Two tailed test:

    t = |X̄1 − X̄2| / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.

    One tailed test:

    t = (X̄1 − X̄2) / (S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.

    where S is the pooled standard deviation,

    S² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)

    (b) When the assumption of equality of variances is not satisfied (σ1² ≠ σ2²):

    Two tailed test:

    t = |X̄1 − X̄2| / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.

    One tailed test:

    t = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.

    where

    S1² = Σᵢ(Xi − X̄1)²/(n1 − 1) and S2² = Σᵢ(Yi − X̄2)²/(n2 − 1)

    5. Comparison and Conclusion.
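The pooled (equal-variance) t statistic in case (a) can be sketched with the standard library; the two samples below are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def pooled_t(sample1, sample2):
    """Two-sample t statistic under equal (unknown) variances, df = n1 + n2 - 2."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = stdev(sample1), stdev(sample2)
    # pooled standard deviation
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (mean(sample1) - mean(sample2)) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

a = [5, 6, 7, 8]      # hypothetical sample 1
b = [3, 4, 5, 6]      # hypothetical sample 2
t, df = pooled_t(a, b)
# Compare |t| with the t-table value, e.g. 2.447 for alpha = 0.05, df = 6
```

Here t is about 2.19, below 2.447, so H0 would not be rejected at α = 0.05.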


3.4.1 Testing Using Excel: σ1² = σ2² = σ² (unknown)

A1  Null Hypothesis: μ1 = μ2

    A2  Level of Significance (α): 0.05

    A3  Sample Mean 1: X̄1

    A4  Sample Mean 2: X̄2

    A5  Sample Size 1: n1

    A6  Sample Size 2: n2

    A7  Sample Standard Deviation 1: S1

    A8  Sample Standard Deviation 2: S2

    A9  Pooled Estimate: S = SQRT(((A5-1)*A7^2 + (A6-1)*A8^2)/(A5+A6-2))

    A10 Intermediate Calculations

    A11 S.E. of Difference of Means: S√(1/n1 + 1/n2) = A9*SQRT((1/A5)+(1/A6))

    A12 t test Statistic: t = (X̄1 − X̄2)/(S√(1/n1 + 1/n2)) = (A3 − A4)/A11

    A13 Two Tailed Test: Alternative Hypothesis H1: μ1 ≠ μ2

    A14 Lower Critical Value: =T.INV(A2/2, A5+A6-2)

    A15 Upper Critical Value: =T.INV(1-A2/2, A5+A6-2)

    A16 p-Value: =2*(1-T.DIST(ABS(A12), A5+A6-2, TRUE))

    A17 Left Tailed Test: Alternative Hypothesis H1: μ1 < μ2

    A18 Lower Critical Value: =T.INV(A2, A5+A6-2)

    A19 p-Value: =T.DIST(A12, A5+A6-2, TRUE)

    A20 Right Tailed Test: Alternative Hypothesis H1: μ1 > μ2

    A21 Upper Critical Value: =T.INV(1-A2, A5+A6-2)

    A22 p-Value: =1-T.DIST(A12, A5+A6-2, TRUE)

    A23 Conclusion: Reject or Do not reject H0


    3.4.2 Testing Using Excel: Unequal Variances (Unknown)

A1  Null Hypothesis: μ1 = μ2

    A2  Level of Significance (α): 0.05

    A3  Sample Mean 1: X̄1

    A4  Sample Mean 2: X̄2

    A5  Sample Size 1: n1

    A6  Sample Size 2: n2

    A7  Sample Standard Deviation 1: S1

    A8  Sample Standard Deviation 2: S2

    A9  Intermediate Calculations

    A10 S.E. of Difference of Means: √(S1²/n1 + S2²/n2) = SQRT((A7^2/A5)+(A8^2/A6))

    A11 t test Statistic: t = (X̄1 − X̄2)/√(S1²/n1 + S2²/n2) = (A3 − A4)/A10

    A12 Two Tailed Test: Alternative Hypothesis H1: μ1 ≠ μ2

    A13 Lower Critical Value: =T.INV(A2/2, A5+A6-2)

    A14 Upper Critical Value: =T.INV(1-A2/2, A5+A6-2)

    A15 p-Value: =2*(1-T.DIST(ABS(A11), A5+A6-2, TRUE))

    A16 Left Tailed Test: Alternative Hypothesis H1: μ1 < μ2

    A17 Lower Critical Value: =T.INV(A2, A5+A6-2)

    A18 p-Value: =T.DIST(A11, A5+A6-2, TRUE)

    A19 Right Tailed Test: Alternative Hypothesis H1: μ1 > μ2

    A20 Upper Critical Value: =T.INV(1-A2, A5+A6-2)

    A21 p-Value: =1-T.DIST(A11, A5+A6-2, TRUE)

    A22 Conclusion: Reject or Do not reject H0


    3.5 Test for difference of two proportions

1. Null hypothesis: H0 : P1 = P2 (or P1 ≥ P2, P1 ≤ P2).

    2. Alternative hypothesis: H1 : P1 ≠ P2 (or P1 > P2, P1 < P2).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    Two tailed test:

    Z = |p1 − p2| / √(p1(1 − p1)/n1 + p2(1 − p2)/n2) ~ N(0, 1)

    One tailed test:

    Z = (p1 − p2) / √(p1(1 − p1)/n1 + p2(1 − p2)/n2) ~ N(0, 1)

    Test statistic using a pooled estimate:

    Two tailed test:

    Z = |p1 − p2| / √(p̄(1 − p̄)(1/n1 + 1/n2)) ~ N(0, 1)

    One tailed test:

    Z = (p1 − p2) / √(p̄(1 − p̄)(1/n1 + 1/n2)) ~ N(0, 1)

    where p1 and p2 are the sample proportions and

    p̄ = (n1p1 + n2p2)/(n1 + n2)

    5. Comparison and Conclusion.
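The pooled two-proportion Z statistic can be sketched as follows; the counts are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_two_proportions(x1, n1, x2, n2):
    """Pooled two-tailed Z test for H0: P1 = P2."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: 45 of 100 versus 30 of 100
z, p_value = z_two_proportions(45, 100, 30, 100)
```

Here z is about 2.19 with a p-value near 0.029, so H0 would be rejected at α = 0.05.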


    3.5.1 Testing Using Excel: Test for Difference of Proportions

A1  Null Hypothesis: P1 = P2

    A2  Level of Significance (α): 0.05

    A3  Sample Size 1: n1

    A4  Sample Size 2: n2

    A5  Number of Items of Interest 1: X1

    A6  Number of Items of Interest 2: X2

    A7  Sample Proportion 1: p1 = A5/A3

    A8  Sample Proportion 2: p2 = A6/A4

    A9  Pooled Estimate: p̄ = (A3*A7 + A4*A8)/(A3 + A4)

    A10 Intermediate Calculations

    A11 S.E.: =SQRT(A9*(1-A9)*((1/A3)+(1/A4)))

    A12 Z test Statistic: Z = (A7 − A8)/A11

    A13 Two Tailed Test: Alternative Hypothesis H1: P1 ≠ P2

    A14 Lower Critical Value: =NORM.S.INV(A2/2)

    A15 Upper Critical Value: =NORM.S.INV(1-A2/2)

    A16 p-Value: =2*(1-NORM.S.DIST(ABS(A12), TRUE))

    A17 Left Tailed Test: Alternative Hypothesis H1: P1 < P2

    A18 Lower Critical Value: =NORM.S.INV(A2)

    A19 p-Value: =NORM.S.DIST(A12, TRUE)

    A20 Right Tailed Test: Alternative Hypothesis H1: P1 > P2

    A21 Upper Critical Value: =NORM.S.INV(1-A2)

    A22 p-Value: =1-NORM.S.DIST(A12, TRUE)

    A23 Conclusion: Reject or Do not reject H0


    3.6 Test for dependent samples

In the previous section, we discussed the testing procedure used to test a hypothesis about the difference between two population means when the samples are independent. In this section, we discuss a testing procedure for dependent samples. This is the case where responses are taken from the same set of individuals before an experiment and after the experiment. It is also used when the samples are matched samples.

    Suppose that, in a marketing research study, a researcher wants to know the effect of his company's product on customers. He selects a sample of n customers from a population and records their responses; he then introduces the same product with some additions to it, requests the customers to use the modified product, and records their responses again after a month. Here the variable measured is the weight of the customers, and the researcher is interested in testing the hypothesis: "Is there any significant difference between the average weight of the customers before and after the additions to the product?"

1. Null hypothesis: H0 : μ1 = μ2, i.e., μD = μ1 − μ2 = 0 (or μD ≥ 0, μD ≤ 0).

    2. Alternative hypothesis: H1 : μD ≠ 0 (or μD > 0, μD < 0).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    t = d̄ / (Sd/√n) ~ t with (n − 1) d.f.

    where d̄ and Sd are the mean and standard deviation of the n paired differences.

    5. Comparison and Conclusion.
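The paired t statistic can be sketched with the standard library; the before/after values are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: t = d-bar / (S_d / sqrt(n)), df = n - 1."""
    d = [a - b for a, b in zip(after, before)]   # paired differences
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

before = [70, 68, 72, 75, 71]   # hypothetical weights before
after  = [72, 70, 71, 78, 74]   # hypothetical weights after
t, df = paired_t(before, after)
# Compare |t| with the t-table value, e.g. 2.776 for alpha = 0.05, df = 4
```

Here t is about 2.45, below 2.776, so H0 would not be rejected at α = 0.05.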


    3.6.1 Testing Using Excel

A1  Null Hypothesis: μD = μ1 − μ2 = 0

    A2  Level of Significance (α): 0.05

    A3  Sample Mean of Differences: d̄

    A4  Sample Size (number of pairs): n

    A5  Degrees of Freedom (d.f.): n − 1

    A6  Sample Standard Deviation of Differences: Sd

    A7  Intermediate Calculations

    A8  S.E.: Sd/√n = A6/SQRT(A4)

    A9  t test Statistic: t = d̄/(Sd/√n) = A3/A8

    A10 Two Tailed Test: Alternative Hypothesis H1: μD ≠ 0

    A11 Lower Critical Value: =T.INV(A2/2, A5)

    A12 Upper Critical Value: =T.INV(1-A2/2, A5)

    A13 p-Value: =2*(1-T.DIST(ABS(A9), A5, TRUE))

    A14 Left Tailed Test: Alternative Hypothesis H1: μD < 0

    A15 Lower Critical Value: =T.INV(A2, A5)

    A16 p-Value: =T.DIST(A9, A5, TRUE)

    A17 Right Tailed Test: Alternative Hypothesis H1: μD > 0

    A18 Upper Critical Value: =T.INV(1-A2, A5)

    A19 p-Value: =1-T.DIST(A9, A5, TRUE)

    A20 Conclusion: Reject or Do not reject H0


3.7 Test for difference of variances: F-test

This is an important test, used to test the hypothesis H0 : σ1² = σ2².

    1. Null hypothesis: H0 : σ1² = σ2².

    2. Alternative hypothesis: H1 : σ1² ≠ σ2² (or σ1² > σ2², σ1² < σ2²).

    3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).

    4. Test Statistic: Under H0,

    F = S1²/S2² ~ F with (n1 − 1, n2 − 1) d.f.

    where S1² and S2² are the sample variances; by convention, the larger sample variance is placed in the numerator.

    5. Comparison and Conclusion.

    Two tailed test:

    1. Critical value approach:

    Find the table value at the chosen level of significance α. Compare this value with the calculated value.

    (a) If cal ≤ tab, do not reject the null hypothesis.

    (b) If cal > tab, reject the null hypothesis.

2. p-value approach:

    Compute the p-value.

    (a) If p ≥ α, do not reject the null hypothesis.

    (b) If p < α, reject the null hypothesis.

    One tailed test:

    1. Right tailed test:

    (a) Critical value approach:

Find the table value at the chosen level of significance α. Compare this value with the calculated value.


i. If cal ≤ tab, do not reject the null hypothesis.

    ii. If cal > tab, reject the null hypothesis.

    (b) p-value approach:

    Compute the p-value.

    i. If p ≥ α, do not reject the null hypothesis.

    ii. If p < α, reject the null hypothesis.

    2. Left tailed test:

    (a) Critical value approach:

    Find the table value at the chosen level of significance α. Compare this value with the calculated value.

    i. If cal > tab, do not reject the null hypothesis.

    ii. If cal ≤ tab, reject the null hypothesis.

    (b) p-value approach: Compute the p-value.

    i. If p ≥ α, do not reject the null hypothesis.

    ii. If p < α, reject the null hypothesis.
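The F statistic for comparing two variances can be sketched with the standard library (statistics.variance divides by n − 1); the samples are hypothetical:

```python
from statistics import variance

def f_statistic(sample1, sample2):
    """F = larger sample variance / smaller sample variance,
    with the corresponding (n - 1, n - 1) degrees of freedom."""
    s1sq, s2sq = variance(sample1), variance(sample2)
    if s1sq >= s2sq:
        return s1sq / s2sq, (len(sample1) - 1, len(sample2) - 1)
    return s2sq / s1sq, (len(sample2) - 1, len(sample1) - 1)

a = [5, 6, 7, 8, 9]   # hypothetical sample 1 (variance 2.5)
b = [4, 5, 6, 7]      # hypothetical sample 2 (variance 5/3)
F, dfs = f_statistic(a, b)
```

Here F = 1.5 with (4, 3) degrees of freedom; compare it with the F-table value at the chosen α.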


    Chapter 4

    Chi-Square tests

4.1 Introduction

    The statistical-inference techniques presented so far have dealt exclusively with hypothesis tests and confidence intervals for population parameters, such as population means and population proportions. In this chapter, we consider three widely used inferential procedures that are not concerned with population parameters. These three procedures are often called chi-square procedures because they rely on a distribution called the chi-square distribution.

The distribution is also important in discrete hedging of options in finance, as well as in option pricing. This distribution is used to construct the confidence interval for the population variance σ². Also note that this distribution is derived from the normal distribution: the square of a standard normal variate is a chi-square random variable with 1 degree of freedom. Similarly, if we square n independent standard normal random variables and add them, we get a chi-square random variable with n degrees of freedom.
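The construction from squared standard normals can be illustrated by simulation with the standard library; the chi-square(df) distribution has mean df, so the simulated average should land near 5 for df = 5:

```python
import random

random.seed(1)  # fixed seed so the simulation is reproducible

def chi_square_sample(df):
    """One draw: sum of df squared standard normal variates ~ chi-square(df)."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# chi-square(df) has mean df and variance 2*df
draws = [chi_square_sample(5) for _ in range(10000)]
m = sum(draws) / len(draws)
```

Since each draw is a sum of squares, every value is non-negative, and the sample mean m is close to the 5 degrees of freedom used.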

    The tests discussed in this chapter have wide applicability