Statistics for Management Manual Dr. Seema Sharma


Post on 11-May-2017


Statistics for Management

Manual

Dr. Seema Sharma


Statistics for Management (SMV 793)

Course Coordinator: Dr. Seema Sharma

Introduction

Today's business decisions are driven by data. Business managers and professionals are increasingly encouraged to justify decisions on the basis of data. Statistical tools help them make business decisions under uncertainty. Business managers need statistical model-based decision support systems, which apply descriptive statistics, probability models, interval estimates, hypothesis testing and forecasting techniques to management problems.

Objectives

The course is designed to meet the following objectives:

To provide the student with an understanding of the application of statistics in business and research situations.

To provide the student with the ability to intelligently analyze data in order to apply the relevant quantitative skills needed to make effective management decisions.

To provide the student with the ability to interpret and explain the results of various statistical tests.

By the end of this course students will be able to apply statistical concepts and methodologies while performing data analysis.

Course Structure

Nature and role of statistics for management. Descriptive Statistics: Measures of Central Tendency, Measures of Dispersion. Introduction to probability theory. Probability Theory: Preliminary concepts in Probability, Basic Theorems and rules for dependent/independent events, Random Variable, Expected value and Variance of a Random Variable. Probability distributions. Sampling distributions. Estimation and hypothesis testing: t-tests, ANOVA, Chi-square tests, Non-parametric statistics. Correlation and regression analysis. Introduction to SPSS and its use for statistical modeling.

Lecture Plan:

Week 1: Nature and role of statistics for management.
Week 2: Descriptive Statistics: Measures of Central Tendency.
Week 3: Measures of Dispersion.
Week 4 & 5: Introduction to probability theory. Probability Theory: Preliminary concepts in Probability.
Week 6, 7: Basic Theorems and rules for dependent/independent events, Random Variable, Expected value and Variance of a Random Variable.
Week 8, 9: Probability distributions. Sampling distributions.
Week 10, 11, 12: Estimation and hypothesis testing: t-tests, ANOVA, Chi-square tests, Non-parametric statistics.
Week 13: Correlation and regression analysis.
Week 14: Introduction to SPSS and its use for statistical modeling.

Major Test

Evaluation

Apart from the major exam at the end of the semester, students will be examined throughout the semester through minor exams, case studies and home assignments. The marks are distributed as follows:

Major Exam                             60 Marks
Two Quizzes                            20 Marks
Assignments & Class Participation      20 Marks

Suggested Readings

1. Richard I. Levin and David S. Rubin, "Statistics for Management", PHI, New Delhi, 1997.

2. Anderson, Sweeney and Williams, "Statistics for Business and Economics", South-Western College Publishing, Ohio, 1998.

3. Murray R. Spiegel, "Theory and Problems of Statistics", 2/ed., Schaum's Outline Series, McGraw-Hill Book Company, London, 1992.

4. Mark L. Berenson and David M. Levine, "Basic Business Statistics: Concepts and Applications", 5/ed., Prentice Hall, Englewood Cliffs, New Jersey, 1992.

5. R.P. Hooda, "Statistics for Business and Economics", MacMillan India Ltd., New Delhi, 1997.

6. S.C. Gupta, "Fundamentals of Statistics", HPI, New Delhi, 1996.


STATISTICS

Statistics refers to the body of techniques used for collecting, organizing, presenting and analyzing data, as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis. The data may be quantitative, with values expressed numerically, or qualitative, with the characteristics of observations being tabulated.

ORIGIN AND DEVELOPMENT

The great statisticians Sir Francis Galton (1822-1911), Karl Pearson (1857-1936) and W.S. Gosset pioneered regression analysis, correlation analysis together with the chi-square test, and the t-test, respectively. Ronald A. Fisher, who is rightly termed the Father of Statistics, extended statistics to a variety of fields such as biometry, genetics, psychology, education and agriculture. He was also a pioneer in Estimation Theory, Sampling Distribution Theory, Analysis of Variance and Design of Experiments. For his contributions to statistics, Fisher is described as the real giant in the development of the THEORY OF STATISTICS.

The Role of Statistics in Managerial Decision-Making

A manager has to deal with statistics in the following four situations:

(i) When data need to be presented in a form which helps in easy grasping (e.g. presentation of performance data in graphs, charts and tables in the annual reports of a company).
(ii) When some unknown statistical relationships have to be tested.
(iii) When some hypothesis testing has to be made and inferences have to be drawn.
(iv) When a decision has to be made under uncertainty regarding a course of action to be followed.

The role that statistics can play in managerial decision-making is indicated in the accompanying flow diagram. Every managerial decision-making problem begins with a real-world problem. This problem is then formulated in managerial terms and framed as a managerial question. The next sequence of steps (proceeding counterclockwise around the flow diagram) identifies the role that statistics can play in this process. The managerial question is translated into a statistical question, the sample data are collected and analysed, and the statistical question is answered. The next step in the process is using the answer to the statistical question to reach an answer to the managerial question. The answer to the managerial question may suggest a reformulation of the original managerial problem, suggest a new managerial question, or lead to the solution of the managerial problem.

One of the most difficult steps in the decision-making process, and one that requires a cooperative effort among managers and statisticians, is the translation of the managerial question into a statistical question. This statistical question must be formulated so that, when answered, it will provide the key to the answer to the managerial question.

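[Figure: flow diagram of the role of statistics in managerial decision-making. Boxes: Real-world problem; Managerial formulation of problem; Managerial question relating to problem; Statistical formulation of question; Statistical analysis; Answer to statistical question; Answer to managerial question; Managerial solution to problem. A feedback path is labelled "New question".]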


DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics include the techniques that are used to summarize and describe numerical data. Inferential statistics include those techniques by which decisions about a statistical population or process are made based only on a sample having been observed. Because such decisions are made under conditions of uncertainty, the use of probability concepts is required. Whereas the measured characteristics of a sample are called sample statistics, the measured characteristics of a statistical population, or universe, are called population parameters.

DISCRETE AND CONTINUOUS VARIABLES

A discrete variable can have observed values only at isolated points along a scale of values. In business statistics, such data typically occur through the process of counting; hence, the values generally are expressed as integers (whole numbers). A continuous variable can assume a value at any fractional point along a specified interval of values. Continuous data are generated by the process of measuring.

Example: Examples of discrete data are the number of persons per household, the units of an item in inventory, and the number of assembled components which are found to be defective. Examples of continuous data are the weight of a shipment, the length of time before the first failure of a device, and the average number of persons per household in a large community. Note that an average number of persons can be a fractional value and is thus a continuous variable, even though the number per household is a discrete variable.

FREQUENCY DISTRIBUTIONS

It is useful to distribute the data into classes & to determine the number of individuals belonging to each class, called class frequency. A tabular arrangement of data by classes together with the corresponding frequencies is called a frequency distribution.

- Individual Series
- Discrete Series
- Continuous Series

TYPES OF DATA

Statistics is concerned with measurements of one or more variables of a sample of units drawn from a population. These measurements are referred to as data. Data are generally classified into four types:

- Nominal
- Ordinal
- Interval
- Ratio

Nominal

Nominal data (also referred to as categorical data) are labels or names that identify the category to which each unit belongs.

Example: the gender of each individual in a sample of seven applicants for a computer programming job. Nominal data are often reported as nonnumerical labels, such as male or female in this example.

Ordinal

Ordinal data indicate the relative amount of a property possessed by the units.

Examples:
1. The size of the car rented by each individual in a sample of 20 business travelers: compact, midsize or full-size.
2. A taste-tester's ranking of four brands of tomato sauce.
3. A supervisor's annual ranking of the performance of his 30 employees using a scale of 1 (worst performance) to 10 (best performance).

Hence ordinal data simply provide the ordering or ranking of the units in a sample or population.

Interval

Interval data are measurements that enable the determination of how much more or less of the measured characteristic is possessed by one unit than another.

Example: The temperature (in degrees Fahrenheit) at which each of a sample of 20 pieces of heat-resistant plastic begins to melt.

The key feature of this type of data is that the zero point (origin) does not indicate the absence of the characteristic of interest, e.g. the origin on the temperature scale does not indicate the absence of heat. Temperatures lower than 0° (e.g. -10°C and -10°F) indicate that less heat is present, so 0° does not mean 'no heat'.

Ratio

Ratio data are measurements that enable the determination of how many times as much of the measured characteristic is possessed by one unit than another. Ratio data are always numerical. The zero point or origin indicates the absence of the characteristic measured.

Examples:
1. Sales revenue for each firm in a sample of 20 firms.
2. The number of unemployed people in a city, or the unemployment rate.
3. The number of cars sold in a country in a particular year.

The ratio data represents the highest level of measurement. Most numerical business data are measured on scales for which the origin is meaningful. Thus, most numerical measurements encountered in business are ratio data.

The four types of data are often combined into two classes that are sufficient for most statistical applications. Nominal and ordinal data are often referred to as qualitative data, whereas interval and ratio data are called quantitative data.

Question for self-study:

Q. Suppose you are provided a data set that classifies each sample unit into one of four categories: A, B, C or D. You plan to create a computer database consisting of these data, and you decide to code the data as A = 1, B = 2, C = 3 and D = 4 for entering them into the computer. Are the data consisting of the classification A, B, C and D qualitative or quantitative? After the data are entered as 1, 2, 3 and 4, are they qualitative or quantitative? Explain your answer.

DESCRIPTIVE MEASURES

1. Measures of Central Tendency
- Arithmetic Mean ($\bar{X}$)
- Median
- Mode

2. Measures of Dispersion
- Absolute Measures
- Relative Measures

A Measure of Central Tendency is a single value that represents the entire series of data.

Discussion Area:
1. How are the different measures of central tendency computed for the different types of frequency distribution?
2. Is there any relationship between the different measures?

Measures of Dispersion

The degree to which individual values tend to scatter around the average value is called the dispersion or variation of the data.

Absolute Measures:
- Range
- Standard Deviation
- Variance

Relative Measure:
- Coefficient of Variation

Absolute measures depend on the unit of measurement of the data, whereas relative measures are independent of it. Therefore a relative measure, unlike an absolute measure, can be used to compare the variability or uniformity of two or more distributions.

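Standard Deviation:

Ungrouped Data

$$\text{S.D.} = \sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

Grouped Data

$$\text{S.D.} = \sigma = \sqrt{\frac{\sum f\,(X - \bar{X})^2}{N}}, \qquad N = \sum f$$

Variance = $\sigma^2$ = (S.D.)$^2$

Coefficient of Variation (C.V.):

$$\text{C.V.} = \frac{\sigma}{\bar{X}} \times 100$$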
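The descriptive measures listed in this section can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the data values are hypothetical (they are not taken from the manual), and it uses the population forms `pvariance` and `pstdev`, which divide by the number of observations N. The sample forms `variance` and `stdev` divide by N - 1 instead.

```python
import statistics

# Hypothetical sample (not from the manual): weekly sales in units.
data = [12, 15, 11, 15, 18, 14, 15, 20]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Absolute measures of dispersion (population form, divisor N).
value_range = max(data) - min(data)
variance = statistics.pvariance(data)
sd = statistics.pstdev(data)

# Relative measure of dispersion: coefficient of variation, in percent.
cv = sd / mean * 100

print(mean, median, mode, value_range, variance, round(sd, 4), round(cv, 2))
```

Because the coefficient of variation is unit-free, it is the figure to compare when two series are measured on different scales.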

Problems for self study

1. Following is a sample of yields for 10 shares traded on the New York Stock Exchange.

Issuer             Yield (%)     Issuer           Yield (%)
Argosy             12.6          Caterpillar      6.3
Chase Manhattan    6.7           Dow              6.8
IBM                7.0           Lucent           6.7
Mobil              7.3           Pacific Bell     6.7
RJR Nabisco        8.1           Service Mdse     8.6

Compute the following descriptive statistics.
i. Mean, median, and mode.
ii. The variance and standard deviation.
iii. Coefficient of variation.

2. A survey was conducted concerning the ability of computer manufacturers to handle problems quickly. The following results were obtained.

Company          Days to Resolve Problems     Company            Days to Resolve Problems
Compaq           13                           Gateway            21
Packard Bell     27                           Digital            27
Quantex          11                           IBM                12
Dell             14                           Hewlett-Packard    14
NEC              14                           AT&T               20
AST              17                           Toshiba            37
Acer             16                           Micron             17

a. What are the mean and median number of days needed to resolve problems?
b. What are the variance and standard deviation?
c. Which manufacturer has the best record?

3. Public transportation and the automobile are two methods an employee can use to get to work each day. Samples of times recorded for each method are shown. Times are in minutes.

Public Transportation: 28 29 32 37 33 25 29 32 41 34
Automobile: 29 31 33 32 34 30 31 32 35 33

a. Compute the sample mean time to get to work for each method.
b. Compute the sample standard deviation for each method.
c. On the basis of your results from (a) and (b), which method of transportation should be preferred? Explain.

PROBABILITY THEORY

Preliminary Concepts

Random Experiment: An experiment that can result in more than one outcome is called a Random Experiment or Statistical Experiment.

Event: Each distinct outcome of an experiment is called a simple event.

Sample Space: Collection of all possible distinct outcomes of an experiment is called the sample space of outcomes.

e.g. tossing an unbiased coin is a random experiment.
Sample Space = {H, T}, where H and T are the two events.
In the case of two unbiased coins:
Sample Space = {HH, HT, TH, TT}

Mutually Exclusive Events: Two or more events are called mutually exclusive if the occurrence of any one of them excludes the occurrence of the others, e.g. in an experiment of tossing a coin, the occurrence of head and the occurrence of tail are mutually exclusive. A head excludes the outcome of a tail and vice versa.

Equally Likely Events: Events are said to be equally likely if none of them is expected to occur in preference to the others. If we roll a die, any number out of 1, 2, 3, ..., 6 can come up. Therefore, all six numbers are equally likely, i.e. they have equal chances of occurring.

Independent Events: Two events are said to be independent if the happening of one is not affected by, and does not affect, the happening of the other, e.g. suppose we have a bag with 5 red and 5 green balls. Now suppose a green ball is taken out and then put back. Thereafter, a red ball is taken out. The draw of the red ball is not affected by the previous draw involving the green ball; therefore the two events, i.e. drawing a green ball first and then a red ball, are independent.

If the green ball had not been put back, the chance of drawing a red ball would have been affected, because the total number of balls left in the bag would be 9 instead of 10.
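The bag example can be checked with exact fractions. The following is a minimal sketch; the variable names are illustrative only, and it simply compares the chance of a red ball on the second draw with and without the green ball being put back.

```python
from fractions import Fraction

red, green = 5, 5
total = red + green

# Probability of red on a single draw from the full bag.
p_red = Fraction(red, total)

# Green ball drawn and put back: the bag is unchanged for the second draw.
p_red_with_replacement = Fraction(red, total)

# Green ball drawn and kept out: only 9 balls remain, 5 of them red.
p_red_without_replacement = Fraction(red, total - 1)

# Independence means the second probability equals the unconditional one.
independent_with_replacement = p_red_with_replacement == p_red
independent_without_replacement = p_red_without_replacement == p_red

print(p_red_with_replacement, p_red_without_replacement)
```

With replacement the probability stays at 1/2, so the draws are independent; without replacement it becomes 5/9, so they are not.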

Definitions of Probability

Classical or Mathematical Probability:

If a random experiment results in N exhaustive, mutually exclusive and equally likely outcomes, out of which m are favourable to the event A, then the probability of occurrence of A is:

$$P(A) = \frac{m}{N} = \frac{\text{Favourable cases to } A}{\text{Exhaustive cases}}$$

where m is the number of cases favourable to A and N is the number of exhaustive cases.

The probability of non-occurrence of A is

$$P(\bar{A}) = 1 - P(A)$$

so that

$$P(\bar{A}) + P(A) = 1$$

Statistical or Empirical Probability:

If an experiment is performed repeatedly under homogeneous conditions, then the limiting value of the ratio of the number of times the event occurs to the number of trials, as the number of trials increases indefinitely, is called the probability of happening of the event.

$$P(A) = \lim_{n \to \infty} \frac{m}{n}$$

where m is the number of times the event A occurs in n trials.

LAWS OF PROBABILITY

Suppose we define two events A and B on a sample space S. Then the addition and multiplicative laws can be stated as follows.

[Figure: Venn diagram of the events A and B in the sample space S, showing their overlap $A \cap B$.]

Law of Addition:

(i) $P(A \cup B) = P(A) + P(B)$, if events A and B are mutually exclusive
(ii) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, if events A and B are not mutually exclusive

MULTIPLICATIVE LAW:

(i) $P(A \cap B) = P(A)\,P(B)$, if A and B are independent
(ii) $P(A \cap B) = P(A)\,P(B/A) = P(B)\,P(A/B)$, if A and B are dependent events

P(B/A) and P(A/B) are known as conditional probabilities.
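The two laws can be verified on a small sample space. The sketch below uses a single roll of a fair die as an assumed example (it is not from the manual), with A = "even number" and B = "number greater than 3"; the helper `prob` is a hypothetical name for the classical ratio of favourable to exhaustive cases.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (illustrative example).
S = {1, 2, 3, 4, 5, 6}
A = {x for x in S if x % 2 == 0}   # even number: {2, 4, 6}
B = {x for x in S if x > 3}        # greater than 3: {4, 5, 6}

def prob(event):
    """Classical probability: favourable cases / exhaustive cases."""
    return Fraction(len(event), len(S))

p_union = prob(A | B)
p_inter = prob(A & B)

# Conditional probability P(B/A): restrict attention to outcomes in A.
p_b_given_a = Fraction(len(A & B), len(A))

# Addition law for events that are not mutually exclusive.
addition_holds = p_union == prob(A) + prob(B) - p_inter

# Multiplicative law for dependent events.
multiplication_holds = p_inter == prob(A) * p_b_given_a

print(p_union, p_inter, p_b_given_a, addition_holds, multiplication_holds)
```

Here P(B/A) = 2/3 differs from P(B) = 1/2, so A and B are dependent and the simple product P(A) P(B) would give the wrong answer.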

Example

Q. A bag contains 4 white, 5 red and 6 green balls. Three balls are drawn at random. What is the chance that a white, a red and a green ball are drawn?

Solution:

There are 4 + 5 + 6 = 15 balls in the bag. Three balls can be drawn out of 15 in ${}^{15}C_{3} = 455$ ways.

One white ball can be drawn out of the 4 white balls in ${}^{4}C_{1}$ ways; one red ball can be drawn out of the 5 red balls in ${}^{5}C_{1}$ ways; and one green ball can be drawn out of the 6 green balls in ${}^{6}C_{1}$ ways.

Hence required probability = $\dfrac{{}^{4}C_{1} \times {}^{5}C_{1} \times {}^{6}C_{1}}{{}^{15}C_{3}} = \dfrac{4 \times 5 \times 6}{455} = \dfrac{24}{91}$
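The counting in this example can be reproduced with `math.comb`, which returns the number of combinations nCr. This is a short sketch with illustrative variable names.

```python
from fractions import Fraction
from math import comb

# Ways to draw any 3 balls out of the 15 in the bag.
total_ways = comb(15, 3)

# Ways to draw exactly one white, one red and one green ball.
favourable_ways = comb(4, 1) * comb(5, 1) * comb(6, 1)

# Classical probability as an exact fraction.
probability = Fraction(favourable_ways, total_ways)

print(total_ways, favourable_ways, probability)
```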

Questions for self study

1. An urn contains 8 white and 3 red balls. If two balls are drawn at random, find the probability that (i) both are white, (ii) both are red, (iii) one is of each colour.

2. The following data show the length of life of wholesale grocers in a particular city.

Length of Life (years)     Percentage of Wholesalers
0-5                        65
5-10                       16
10-15                      9
15-25                      5
25 and over                5
Total                      100

(i) During the period studied, what is the probability that an entrant to this profession will fail within five years?
(ii) What is the probability that he will survive at least 25 years?
(iii) How many years would he have to survive to be among the 10 percent longest survivors?

3. A committee of 4 persons is to be appointed from 3 officers of the production department, 4 officers of the purchase department, two of the sales department and one chartered accountant. Find the probability of forming the committee in the following manner:
(i) There must be one from each category.
(ii) It should have at least one from the purchase department.
(iii) The chartered accountant must be in the committee.

4. A chartered accountant applies for a job in two firms X and Y. He estimates that the probability of his being selected in firm X is 0.7, the probability of his being rejected at Y is 0.5, and the probability of at least one of his applications being rejected is 0.6. What is the probability that he will be selected in one of the firms?

5. There are 3 economists, 4 engineers, 2 statisticians and 1 doctor. A committee of 4 from among them is to be formed. Find the probability that the committee:
(i) Consists of one of each kind.
(ii) Has at least one economist.
(iii) Has the doctor as a member and three others.

6. Two vacancies exist at the junior executive level of a certain company. Twenty people, fourteen men and six women, are eligible and equally qualified. The company has decided to draw two names at random from the list of the eligible people. What is the probability that:
(a) Both positions will be filled by women?
(b) At least one of the positions will be filled by a woman?
(c) Neither of the positions will be filled by women?

Random Variable

Definition: A finite real-valued measurable function defined on a sample space is called a random variable. Its value is determined by the outcome of the experiment.

e.g. toss of two coins:

S = {HH, TH, HT, TT}

is the sample space of the experiment. If X denotes the number of heads, the four outcomes give, in the same order, the values

X = (2, 1, 1, 0)

so X takes the three distinct values 0, 1 and 2.

Types of random variable:

- Discrete Random Variable (DRV)
- Continuous Random Variable (CRV)

A DRV assumes a finite (or countable) set of isolated values, whereas a CRV can take every value within given limits.

Question for self study:

Q: Three coins are tossed simultaneously. A random variable X denotes the occurrence of two or more heads.

- Write down the sample space of the entire experiment.
- What is the type of the random variable under study, i.e. is X discrete or continuous?
- What values can be assigned to the random variable?

Discrete Probability Distribution

Let X be a discrete random variable and let the possible values it can assume be:

x1, x2, x3, ..., xn

Now suppose the corresponding probabilities of these values are given by p(x1), p(x2), p(x3), ..., p(xn). Then p(xi) can be referred to as a Discrete Probability Distribution if:

(i) $p(x_i) \ge 0$
(ii) $\sum p(x_i) = 1$

In simple words, a tabular presentation of all the different values of a discrete random variable along with their respective probabilities is known as a Discrete Probability Distribution, provided the above conditions are fulfilled.

For example, let X be a discrete random variable denoting the number of heads in a toss of two coins. Then the sample space of the experiment is S = {TT, TH, HT, HH}, and X takes the following values along with their respective probabilities:

X     p(x)
0     1/4
1     1/2
2     1/4

In the above table all the p(x) are greater than zero and the sum of all the probabilities is one. Therefore the above distribution is a discrete probability distribution.

The Binomial Distribution and the Poisson Distribution are examples of discrete distributions.
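The distribution of the number of heads in a toss of two coins can be built by enumerating the sample space and then checked against the two conditions. This is a minimal sketch; the names `outcomes` and `distribution` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the sample space for two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome; count how often each value occurs.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome is equally likely, so p(x) = (count of x) / (number of outcomes).
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# Conditions for a discrete probability distribution.
all_non_negative = all(p >= 0 for p in distribution.values())
sums_to_one = sum(distribution.values()) == 1

for x, p in distribution.items():
    print(x, p)
```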

Continuous Probability Distribution

Let X be a continuous random variable that can assume any value in an interval, and let p(x) be its probability density function. Then p(x) can be referred to as a Continuous Probability Distribution if:

(i) $p(x) \ge 0$ for all x

(ii) $\int_{-\infty}^{\infty} p(x)\,dx = 1$

The Normal Distribution is an example of a continuous probability distribution.

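Expected Value of a Random Variable

If X is a DRV having the possible values x1, x2, x3, ..., xn with probabilities p(x1), p(x2), p(x3), ..., p(xn), then the expected value of X is given by:

$$E(X) = x_1\,p(x_1) + x_2\,p(x_2) + x_3\,p(x_3) + \dots + x_n\,p(x_n)$$

Hence,

$$E(X) = \sum_{i=1}^{n} x_i\,p(x_i) = \mu$$

where $\mu$ is known as the mean of the random variable.

Properties of Expectation:

(i) E(C) = C, where C is a constant
(ii) E(aX + b) = a E(X) + b, where a and b are constants
(iii) E(X + Y) = E(X) + E(Y), where X and Y are random variables

Variance of Random Variable:

$$\begin{aligned}
\sigma^2 &= E[(X - \mu)^2] \\
&= E(X^2 - 2\mu X + \mu^2) \\
&= E(X^2) - 2\mu\,E(X) + \mu^2 \\
&= E(X^2) - 2\mu^2 + \mu^2 \\
&= E(X^2) - \mu^2
\end{aligned}$$

or

$$\sigma^2 = E(X^2) - [E(X)]^2, \quad \text{i.e. } \sigma^2 = \sum x_i^2\,p(x_i) - \mu^2, \quad \text{where } \mu = E(X)$$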

Example

(i) A die is thrown at random. What is the expectation of the number on it?

Solution

Let X denote the number on the die. Then X is a random variable which takes any one of the values 1, 2, 3, 4, 5, 6, each with equal probability 1/6, as given below:

X       1     2     3     4     5     6
P(x)    1/6   1/6   1/6   1/6   1/6   1/6

Now, $E(X) = \sum x\,P(x)$
= 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
= 1/6 (1 + 2 + 3 + 4 + 5 + 6)
= 21/6

Therefore, E(X) = 7/2
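The same expectation, together with the variance E(X^2) - [E(X)]^2, can be computed with exact fractions. This is a small sketch; the variance step goes one step beyond the worked example and is included only as an illustration.

```python
from fractions import Fraction

# Faces of a fair die, each with probability 1/6.
values = range(1, 7)
p = Fraction(1, 6)

# E(X) = sum of x * p(x)
mean = sum(x * p for x in values)

# E(X^2) = sum of x^2 * p(x)
mean_of_squares = sum(x * x * p for x in values)

# Var(X) = E(X^2) - [E(X)]^2
variance = mean_of_squares - mean ** 2

print(mean, mean_of_squares, variance)
```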

Problems for self study

1. A random variable X is defined as the sum of faces when a pair of dice is thrown. Find the expected value of X. Also find its variance.

2. An urn contains 7 white and 3 red balls. Two balls are drawn together, at random, from this urn. What is the expected number of white balls drawn?

3. A die is tossed twice. Getting a number greater than 4 is considered a success. Find the mean and variance of the probability distribution of the number of successes.

Normal Distribution

The Normal Distribution is one of the most important continuous theoretical distributions in Statistics.

Definition: If X is a continuous random variable following the Normal Probability Distribution with mean $\mu$ and standard deviation $\sigma$, then its probability density function (p.d.f.) is given by:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty$$

$\mu$ and $\sigma$ are called the parameters of the distribution, where Mean = $\mu$ and Variance = $\sigma^2$.

Properties of the Normal Distribution

1. The graph of the Normal Probability Curve is bell shaped.
2. The curve is symmetrical about the line X = $\mu$.
3. Since the distribution is symmetrical, Mean = Median = Mode (all coincide at a point).
4. The ordinate at X = $\mu$ divides the whole area under the curve into two equal parts.
5. The maximum of the density, occurring at X = $\mu$, is given by: $p(\mu) = \dfrac{1}{\sigma\sqrt{2\pi}}$

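AREAS UNDER NORMAL PROBABILITY CURVE

[Figure: normal curve with the X-axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, X = $\mu$, $\mu + \sigma$, $\mu + 2\sigma$ and $\mu + 3\sigma$; 99.73% of the area lies between $\mu - 3\sigma$ and $\mu + 3\sigma$.]

The following table gives the areas under the normal probability curve for some important values of Z:

Distance from the mean ordinate, in terms of $\pm\sigma$     Area under the curve
Z = ±0.6745                                                   50% = 0.50
Z = ±1.00                                                     68.26% = 0.6826
Z = ±1.96                                                     95% = 0.95
Z = ±2.0                                                      95.44% = 0.9544
Z = ±2.58                                                     99% = 0.99
Z = ±3.0                                                      99.73% = 0.9973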
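The tabulated areas can be checked numerically. The sketch below assumes the standard identity that the area of a standard normal curve between -z and +z equals erf(z / sqrt(2)); the function name `area_within` is illustrative.

```python
from math import erf, sqrt

def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    return erf(z / sqrt(2))

# Z values from the table and the areas the table reports for them.
table = {0.6745: 0.50, 1.00: 0.6826, 1.96: 0.95, 2.0: 0.9544, 2.58: 0.99, 3.0: 0.9973}

for z, tabulated in table.items():
    print(f"Z = {z}: computed {area_within(z):.4f}, tabulated {tabulated}")
```

The computed values agree with the table to about three decimal places; small differences come from rounding in printed tables.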

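How to Compute Areas under Normal Probability Curve?

Mathematically, the area bounded by the curve p(x), the X-axis and the ordinates at X = a and X = b is given by the definite integral:

$$\int_a^b p(x)\,dx$$

Since p(x) is a probability density function, this area represents the probability

$$P(a < X < b) = \int_a^b p(x)\,dx$$

[Figure: normal curve with the area P(a < X < b) shaded between the ordinates at X = a and X = b, on either side of X = $\mu$.]

Let us now try to compute the areas under the normal probability curve.

$$P(\mu < X < a) = \int_{\mu}^{a} p(x)\,dx$$

is the area under the normal curve enclosed by the x-axis and the ordinates at X = $\mu$ and X = a.

[Figure: normal curve with the area P($\mu$ < X < a) shaded; X = $\mu$ corresponds to Z = 0 and X = a corresponds to Z = Z1.]

Using the standard normal variate $Z = \dfrac{X - \mu}{\sigma}$:

When X = $\mu$, $Z = \dfrac{\mu - \mu}{\sigma} = 0$

When X = a, $Z = \dfrac{a - \mu}{\sigma} = Z_1$ (say)

$$P(\mu < X < a) = P(0 < Z < Z_1) = \int_0^{Z_1} \varphi(Z)\,dZ$$

where $\varphi(Z)$ is the probability density function of the standard normal variate.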

Example

Q. A Sales Tax Officer has reported that the average sales of the 500 businesses that he has to deal with during a year amount to Rs. 36000 with a standard deviation of Rs. 10000. Assuming that the sales in these businesses are normally distributed, find:

(i) The number of businesses, the sales of which are over Rs. 40000.
(ii) The percentage of businesses, the sales of which are likely to range between Rs. 30000 and Rs. 40000.
(iii) The probability that the sales of a business selected at random will be over Rs. 30000.

Proportions of area under the Normal Curve

z       0.25      0.40      0.50      0.60
Area    0.0987    0.1554    0.1915    0.2257

Solution: Let the variable X denote the sales (in Rs.) of the businesses during a year. Then we are given that X ~ N($\mu$, $\sigma^2$), where $\mu$ = 36000 and $\sigma$ = 10000.

(i) The probability that the sales of a business are over Rs. 40000 is given by P(X > 40000).

When X = 40000, $Z = \dfrac{X - \mu}{\sigma} = \dfrac{40000 - 36000}{10000} = 0.4$

P(X > 40000) = P(Z > 0.4)
= 0.5 - P(0 ≤ Z ≤ 0.4)
= 0.5 - 0.1554 = 0.3446

Hence in a group of 500 businesses, the expected number of businesses with annual sales over Rs. 40000 is:

500 * 0.3446 = 172.3, i.e. about 172.

Parts (ii) and (iii) would be discussed in the class.
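The calculation for part (i) can be cross-checked without tables. The sketch below assumes the usual expression of the normal cumulative distribution function through the error function; `normal_cdf` is a hypothetical helper name. The same helper is applied to parts (ii) and (iii) purely as an illustration, not as the official class solution.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma, businesses = 36000, 10000, 500

# (i) Sales over Rs. 40000, and the expected number out of 500 businesses.
p_over_40000 = 1 - normal_cdf(40000, mu, sigma)
expected_count = businesses * p_over_40000

# (ii) Sales between Rs. 30000 and Rs. 40000 (illustrative).
p_between = normal_cdf(40000, mu, sigma) - normal_cdf(30000, mu, sigma)

# (iii) Sales over Rs. 30000 (illustrative).
p_over_30000 = 1 - normal_cdf(30000, mu, sigma)

print(round(p_over_40000, 4), round(expected_count), round(p_between, 4), round(p_over_30000, 4))
```

The results agree with what the small table of areas gives: 0.5 - 0.1554 for (i), 0.2257 + 0.1554 for (ii) and 0.5 + 0.2257 for (iii), up to rounding.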

Questions for self study:

1. The sizes of components produced by a machine are normally distributed. It is required that the size should lie between 15.63 cm and 15.84 cm. It is found that 2.87% of the production is rejected for being oversize and 1.072% of the production is rejected for being undersize. Find the mean and the standard deviation of the distribution of the component sizes.

2. Indicate which brand you will choose and why?

            Mean          Standard deviation
Brand A     16,000 km     2,000 km
Brand B     20,000 km     4,000 km

Assuming a normal distribution, also indicate what percentage of brand B might be expected to run more than 24,000 km.

3. The lifetime of a certain type of battery has a mean of 400 hours and a standard deviation of 50 hours. Assuming normality for the distribution of lifetime, find:
(i) The percentage of batteries which have a lifetime of more than 350 hours.
(ii) The lifetime value above which the best 25 per cent of the batteries will have their life, and
(iii) The proportion of batteries that have a lifetime between 300 hours and 500 hours.

Sampling

The process of sampling involves drawing a sample from a given population and using the sample data to make statistical inferences about the parameters of the population. These inferences may consist of:

- Estimation of a population parameter from the sample information.
- Testing a hypothesis related to a given population parameter in the light of sample data.

Population

Population is the aggregate of items or individuals under study in any statistical investigation.

- Finite Population
- Infinite Population

Sample

A finite subset of the population, selected from it with the objective of studying its characteristics, is known as a sample. The number of units in the sample is known as the sample size.

Parameters

Statistical constants of the population, like mean ($\mu$), variance ($\sigma^2$), correlation coefficient ($\rho$).

Statistics

Statistical constants of the sample, e.g. sample mean ($\bar{x}$), sample variance ($s^2$), correlation coefficient (r).

Need for sampling

(1) Sometimes it is impracticable to examine the entire population.
(2) Destructive tests.
(3) High costs of census.
(4) Facilitate timely results.
(5) More accurate results.

Types of Sampling Techniques

- Random sampling
- Non random sampling

Discussion area:

- Understanding different techniques under Random Sampling, e.g. Simple Random Sampling, Stratified Sampling and Cluster Sampling.
- How to perform Convenience Sampling and Judgement Sampling under Non Random Sampling.

(Problems would be discussed in the class sessions.)

Sampling distribution of a Statistic:

The probability distribution of a statistic is called its sampling distribution, e.g. the probability distribution of $\bar{x}$ (the sample mean) is called the sampling distribution of the mean. It is customary to refer to the standard deviation of the sampling distribution as the standard error of the statistic.

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Sampling distributions other than that of the mean:

- Student's t-distribution
- Snedecor's F-distribution
- Chi-square ($\chi^2$) distribution
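The idea of a sampling distribution and its standard error can be illustrated by simulation. The sketch below rests on assumptions that are not in the manual: a normal population with mean 50 and standard deviation 10, samples of size 25, and 4000 repeated samples. It compares the standard deviation of the simulated sample means with the textbook value sigma / sqrt(n).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters and sampling plan (assumptions).
MU, SIGMA = 50.0, 10.0
SAMPLE_SIZE, NUM_SAMPLES = 25, 4000

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

# Centre and spread of the simulated sampling distribution of the mean.
mean_of_means = statistics.fmean(sample_means)
simulated_se = statistics.pstdev(sample_means)

# Standard error of the mean for a population with known sigma.
theoretical_se = SIGMA / SAMPLE_SIZE ** 0.5

print(round(mean_of_means, 2), round(simulated_se, 2), theoretical_se)
```

The simulated figures should land close to 50 and 2 respectively; they will not match exactly because they depend on the random draws.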

HYPOTHESIS TESTING

A statistical hypothesis is some statement about the population, which may or may not be true. Under hypothesis testing, we have to test the validity of this statement on the basis of the evidence from a random sample.

Null Hypothesis: If we want to test any statement about the population, we set up a null hypothesis which says that the statement is true, e.g. if we want to find out whether the population mean has a specified value, say $\mu_0$, then the hypotheses are set up as:

Null Hypothesis: $H_0: \mu = \mu_0$
Alternative Hypothesis: $H_1: \mu \ne \mu_0$

The null and alternative hypotheses are competing statements about the population. Either the null hypothesis (H0) is true or the alternative hypothesis (H1) is true, but not both. Ideally the hypothesis testing procedure should lead to the acceptance of H0 when H0 is true and the rejection of H0 when H1 is true. Unfortunately, correct conclusions are not always possible. Since hypothesis tests are based on sample information, we must allow for the possibility of errors. The following table illustrates the two kinds of errors that can be made in hypothesis testing:

                    Population Condition
Conclusion          H0 True                H1 True
Accept H0           Correct Conclusion     Type II Error
Reject H0           Type I Error           Correct Conclusion

Although we cannot eliminate the possibility of errors in hypothesis testing, we can consider the probability of their occurrence. Using common statistical notation, we denote the probabilities of making the two errors as follows.

$\alpha$ = the probability of making a Type I error
$\beta$ = the probability of making a Type II error

The maximum allowable probability of making a Type I error ($\alpha$, known as the level of significance) has to be specified before conducting the hypothesis test. Common choices for the level of significance are 0.05 and 0.01.

PROCEDURE FOR HYPOTHESIS TESTING

Step I: Set up the Null Hypothesis, e.g. $H_0: \mu = \mu_0$

Step II: Set up the Alternative Hypothesis, e.g. $H_1: \mu \neq \mu_0$

Step III: Decide the level of significance, e.g. $\alpha$ = 5%

Step IV: Compute the test statistic under the validity of the null hypothesis as:

$Z = \dfrac{\bar{x} - E(\bar{x})}{\text{S.E.}(\bar{x})} \sim N(0, 1)$

where $E(\bar{x}) = \mu$. Z can be defined in the same way as a Standard Normal Variate (S.N.V.) of any statistic.

Step V: Conclusion: We compare the computed value of Z in Step IV with the significant (tabulated) value $Z_\alpha$ at the given level of significance $\alpha$.

If $|Z| < Z_\alpha$, then Z is not significant, i.e. the difference between the statistic and the parameter is just due to sampling fluctuations, and H0 can be accepted. If $|Z| > Z_\alpha$, then Z is significant and H0 is rejected.

Different Tests under Hypothesis Testing:
Z-test
t-test
F-test
$\chi^2$-test
(These tests will be discussed in detail in the class.)

Problems on Hypothesis Testing: Z-test

Example:

A stenographer claims that she can take dictation at the rate of 120 words per minute. Can we reject her claim on the basis of 100 trials in which she demonstrates a mean of 116 words with standard deviation of 15 words? Use 5% level of significance.


Solution: We set up the null hypothesis $H_0$: the stenographer's claim is true, i.e. $\mu = 120$. In other words, there is no significant difference between the population mean $\mu = 120$ and the sample mean $\bar{x} = 116$.

Alternative hypothesis $H_1$: $\mu \neq 120$ (two-tailed)

We are given: n = 100, $\bar{x}$ = 116, s = sample s.d. = 15

Hence, under the null hypothesis $H_0$: $\mu = 120$, the test statistic is:

$Z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}} = \dfrac{116 - 120}{15/\sqrt{100}} = -2.67$

Since |Z| = 2.67, which is greater than 1.96, the value of Z is significant at the 5% level of significance and hence the null hypothesis is rejected. Hence we reject the stenographer's claim.
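The arithmetic of this example can be checked with a short script. This is only a sketch using the Python standard library; the function names are illustrative and not part of the manual, and the p-value is an extra that the manual does not compute.

```python
import math

def z_statistic(sample_mean, mu0, sd, n):
    """Z = (sample mean - hypothesised mean) / (sd / sqrt(n))."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))

def two_tailed_p(z):
    """Two-tailed p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Figures from the stenographer example
z = z_statistic(sample_mean=116, mu0=120, sd=15, n=100)
p = two_tailed_p(z)

# 1.96 is the tabulated two-tailed value of Z at the 5% level
reject_h0 = abs(z) > 1.96
print(round(z, 2), round(p, 4), reject_h0)
```

The same function can be reused for the self-study questions by changing the four arguments.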

Questions for self study

1. A random sample of 100 students gave a mean weight of 50 kgs. with standard deviation of 4 kgs. Test the hypothesis that the mean weight in the population is 60 kgs.

2. It is claimed that a random sample of 100 tyres with a mean life of 15269 kms. is drawn from a population of tyres which has a mean life of 15200 kms. and a standard deviation of 1248 kms. Test the validity of the claim.

t-test

Example 1: Ten cartons are taken at random from an automatic filling machine. The mean net weight of the 10 cartons is 11.8 oz and the standard deviation is 0.15 oz. Does the sample mean differ significantly from the intended weight of 12 oz.? You are given that for degrees of freedom = 9, $t_{0.05}$ = 2.26.

Solution: We are given:

n = 10, $\bar{x}$ = 11.8, s = 0.15

Null hypothesis $H_0$: $\mu = 12$, i.e. the sample mean $\bar{x} = 11.8$ does not differ significantly from the population mean $\mu = 12$.

Alternative hypothesis $H_1$: $\mu \neq 12$

Test statistic: Under $H_0$, the test statistic is

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n-1}} \sim t_{n-1}$

$t = \dfrac{11.8 - 12}{0.15/\sqrt{9}} = -4.0$

The tabulated value of t for 9 d.f. at the 5% level of significance is 2.26. Since the calculated |t| is much greater than the tabulated t, it is highly significant. Hence the null hypothesis is rejected at the 5% level of significance and we conclude that the sample mean differs significantly from the mean $\mu$ = 12 oz.
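A minimal sketch of the same calculation in Python follows. It assumes, as the value -4.0 in the example implies, that the standard error is taken as s/sqrt(n - 1), i.e. that s was computed with divisor n; the function name is illustrative only.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """t = (sample mean - mu0) / (s / sqrt(n - 1)), with n - 1 d.f.

    Assumes s was computed with divisor n, the convention that
    reproduces the worked value in the example.
    """
    return (sample_mean - mu0) / (s / math.sqrt(n - 1))

# Figures from Example 1 (carton weights)
t = t_statistic(sample_mean=11.8, mu0=12, s=0.15, n=10)
t_table = 2.26  # tabulated two-tailed value for 9 d.f. at 5%, as given

reject_h0 = abs(t) > t_table
print(round(t, 2), reject_h0)
```

If s were instead computed with divisor n - 1, the denominator would be s/sqrt(n) and the statistic would change slightly, so it is worth confirming which convention a data source uses.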

Example 2: A machine is designed to produce insulating washers for electrical devices of average thickness of 0.025 cm. A random sample of 10 washers was found to have an average thickness of 0.024 cm with a standard deviation of 0.002 cm. Test the significance of the deviation. Value of t for 9 d.f. at 5% level is 2.262.

Solution: We are given:

n = 10, $\bar{x}$ = 0.024 cm, s = 0.002 cm

Null hypothesis $H_0$: $\mu = 0.025$, i.e. there is no significant deviation between the sample mean $\bar{x} = 0.024$ and the population mean $\mu = 0.025$.

Alternative hypothesis $H_1$: $\mu \neq 0.025$

Under $H_0$, the test statistic is

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n-1}} \sim t_{n-1}$

Now $t = \dfrac{0.024 - 0.025}{0.002/\sqrt{9}} = -1.5$

The tabulated value of t for 9 d.f. = 2.262. Since |t| < 2.262, it is not significant at the 5% level of significance. Hence the deviation ($\bar{x} - \mu$) is not significant. Therefore the null hypothesis is accepted.

Example 3: The mean weekly sales of a chocolate bar in candy stores were 146.3 bars per store. After an advertising campaign, the mean weekly sales in 22 stores for a typical week increased to 153.7 and showed a standard deviation of 17.2. Was the advertising campaign successful?

Solution: We are given: n = 22, $\bar{x}$ = 153.7, s = 17.2

Null hypothesis $H_0$: $\mu = 146.3$, i.e. the difference between $\bar{x}$ and $\mu$ is not significant. In other words, the advertising campaign is not successful.

Alternative hypothesis $H_1$: $\mu > 146.3$ (right tail)

Test statistic: Under the null hypothesis, the test statistic is:

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n-1}} \sim t_{n-1}$

Now $t = \dfrac{153.7 - 146.3}{17.2/\sqrt{21}} = 1.97$

The tabulated value of t for 21 d.f. at the 5% level of significance for a single-tailed test is 1.721. Since the calculated value of t is greater than the tabulated value, it is significant. Hence the advertising campaign was successful in promoting sales.

Example 4: A soap-manufacturing unit was distributing a particular brand of soap through a large number of retail shops. Before a heavy advertisement campaign, the mean sale per week per shop was 140 dozens. After the campaign, a sample of 26 shops was taken and the mean sale was found to be 147 dozens with standard deviation 16. Can you consider the advertisement effective?

Solution: We are given: n = 26, $\bar{x}$ = 147 dozens, s = 16

Null hypothesis $H_0$: $\mu = 140$ dozens, i.e. the deviation between $\bar{x}$ and $\mu$ is just due to fluctuations of sampling. In other words, the advertisement is not effective.

Alternative hypothesis $H_1$: $\mu > 140$ (right tail)

Test statistic: Under the null hypothesis, the test statistic is:

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n-1}} \sim t_{n-1}$

$t = \dfrac{147 - 140}{16/\sqrt{25}} = 2.19$

The tabulated value of t for 25 d.f. at the 5% level of significance for a single right-tailed test is 1.708. This is the value of $t_{0.10}$ for 25 d.f. for a two-tailed test. Since the calculated value of t is greater than the tabulated value, it is significant. Hence the increase in sales cannot be attributed to fluctuations of sampling and we conclude that the advertisement is effective in increasing the sales.
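For the two right-tailed examples (Examples 3 and 4), the decision rule compares t itself, not |t|, with the one-tailed tabulated value. The sketch below illustrates this; it assumes the standard error s/sqrt(n - 1), and the dictionary layout and names are illustrative only.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic with standard error s / sqrt(n - 1)."""
    return (sample_mean - mu0) / (s / math.sqrt(n - 1))

# Figures and one-tailed tabulated values as quoted in the examples
examples = {
    "Example 3": dict(sample_mean=153.7, mu0=146.3, s=17.2, n=22, t_table=1.721),
    "Example 4": dict(sample_mean=147, mu0=140, s=16, n=26, t_table=1.708),
}

results = {}
for name, ex in examples.items():
    t = t_statistic(ex["sample_mean"], ex["mu0"], ex["s"], ex["n"])
    # Right-tailed test: reject H0 only when t exceeds the tabulated value
    results[name] = (t, t > ex["t_table"])
    print(name, round(t, 2), results[name][1])
```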

Chi-square Test

Example: The number of automobile accidents per week in a certain community was as follows:

12, 8, 20, 2, 14, 10, 15, 6, 9, 4

Are these frequencies in agreement with the belief that the accident rate in the community was the same throughout this 10-week period?
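One way to approach this example is a chi-square goodness-of-fit test against equal expected frequencies. The sketch below is not the manual's worked solution; it assumes a 5% level of significance (the example does not state one) and uses the standard tabulated value 16.919 for 9 d.f.

```python
# Observed weekly accident counts from the example
observed = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]

# Under H0 the accidents are spread evenly over the weeks
expected = sum(observed) / len(observed)

# Chi-square statistic: sum of (O - E)^2 / E
chi_square = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

chi_table = 16.919  # tabulated chi-square for 9 d.f. at 5% (assumed level)
reject_h0 = chi_square > chi_table
print(round(chi_square, 2), df, reject_h0)
```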

Some more problems for chi-square test and F-test would be done in the class.

Regression Analysis

The study of the functional relationship between variables is known as regression analysis. Correlation analysis brings out the degree of association between the variables, while the existing cause and effect relationship is explored by regression analysis. The regression equations are useful for predicting the value of the dependent variable for given value(s) of the independent variable(s).

The linear regression model
Ordinary least squares estimation

The linear regression model


In the linear regression model, the dependent variable is assumed to be a linear function of one or more independent variables plus an error introduced to account for all other factors:

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + u_i$

In the above regression equation, $y_i$ is the dependent variable, $x_{i1}, \dots, x_{iK}$ are the independent or explanatory variables, and $u_i$ is the disturbance or error term. The goal of regression analysis is to obtain estimates of the unknown parameters $\beta_1, \dots, \beta_K$, which indicate how a change in one of the independent variables affects the values taken by the dependent variable.

Applications of regression analysis exist in almost every field. In economics, the dependent variable might be a family's consumption expenditure and the independent variables might be the family's income, number of children in the family, and other factors that would affect the family's consumption patterns. In education, the dependent variable might be a student's score on an achievement test and the independent variables characteristics of the student's family, teachers, or school.

Ordinary least squares estimation

The usual method of estimation for the regression model is ordinary least squares (OLS). Let $b_1, \dots, b_K$ denote the OLS estimates of $\beta_1, \dots, \beta_K$. The predicted value of $y_i$ is:

$\hat{y}_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_K x_{iK}$

The error in the OLS prediction of $y_i$, called the residual, is:

$e_i = y_i - \hat{y}_i$

The basic idea of ordinary least squares estimation is to choose the estimates of $\beta_1, \dots, \beta_K$ so as to minimise the sum of squared residuals, i.e.

$\sum_{i=1}^{n} e_i^2$

has to be minimised. Now let us take a simple model of two variables:

$Y_i = \alpha + \beta X_i + u_i$

Now for any set of data related to X and Y, it is possible to specify a line that approximates the mean of the Y for given values of X by using least square technique. By revealing how the mean of the Y changes as the various values of X change, this line is understood to describe the regression of Y on X. The regression line is the predicted value of Y for each value of X.


It is noteworthy that for the same set of related variables there is always a second regression line that describes the regression of X on Y.
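As a sketch of what the two regression lines look like, the least squares formulas for the two-variable case can be written as below. The symbols a, b, a' and b' for the fitted intercepts and slopes are notation assumed here for illustration.

```latex
\begin{aligned}
\text{Regression of } Y \text{ on } X:\quad
  \hat{Y} &= a + bX, &
  b &= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, &
  a &= \bar{Y} - b\bar{X} \\[4pt]
\text{Regression of } X \text{ on } Y:\quad
  \hat{X} &= a' + b'Y, &
  b' &= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (Y_i - \bar{Y})^2}, &
  a' &= \bar{X} - b'\bar{Y}
\end{aligned}
```

Both lines pass through the point of means $(\bar{X}, \bar{Y})$, and the product of the two slopes equals the square of the correlation coefficient, $b \cdot b' = r^2$, so the two lines coincide only when the correlation is perfect.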

Points of Discussion
1. How to fit the regression line in the case of two-variable and three-variable models?
2. How to interpret the regression coefficients?
3. How to find the coefficient of determination?
4. How to check the significance of regression coefficients?
5. How to test the overall significance of the regression?

CASE STUDY-I

NATIONAL HEALTH CARE ASSOCIATION

The National Health Care Association is concerned about the shortage of nurses the health care profession is projecting for the future. To learn the current degree of job satisfaction among the nurses, the association has sponsored a study of hospital nurses throughout the country. As part of this study, a sample of 50 nurses was asked to indicate their degree of satisfaction in their work, their pay, and their opportunities for promotion. Each of the three aspects of satisfaction was measured on a scale from 0 to 100, with larger values indicating higher degrees of satisfaction. The data are shown in the following table:

Work  Pay  Promotion      Work  Pay  Promotion
71    49   58             72    76   37
84    53   63             71    25   74
84    74   37             69    47   16
87    66   49             90    56   23
72    59   79             84    28   62
72    37   86             86    37   59
72    57   40             70    38   54
63    48   78             86    72   72
84    60   29             87    51   57
90    62   66             77    90   51
73    56   55             71    36   55
94    60   52             75    53   92
84    42   66             74    59   82
85    56   64             76    51   54
88    55   52             95    66   52
74    70   51             89    66   62
71    45   68             85    57   67
88    49   42             65    42   68
90    27   67             82    37   54
85    89   46             82    60   56
79    59   41             89    80   64
72    60   45             74    47   63
88    36   47             82    49   91
77    60   75             90    76   70
64    43   61             78    52   72

Use methods of descriptive statistics to summarize the data. Present the summaries that will be beneficial in communicating the results to others. Discuss your findings. Specifically, comment on the following questions.

1. On the basis of the entire data set and the three job-satisfaction variables, what aspect of the job is most satisfying for the nurses? What appears to be the least satisfying? In what area(s), if any, do you feel improvements should be made? Discuss.

2. On the basis of descriptive measures of variability, what measure of job satisfaction appears to generate the greatest difference of opinion among the nurses? Explain.
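A possible starting point for the descriptive summaries is sketched below using Python's standard `statistics` module. The choice of summary measures (mean, sample standard deviation, minimum, maximum) is an assumption; the case leaves the choice of measures to the reader.

```python
from statistics import mean, stdev

# Scores for the 50 nurses, left-hand block of the table then right-hand block
work = [71, 84, 84, 87, 72, 72, 72, 63, 84, 90, 73, 94, 84, 85, 88, 74, 71,
        88, 90, 85, 79, 72, 88, 77, 64,
        72, 71, 69, 90, 84, 86, 70, 86, 87, 77, 71, 75, 74, 76, 95, 89, 85,
        65, 82, 82, 89, 74, 82, 90, 78]
pay = [49, 53, 74, 66, 59, 37, 57, 48, 60, 62, 56, 60, 42, 56, 55, 70, 45,
       49, 27, 89, 59, 60, 36, 60, 43,
       76, 25, 47, 56, 28, 37, 38, 72, 51, 90, 36, 53, 59, 51, 66, 66, 57,
       42, 37, 60, 80, 47, 49, 76, 52]
promotion = [58, 63, 37, 49, 79, 86, 40, 78, 29, 66, 55, 52, 66, 64, 52, 51,
             68, 42, 67, 46, 41, 45, 47, 75, 61,
             37, 74, 16, 23, 62, 59, 54, 72, 57, 51, 55, 92, 82, 54, 52, 62,
             67, 68, 54, 56, 64, 63, 91, 70, 72]

data = {"Work": work, "Pay": pay, "Promotion": promotion}

# Central tendency and dispersion for each satisfaction variable
summary = {
    name: {"mean": mean(v), "sd": stdev(v), "min": min(v), "max": max(v)}
    for name, v in data.items()
}

most_satisfying = max(summary, key=lambda k: summary[k]["mean"])
most_varied = max(summary, key=lambda k: summary[k]["sd"])

for name, s in summary.items():
    print(name, round(s["mean"], 2), round(s["sd"], 2), s["min"], s["max"])
```

The variable with the largest mean speaks to the first question and the one with the largest standard deviation to the second.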

CASE STUDY -II

Quality Associates Inc. is a consulting firm that advises its clients about sampling and statistical procedures that can be used to control their manufacturing processes. In one particular application, a client gave Quality Associates a sample of 800 observations taken during a time in which that client's process was operating satisfactorily. The sample standard deviation for these data was 0.21; hence, the population standard deviation was assumed to be 0.21. Quality Associates then suggested that random samples of size 30 be taken periodically to monitor the process on an ongoing basis. By analyzing the new samples, the client could quickly learn whether the process was operating satisfactorily. When the process was not operating satisfactorily, corrective action could be taken to eliminate the problem. The design specification indicated that the mean for the process should be 12. The hypothesis test suggested by Quality Associates follows.

$H_0: \mu = 12$
$H_1: \mu \neq 12$

Corrective action will be taken any time H0 is rejected.


The following samples were collected at hourly intervals during the first day of operation of the new statistical process control procedure.

Sample 1   Sample 2   Sample 3   Sample 4
11.55      11.62      11.91      12.02
11.62      11.69      11.36      12.02
11.52      11.59      11.75      12.05
11.75      11.82      11.95      12.18
11.90      11.97      12.14      12.11
11.64      11.71      11.72      12.07
11.80      11.87      11.61      12.05
12.03      12.10      11.85      11.64
11.94      12.01      12.16      12.39
11.92      11.99      11.91      11.65
12.13      12.20      12.12      12.11
12.09      12.16      11.61      11.90
11.93      12.00      12.21      12.22
12.21      12.28      11.56      11.88
12.32      12.39      11.95      12.03
11.93      12.00      12.01      12.35
11.85      11.92      12.06      12.09
11.76      11.83      11.76      11.77
12.16      12.23      11.82      12.20
11.77      11.84      12.12      11.79
12.00      12.07      11.60      12.30
12.04      12.11      11.95      12.27
11.98      12.05      11.96      12.29
12.30      12.37      12.22      12.47
12.18      12.25      11.75      12.03
11.97      12.04      11.96      12.17
12.17      12.24      11.95      11.94
11.85      11.92      11.89      11.97
12.30      12.37      11.88      12.23
12.15      12.22      11.93      12.25

Case Questions:

1. Conduct the hypothesis test for each sample at the 0.1 level of significance and determine what action, if any, should be taken. Provide the test statistic for each test.

2. Compute limits for the sample mean $\bar{x}$ around $\mu = 12$ such that, as long as a new sample mean is within those limits, the process will be considered to be operating satisfactorily. If $\bar{x}$ exceeds the upper limit or if $\bar{x}$ is below the lower limit, corrective action will be taken. These limits are referred to as upper and lower control limits for quality control purposes.

3. Discuss the implications of changing the level of significance to a larger value. What mistake or error could increase if that was done?
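The control limits in question 2 follow from the two-tailed Z-test: the process is treated as satisfactory while the sample mean stays within 12 ± z × σ/√n. The sketch below assumes the 0.1 level of significance stated in question 1 and the assumed population standard deviation of 0.21; the function names are illustrative, and `NormalDist` needs Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 12.0     # design specification for the process mean
sigma = 0.21   # assumed population standard deviation
n = 30         # size of each periodic sample
alpha = 0.10   # level of significance stated in question 1

standard_error = sigma / sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

lower = mu0 - z_crit * standard_error
upper = mu0 + z_crit * standard_error

def z_statistic(sample_mean):
    """Test statistic for H0: mu = 12 given one sample mean."""
    return (sample_mean - mu0) / standard_error

def needs_corrective_action(sample_mean):
    """True when the sample mean falls outside the control limits."""
    return not (lower <= sample_mean <= upper)

print(round(lower, 3), round(upper, 3))
```

Raising alpha narrows the limits, which is one way to think about question 3.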

CASE STUDY-III

The increasing use of chemical fertilisers has played a profound role in increasing the productivity of Indian agriculture. To meet the needs of the ever-growing population, India needs to increase the availability of fertilisers to its farmers. Moreover, the constraint of limited land emphasizes the need for optimal use of fertilisers in order to maintain the fertility of the land as well as to enhance its productivity. In order to achieve this objective, the Indian Government introduced the Retention Pricing Scheme (RPS) in 1977 to encourage the production as well as the consumption of fertilisers. This two-tier pricing system involved a price to the farmer controlled at a low level on the one hand, and a fair price to the producer to fully cover the reasonable cost of production, including a reasonable margin of profit, on the other hand.

We can see from the following table that there has been a tremendous increase in fertiliser consumption in the post RPS period.


All India Fertiliser Consumption and NPK Ratio ('000 tonnes)

             Fertiliser Consumption                          NPK ratio
Year         N          P2O5      K2O       N+P2O5+K2O      N      P2O5   K2O
1960-61      211.7      53.1      29.0      293.8           7.3    1.8    1
1970-71      1479.3     541.0     236.3     2256.6          6.3    2.3    1
1980-81      3678.1     1213.6    623.9     5255.4          5.9    1.9    1
1990-91      7997.2     3221.0    1328.0    12546.2         6.0    2.4    1
1991-92      8046.3     3221.2    1360.6    12728.0         5.9    2.4    1
1992-93      8426.8     2843.8    883.9     12154.5         9.5    3.2    1
1993-94      8788.3     2669.3    908.7     12366.3         9.7    2.9    1
1994-95      9507.1     2931.7    1124.8    13563.6         8.5    2.6    1
1995-96      9822.8     2897.5    1155.8    13876.2         8.5    2.5    1
1996-97      10301.8    2976.8    1029.6    14308.1         10.0   2.9    1
1997-98      10905.0    3917.2    1372.6    16194.8         7.9    2.8    1
1998-99      11353.8    4112.2    1331.5    16797.5         8.5    3.1    1
1999-2000    11620.7    4804.1    1704.2    18128.9         6.9    2.9    1

Ideal Consumption Ratio                                      4      2      1

Sources: Economic Survey 2000-2001; Fertiliser Statistics 2000-2001

At the beginning of the 90s, the macro-economic situation had worsened, with the fiscal deficit reaching as high as about 8 percent of GDP during 1990-91. Foreign exchange reserves were barely sufficient to meet about two weeks' imports and the inflation rate was running into double digits. All this led the economy to a stage where India was almost on the verge of defaulting on its external payment obligations. The Government of India was thus forced to approach the International Monetary Fund (IMF) for financial assistance. The IMF in turn laid down stiff conditions including, amongst others, the removal of subsidies meant for providing the necessary support to the farmers. Hence the fertiliser subsidies also came under attack and the P&K sector was decontrolled. This initiative steeply raised the prices of P & K fertilisers, which led to a fall in their consumption. On the other hand, urea continued to remain under RPS and its selling price was reduced by 10%. Though a scheme of concession on the decontrolled fertilisers was announced in June 1993 and the urea price was raised by about 20% w.e.f. June 10, 1994, the NPK ratio could not improve much because of the still prevailing massive difference in the prices of nitrogenous, phosphatic and potassic fertilisers.


Estimation of Fertiliser Consumption in India: The following log-linear model has been taken for estimating the demand for nitrogenous fertiliser in India:

$\ln N_t = \beta_0 + \beta_1 \ln I_t + \beta_2 \ln H_t + \beta_3 \ln P_t + \beta_4 \ln N_{t-1} + \beta_5 \ln Y$

where
$N_t$ = Consumption of the nitrogenous fertiliser in the t-th year.
$I_t$ = Percentage of gross irrigated area to gross cropped area in the t-th year.
$H_t$ = Percentage of area under HYV to gross cropped area in the t-th year.
$P_t$ = Price of the nitrogenous fertiliser in the t-th year.
$N_{t-1}$ = Lagged dependent variable.
Y = GDP of Agriculture (it has been taken as a proxy variable for the income of the farming community).
$\beta_0, \beta_1, \dots, \beta_5$ are the regression parameters to be estimated.

Data on all variables except the income variable have been collected from various issues of Fertiliser Statistics, FAI, New Delhi. The data on the GDP of Agriculture, which has been used as a proxy for the income variable has been obtained from Indian Economic Survey 2000-2001. The study covers the period from 1973-74 to 1999-2000.

TIME       N consumption   Irrigated area (%)   HYV area (%)   Price of N   Y
1973-74    1829            23.7                 15.32801       1050         142988
1974-75    1765.7          25.4                 16.64951       2000         141185
1975-76    2148.6          25.3                 18.61572       1850         169337
1976-77    2456.9          26                   20.05569       1750         150766
1977-78    2913            26.8                 22.60323       1550         165410
1978-79    3419.5          27.6                 22.95740       1450         169248
1979-80    3498.1          29                   22.63295       1550         148663
1980-81    4068.7          28.8                 24.95510       2000         167770
1981-82    4224.2          29.1                 26.30438       2350         177341
1982-83    4242.5          30                   27.49264       2150         177300
1983-84    5204.4          30                   29.92871       2150         193508
1984-85    5486.1          30.9                 30.70379       2150         196353
1985-86    5660.8          30.4                 31.05457       2150         198353
1986-87    5716            31.6                 31.84286       2350         198740
1987-88    5716.8          32.8                 31.68736       2350         196735
1988-89    7251            33.5                 32.97399       2350         227095
1989-90    7385.9          33.9                 33.55790       2350         231389
1990-91    7997.2          33.6                 34.95529       2350         242012
1991-92    8046.3          35.7                 35.51502       3060         239253
1992-93    8426.8          36                   35.23580       2760         252205
1993-94    8788.3          36.6                 35.90235       2760         262059
1994-95    9507.1          37.5                 37.69977       3320         276049
1995-96    9822.8          38                   38.57129       3320         275153
1996-97    10301.8         38.7                 40.32224       3320         299461
1997-98    10905           38.7                 40.09644       3660         295050

The model has been estimated by applying OLS to the above data using the statistical package SHAZAM. The results are as under:

Results for Fertiliser Consumption Model for Nitrogen (1973-74 to 1999-2000)

Parameter                        Estimated Coefficient    T-ratio
ln (irrigated area %)            0.69438                  1.697
ln (HYV area %)                  0.96198                  6.5989
ln (price of N fertiliser)       -0.2007                  -2.9586
ln (lagged N consumption)        0.28184                  2.6268
ln Y                             0.37320                  3.0783
Constant                         -2.4663                  -2.7432

$\bar{R}^2$                      0.9953
D-W statistic                    2.1543
F (5, 21)                        1096.026 (2.66)*

* Tabulated value of the F statistic at the 5% level of significance

Moving along the policy of economic liberalization and reforms, the deregulated regime in all the three types of fertilizers is no doubt the long-term goal. But any contemplated move towards a free market economy has to be gradual so as to prevent adverse impact on consumption of fertilisers and ultimately on the productivity of the foodgrains. Therefore, the government should come out with a stable and clear-cut policy without further loss of time. The policy must aim at protecting the domestic industry as well as the interests of the farmers and it would benefit the community at large.

Questions: (To be discussed in the class)

CASE STUDY-IV

A university is typically required to prepare an operating budget well in advance of actually receiving its revenues and incurring the expenditures. An important source of revenue is student tuition, which is obviously a function of the number of students enrolled. A university was having problems in preparing accurate budgets because past forecasts of enrollment, made each February before the start of the academic year in September, were subject to considerable error. One aspect of the problem was determining the relationship between the number of applications received by February 1 and the number of new students entering the university in the following September. The data tabulated below were collected on September registrations and February 1 applications.

Year    Number of Applications      Number of New Students
        Received by February 1      Enrolled in September
        (Hundreds)                  (Hundreds)
1990    28                          24
1991    26                          20
1992    28                          18
1993    28                          22
1994    36                          32
1995    36                          33
1996    42                          34
1997    46                          34
1998    46                          35
1999    50                          38

a. Given the nature of the forecasting problem, which variable would be the dependent variable and which would be the independent variable?
b. Plot the data.
c. Determine the estimated regression line. Give an economic interpretation of the slope ($\beta$) coefficient.
d. Test the hypothesis that there is no relationship (that is, $\beta = 0$) between the variables.
e. Calculate the coefficient of determination.
f. Perform an analysis of variance on the regression, including an F-test of the overall significance of the results.
g. Suppose 4,200 applications are received by February 1. What is the best estimate, based on the regression model, of the number of new students that will be enrolled in the following September?
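The calculations behind parts c to g can be sketched by hand in Python as a cross-check on the SPSS output. This is only an illustration: it takes enrollment as the dependent variable and applications as the independent variable, and all variable names are assumptions made here.

```python
from math import sqrt

# Data from the table, both in hundreds
applications = [28, 26, 28, 28, 36, 36, 42, 46, 46, 50]  # independent (X)
enrolled = [24, 20, 18, 22, 32, 33, 34, 34, 35, 38]      # dependent (Y)

n = len(applications)
x_bar = sum(applications) / n
y_bar = sum(enrolled) / n

# Sums of squares and cross-products about the means
s_xx = sum((x - x_bar) ** 2 for x in applications)
s_yy = sum((y - y_bar) ** 2 for y in enrolled)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(applications, enrolled))

# Least squares estimates of slope and intercept
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Coefficient of determination
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Analysis of variance: regression and residual sums of squares
ss_regression = slope * s_xy
ss_residual = s_yy - ss_regression
ms_residual = ss_residual / (n - 2)
f_stat = ss_regression / ms_residual       # F with (1, n - 2) d.f.
t_slope = slope / sqrt(ms_residual / s_xx)  # t with n - 2 d.f.

# Forecast for 4,200 applications, i.e. X = 42 hundreds
predicted = intercept + slope * 42

print(round(slope, 4), round(intercept, 4), round(r_squared, 4))
print(round(f_stat, 2), round(t_slope, 2), round(predicted, 2))
```

In the two-variable case the F statistic equals the square of the t statistic for the slope, so either can be compared with its tabulated value.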

Use SPSS for the entire analysis

____________________