qt business statistics-lesson1-2013

Prof. P.K.Suri School of Management,

Delhi Technological University

Quantitative Techniques (Statistics)

Fundamentals

• Statistics is a science. It is a way to get information from data to facilitate decision making or interpretation.

• Examples of Data:

− Daily wholesale prices and arrivals of a particular agricultural produce (say, wheat) in particular markets for the last 3 months;

− Marks of students in QT for the last three years;

− Weather data of a city for the last 30 days;

− House-wise, model-wise ownership of cars in a particular locality, etc.

(Explain associated interpretations/ decisions).

Fundamentals

• Variable: A characteristic of an item or individual

Variable Types:

Categorical variables (qualitative variables): have values that can only be placed in categories, e.g. "Are you married?": Yes/No.

Numerical variables (quantitative variables): have values that represent quantities. They are of two types: discrete and continuous.

• Data: The set of individual values associated with a variable

Fundamentals (contd.)

Data can be:

quantitative or qualitative, in grouped or ungrouped form.

Quantitative data can be subjected to arithmetic operations unlike qualitative data.

The field of statistics deals with measurements (Quantitative or Qualitative).

Four generally used scales of measurement (from weakest to strongest):

To describe values of a categorical variable, we use: Nominal scale and Ordinal scale

To describe values of a numerical variable, we use: Interval scale and Ratio scale

Fundamentals (contd.)

Nominal Scale: Here numbers are used simply as labels for categories. For example, an employee may be (M) Male or (F) Female; even if numbers are assigned to the categories, they are arbitrary. It is the weakest scale because the categories cannot be ranked.

Ordinal Scale: Here, data elements are ordered according to their relative merit. Ex.: A product may be ranked 1, 2, 3 or 4, where 1 denotes the worst quality and 4 the best. The ordinal scale does not tell us how much better one product is than another; it only tells us that it is better. Thus, the ordinal scale is weak in the sense that it is silent about the size of the difference between categories.


• Interval Scale: An ordered scale in which the difference between measurements is a meaningful quantity, but which does not involve a true zero point.

The value 0 is assigned arbitrarily, so we cannot take the ratio of two measurements. But we can take the ratio of two intervals.

Ex.: 7 °C is 2 degrees warmer than 5 °C, and 72 °C is 2 degrees warmer than 70 °C, but the environmental conditions are totally different.

Ex.: Time of day is on an interval scale. We cannot say that 10 AM is twice as late as 5 AM. But we can say that the interval between 0 AM and 10 AM (10 hrs) is twice as long as the interval between 0 AM and 5 AM (5 hrs). This is because 0 AM does not mean the absence of time.


• Ratio Scale: An ordered scale in which the difference between measurements is a meaningful quantity and which involves a true zero point (0 on a ratio scale is an absolute 0). It is the strongest scale.

• If two measurements are in ratio scale, then we can take ratios of those measurements.

Ex. Money is measured in ratio scale. A sum of Rs. 0 means no money and is thus an absolute zero. A sum of Rs. 100 is twice as large as Rs. 50. Other examples are height, weight, volume, area, length.

[Note that in interval scale, the interval between two interval scale measurements is in ratio scale (not the individual observations). ]
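The difference between the interval and ratio scales can be checked with a little arithmetic. The sketch below is only an illustration: the temperature readings and the helper `c_to_f` are assumptions introduced here, not part of the lecture. It shows that the ratio of two temperature readings changes with the unit, while the ratio of two intervals, and the ratio of two sums of money, does not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32


# Ratio of two readings on an interval scale depends on the unit chosen.
t1, t2 = 10.0, 20.0                    # hypothetical readings in deg C
ratio_c = t2 / t1                      # 2.0 in Celsius
ratio_f = c_to_f(t2) / c_to_f(t1)      # 68 / 50 = 1.36 in Fahrenheit

# Ratio of two intervals is the same in either unit.
a, b, c = 0.0, 10.0, 5.0               # hypothetical readings in deg C
interval_ratio_c = (b - a) / (c - a)
interval_ratio_f = (c_to_f(b) - c_to_f(a)) / (c_to_f(c) - c_to_f(a))

# Money is on a ratio scale: the ratio survives a change of unit.
rupees_ratio = 100.0 / 50.0
paise_ratio = (100.0 * 100) / (50.0 * 100)

print(ratio_c, ratio_f)                    # the two ratios differ
print(interval_ratio_c, interval_ratio_f)  # the two ratios agree
print(rupees_ratio, paise_ratio)           # the two ratios agree
```

Because "20 °C is twice 10 °C" stops being true as soon as the readings are converted to Fahrenheit, the statement carries no real meaning; the statement about intervals holds in both units.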


Collecting Data

Data Sources

Primary Data (Data which you collect yourself for doing analysis)

Secondary Data (Data which is collected by someone else and you use for doing analysis)

Sources could be:

− Data distributed by an organization or individual (e.g. Centre for Monitoring Indian Economy: www.cmie.com; CRISIL: www.crisil.com; Nielsen, which provides consumer research data to telecom and mobile media companies)

− The outcomes of a designed experiment

− The responses from a survey

− The results of an observational study

− Data collected by ongoing business activities

Samples and Population

• The distinction between sample and population is very important in statistics.

• A population is the group of all items of interest to an investigator (not necessarily a group of people). It is also called the universe. On the DTU campus, it may be the population of B.Tech. students, the population of MBA students, the population of faculty members, etc. Other examples: the population of weights of cricket bats produced in a factory, the population of cows in a village, etc.

• A descriptive measure of a population is called a parameter, e.g. average weight of bats produced, average milk given by cows in a village.

• A sample is a subset of units selected from a population (sampling units vs sampled units).

• A descriptive measure of a sample is called a statistic, e.g. average weight of a sample of bats, average milk given by sampled cows.


Sampling

• A sample is drawn from a population using a sampling procedure.

Non-Probability Samples:

− Judgment Samples

− Convenience Samples

Probability Samples:

− Simple Random Sampling (SRS) (with or without replacement)

− Stratified Sampling

− Systematic Sampling

− Cluster Sampling, etc.

• The aim is to get a representative sample of the population so that it leads to near accurate inferences about the population parameters.
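The probability sampling procedures listed above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the sampling frame of 100 units, the two strata "A" and "B", and the sample size of 10 are all hypothetical. Cluster sampling is not shown.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 units, the first 60 in stratum "A", the rest in "B".
frame = [{"id": i, "stratum": "A" if i < 60 else "B"} for i in range(100)]
n = 10  # desired sample size

# Simple random sampling without replacement: no unit can be drawn twice.
srs_without = random.sample(frame, n)

# Simple random sampling with replacement: a unit may be drawn more than once.
srs_with = random.choices(frame, k=n)

# Systematic sampling: pick a random start, then take every k-th unit.
k = len(frame) // n
start = random.randrange(k)
systematic = frame[start::k]

# Stratified sampling with proportional allocation: an SRS within each stratum.
stratified = []
for name in ("A", "B"):
    units = [u for u in frame if u["stratum"] == name]
    share = round(n * len(units) / len(frame))  # 6 units from "A", 4 from "B"
    stratified.extend(random.sample(units, share))

print([u["id"] for u in systematic])
```

In every case each unit's chance of selection is known in advance, which is what separates these procedures from judgment and convenience samples.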



Sampling Frame

To be prepared before sampling.

Partial sampling frame may lead to misleading results (e.g. when you exclude a particular group of people).


When do we prefer sampling over census approach of data collection?

• When selecting a sample is less time-consuming than selecting every item of the population

• When selecting a sample is less costly than selecting every item of the population

• When analyzing a sample is less cumbersome than analyzing the entire population

• Data Cleaning: Removing outliers


Statistical Inference

• A conclusion drawn about a population based on the information in a sample from the population is called a statistical inference.

• We use sample statistics to make inference about population parameters.

• Conclusion about a population based on the sample statistics may not always be correct. Therefore, we use measures of reliability while undertaking statistical inference. Two such measures are:

– Confidence level and

– Significance level.



Statistical Inference

• Confidence level is the proportion of times an estimation procedure will be correct. For example, if we use an estimation procedure and produce an estimate that has a confidence level of 95% that would mean – In the long run, estimates based on this estimation procedure will be correct 95% of the time.

• Significance level measures how frequently a conclusion drawn about the population will be wrong in the long run. A 5% significance level means that, in the long run, a conclusion drawn would be wrong 5% of the time.
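The "long run" reading of a confidence level can be illustrated by simulation. The sketch below makes several assumptions that are not in the lecture: a normal population with a hypothetical mean of 50 and a known standard deviation of 10, samples of size 30, and the usual interval "sample mean ± 1.96 σ/√n" for a 95% confidence level. It counts how often the interval contains the true mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

MU, SIGMA = 50.0, 10.0   # hypothetical population mean and standard deviation
N, TRIALS = 30, 2000     # sample size and number of repeated samples
Z95 = 1.96               # standard normal critical value for a 95% interval

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    half_width = Z95 * SIGMA / N ** 0.5   # sigma is treated as known here
    if mean - half_width <= MU <= mean + half_width:
        covered += 1

coverage = covered / TRIALS   # share of intervals that contain the true mean
miss_rate = 1 - coverage      # share of intervals that miss it

print(f"coverage = {coverage:.3f}, miss rate = {miss_rate:.3f}")
```

Under these assumptions the coverage should come out close to 0.95 and the miss rate close to 0.05, which matches the long-run interpretation of the 95% and 5% figures above.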


Sampling

• e.g. a farmer ‘X’ has 1500 sheep. These constitute the entire population of sheep for farmer ‘X’. If 15 sheep are selected from this population, it will form a sample of 15 sheep from the population of 1500 sheep. Further, if these 15 sheep are selected at random, the sample would be a simple random sample.

• Note that Sample and Population are relative to each other. If we consider the entire district with 20,000 sheep, the 1500 sheep with farmer ‘X’ could be one sample of the district population of sheep (though not a random sample of 1500 sheep from the district).
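The sheep example can be sketched in a few lines of Python. The weights below are made-up values generated only for illustration; the lecture gives the counts (1500 and 15) but no measurements. The sketch draws a simple random sample of 15 and compares the sample mean (a statistic) with the population mean (a parameter).

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical weights (kg) for farmer X's population of 1500 sheep.
population = [random.gauss(60, 8) for _ in range(1500)]

# Simple random sample of 15 sheep, drawn without replacement.
sample = random.sample(population, 15)

parameter = statistics.fmean(population)  # population mean: a parameter
statistic = statistics.fmean(sample)      # sample mean: a statistic

print(f"population mean = {parameter:.2f} kg, sample mean = {statistic:.2f} kg")
```

The two means will usually be close but not equal; a different random sample would give a different statistic, while the parameter stays fixed.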


Types of Survey Errors

• The validity of survey results must be examined. We must evaluate the purpose of the survey and for whom it is conducted.

• Inferences based on non-probability samples could be seriously misleading.

• The only way to make valid statistical inferences about a population is by using a probability sample.

• Even surveys based on probability samples are subject to four types of errors:

- Coverage error

- Nonresponse error

- Sampling error

- Measurement error


Types of Survey Errors (contd.)

• Our aim should be to minimize these four errors.

Ex.: Non-response bias, i.e. the bias introduced when we ignore the fact that certain people may not respond to some questions. The bias is introduced when such people belong disproportionately to one segment. E.g. consider the question “Have you ever been arrested?” There may be a poor response to this question from people who have indeed been arrested.
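A small worked calculation shows how non-response can distort the arrest question. All numbers here are hypothetical and chosen only for illustration: a population of 1000 people of whom 100 have been arrested, with 30% of the arrested group and 80% of the others answering the question.

```python
# Hypothetical population: 1000 people, 100 of whom have been arrested.
arrested, not_arrested = 100, 900

# Hypothetical response rates: people who were arrested answer less often.
response_rate_arrested = 0.30
response_rate_not_arrested = 0.80

yes_answers = arrested * response_rate_arrested          # about 30 responses
no_answers = not_arrested * response_rate_not_arrested   # about 720 responses

true_share = arrested / (arrested + not_arrested)          # 0.10
observed_share = yes_answers / (yes_answers + no_answers)  # 30 / 750 = 0.04

print(f"true share = {true_share:.2f}, observed share = {observed_share:.2f}")
```

With these assumed rates the survey would report 4% where the true figure is 10%, simply because the non-respondents are concentrated in one segment.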


Examples: Use of Statistical Inference in Business Situations

• A pharmaceutical manufacturer interested in marketing a new drug may be required to prove that the drug does not cause any side effects. The drug may be tested on a random sample of people and the technique of statistical inference may be used to draw conclusions about the entire population.

•To assess the popularity of its ATMs, a bank may seek opinion of a randomly selected sample of customers. Statistical inference can be used to generalize the conclusions for the entire population of bank’s customers.

•A quality control engineer at a plant making bulbs needs to ensure that not more than 3 % of the bulbs produced are defective. The engineer may periodically collect random samples of bulbs and check their quality. Based on the random samples, the engineer can draw conclusion about the proportion of defective items in the entire population of bulbs.


Descriptive Statistics

Percentiles and Quartiles

Percentiles

• The Pth percentile of a group of numbers is the value below which P% of the numbers in the group lie. The position of the Pth percentile is given by (n+1)P/100, where n is the number of data points.

• Ex: sales made by each of the 20 sales persons of a departmental store are as follows:

• (arranged in ascending order – to be done in case data is not ordered)

6,9,10,12,13,14,14,15,16,16,16,17,17,18,18,19,20,21,22,24.

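50th percentile: position 10.5, i.e. 16

80th percentile: position 16.8, i.e. 19.8

90th percentile: position 18.9, i.e. 21.9

A fractional position is resolved by linear interpolation between the two neighbouring values, e.g. position 16.8 lies 0.8 of the way from the 16th value (19) to the 17th value (20), giving 19.8.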
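The position rule (n+1)P/100 can be turned into a short Python function. This is a sketch under stated assumptions: the function name `percentile` is introduced here, fractional positions are resolved by linear interpolation between neighbouring values, and positions that fall outside the data are clamped to the smallest or largest value (the slide does not say how to treat that case).

```python
def percentile(sorted_values, p):
    """Value at position (n + 1) * p / 100, interpolating between neighbours."""
    n = len(sorted_values)
    position = (n + 1) * p / 100   # 1-based position in the ordered data
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:                  # assumption: clamp to the smallest value
        return sorted_values[0]
    if lower >= n:                 # assumption: clamp to the largest value
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


# Sales data from the example, sorted in ascending order first.
sales = sorted([6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
                16, 17, 17, 18, 18, 19, 20, 21, 22, 24])

for p in (50, 80, 90):
    print(p, round(percentile(sales, p), 2))   # 16.0, 19.8 and 21.9
```

The three results agree with the values worked out by hand for the 20 salespersons.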


Descriptive Statistics

Quartiles

• Quartiles are special percentiles which break the distribution of data into four groups.

• The first quartile is the 25th percentile. It is the point below which lie one fourth of data. Also called lower quartile.

• The second quartile is the 50th percentile. It is the point below which lie one half of data (also called median). Also called middle quartile.

• The third quartile is the 75th percentile. It is the point below which lie 75 % of data. Also called upper quartile.

• The difference between third and first quartile is called interquartile range. It is a measure of spread of data.

Exercise: Interquartile range for above example is 18.75 – 13.25 = 5.5.
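The exercise can be checked with Python's standard `statistics` module. As far as the standard library documents it, `statistics.quantiles` with `method="exclusive"` places the quartiles at the same (n+1)P/100 positions used in these slides; other software may use a different convention and give slightly different quartiles.

```python
import statistics

# Sales made by the 20 salespersons, in ascending order.
sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(sales, n=4, method="exclusive")
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(q1, q2, q3, iqr)   # 13.25 16.0 18.75 5.5
```

The second quartile equals the median (16), and the interquartile range matches the 5.5 stated in the exercise.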


Descriptive Statistics

Measures of Central Tendency

Common measures of central tendency (centre of data): mean, median, mode.

• Mean or Arithmetic Mean or Average:

– Strengths

– Limitations

(Sample Mean, Population Mean)

• Median

– Strengths

– Limitations

• Mode

– Strengths

– Limitations
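The three measures can be computed with Python's standard `statistics` module. The sketch below reuses the sales data of the 20 salespersons from the percentile example. The extra value of 200 is a hypothetical extreme observation, added only to illustrate how differently the mean and the median react to it.

```python
import statistics

# Sales made by the 20 salespersons, in ascending order.
sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

mean = statistics.mean(sales)      # 317 / 20 = 15.85
median = statistics.median(sales)  # average of the 10th and 11th values = 16
mode = statistics.mode(sales)      # most frequent value = 16

# Hypothetical extreme value appended to the data.
with_outlier = sales + [200]
mean_out = statistics.mean(with_outlier)      # 517 / 21, roughly 24.62
median_out = statistics.median(with_outlier)  # 11th of 21 values = 16

print(mean, median, mode)
print(round(mean_out, 2), median_out)
```

A single extreme value pulls the mean from 15.85 to about 24.62, while the median stays at 16.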

