cfa level 1 quantitative analysis e book - part 4(1)

8/13/2019 CFA Level 1 Quantitative Analysis E Book - Part 4(1)


CFA Quantitative Analysis E-Book 4 of 8

www.educorporatebridge.com All Rights Reserved. Corporate Bridge TM

Quantitative Analysis E-Book

Part 4 of 8




Sampling and Estimation




1. Introduction.

In investment analysis, it is often impossible to study every member of the population. Even if analysts could examine the entire population, it may not be economically efficient to do so. Sampling is the process of obtaining a sample. A simple random sample is a sample obtained in such a way that each element of the population has an equal probability of being selected. The selection of any one element has no impact on the chance of selecting another element.

A sample is random if the method for obtaining the sample meets the criterion of randomness (each element having an equal chance at each draw). The word 'simple' tells you that the process is not difficult, and the word 'random' tells you that you don't know in advance which observations will be selected in the sample. The actual composition of the sample itself does not determine whether or not it is a random sample.

Example

Suppose that a company has 30 directors, and you wish to choose 10 of them to serve on a committee. You could place the names of the 30 directors on separate pieces of paper and draw them out one by one until you have drawn a sample of size 10.

Note that the conditions for simple random sampling have been satisfied, in that every one of the 30 directors has an equal (non-zero) chance of being selected in the sample.




In this example, it makes no sense to sample with replacement, as this would mean that once you have drawn a name, that name goes back into the hat (i.e. it is replaced) and can be drawn again. If the same person's name is drawn more than once, then you won't end up with a committee of 10 different people after drawing 10 names, so this experiment should be done without replacement.
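The committee example can be sketched in a few lines of Python. This is only an illustration: the director names and the fixed seed are assumptions made so the sketch is repeatable, and the standard-library `random` module stands in for the hat.

```python
import random

# Hypothetical names standing in for the 30 directors.
directors = [f"Director {i}" for i in range(1, 31)]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# sample() draws WITHOUT replacement: every director has the same chance
# of being chosen, and nobody can be chosen twice.
committee = rng.sample(directors, 10)

# choices() draws WITH replacement: a name can come up more than once,
# so the number of distinct people may fall below 10.
with_replacement = rng.choices(directors, k=10)

print(committee)
print(len(set(with_replacement)), "distinct names when drawing with replacement")
```

Drawing without replacement always yields 10 different directors, which is why it is the natural choice for this example.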

A biased sample is one in which the method used to create the sample results in samples that are systematically different from the population. For instance, consider a research project on attitudes toward cricket. Collecting the data by publishing a questionnaire in a magazine and asking people to fill it out and send it in would produce a biased sample. People interested enough to spend their time and energy filling out and sending in the questionnaire are likely to have different attitudes toward cricket than those not taking the time to fill out the questionnaire.

It is important to realize that it is the method used to create the sample, not the actual makeup of the sample itself, that defines the bias. A random sample that is very different from the population is not biased: it is by definition not systematically different from the population. It is randomly different.




SAMPLING ERROR

The sample taken from a population is used to infer conclusions about the population. However, it's unlikely that the sample statistic will be identical to the population parameter. Suppose there is a class of 100 students, and a sample of size 10 from that class is chosen. If by chance the sample consists mostly of the brightest students, it gives a misleading picture of the population, because the sample mean $\bar{x}$ will be much higher than the population mean. Equally, a sample comprising mainly weaker students could be chosen, and then the opposite would apply. The ideal is a sample that comprises a few bright students, a few weaker students, and mainly average students, as this gives a good idea of the composition of the population. However, because you cannot control which items go into the sample, you are dependent to some degree on chance as to whether the results are favorable or not.

Sampling error (also called error of estimation) is the difference between the observed value of a statistic

and the quantity it is intended to estimate. For example, sampling error of the mean equals sample mean

minus population mean.




Sampling error can apply to statistics such as the mean, the variance, the standard deviation or any other values that can be obtained from the sample. The sampling error varies from sample to sample. A good estimator is one whose sampling error distribution is highly concentrated about the population parameter value.

Sampling error of the mean would be: sample mean - population mean = $\bar{x} - \mu$.

Sampling error of the standard deviation would be: sample standard deviation - population standard deviation = $s - \sigma$.




Sampling distribution

A sample statistic is itself a random variable, which varies depending upon the composition of the sample. It therefore has a probability distribution. The sampling distribution of a statistic is the distribution of all the distinct possible values that the statistic can assume when computed from samples of the same size randomly drawn from the same population. The most commonly used sample statistics include the mean, variance and standard deviation.

If you compute the mean of a sample of 10 numbers, the value you obtain will not equal the population mean exactly; by chance it will be a little bit higher or a little bit lower. If you sampled sets of 10 numbers over and over again (computing the mean for each set), you would find that some sample means come much closer to the population mean than others. Some would be higher than the population mean and some would be lower. Imagine sampling 10 numbers and computing the mean over and over again, say about 1,000 times, and then constructing a relative frequency distribution of those 1,000 means. This distribution of means is a very good approximation to the sampling distribution of the mean. The sampling distribution of the mean is a theoretical distribution that is approached as the number of samples in the relative frequency distribution increases. With 1,000 samples, the relative frequency distribution is quite close; with 10,000 it is even closer. As the number of samples approaches infinity, the relative frequency distribution approaches the sampling distribution.
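The thought experiment of drawing 10 numbers about 1,000 times can be run as a small simulation. The sketch below is illustrative only: the population (10,000 values drawn uniformly between 0 and 100) and the seed are assumptions, not data from this text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population, assumed purely for illustration.
population = [rng.uniform(0, 100) for _ in range(10_000)]
population_mean = statistics.fmean(population)

def sample_means(num_samples, sample_size=10):
    """Draw num_samples simple random samples and return the mean of each."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

means = sample_means(1_000)
mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)

print("population mean:     ", round(population_mean, 2))
print("mean of sample means:", round(mean_of_means, 2))
print("std dev of the means:", round(spread_of_means, 2))
```

A histogram of `means` approximates the sampling distribution of the mean for samples of size 10. Raising `num_samples` makes that approximation smoother, while changing `sample_size` produces a different sampling distribution altogether.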




The sampling distribution of the mean for a sample size of 10 is just an example; there is a different sampling distribution for other sample sizes. Also, keep in mind that the relative frequency distribution approaches a sampling distribution as the number of samples increases, not as the sample size increases, since there is a different sampling distribution for each sample size.

A sampling distribution can also be defined as the relative frequency distribution that would be obtained if all possible samples of a particular sample size were taken. For example, the sampling distribution of the mean for a sample size of 10 would be constructed by computing the mean for each of the possible ways in which 10 scores could be sampled from the population and creating a relative frequency distribution of these means. Although these two definitions may seem different, they are actually the same: both procedures produce exactly the same sampling distribution.

Statistics other than the mean have sampling distributions too. The sampling distribution of the median is the distribution that would result if the median instead of the mean were computed in each sample.

Sampling distributions are very important since almost all inferential statistics are based on sampling distributions.




Simple random vs. stratified random sampling

In stratified random sampling, the population is subdivided into subpopulations (strata) based on one or more classification criteria. Simple random samples are then drawn from each stratum, with sample sizes proportional to the relative size of each stratum in the population. These samples are then pooled.

It is important to note that the size of the data in each stratum does not have to be the same or even similar,

and frequently isn't.

Stratified random sampling guarantees that population subdivisions of interest are represented in the

sample. The estimates of parameters produced from stratified sampling have greater precision (i.e. smaller

variance or dispersion) than estimates obtained from simple random sampling.




For example, investors may want to fully duplicate a bond index by owning all the bonds in the index in proportion to their market value weights. This is known as pure bond indexing. However, it's difficult and costly to implement because a bond index typically consists of thousands of issues. If simple random sampling is used, the sample selected may not accurately reflect the risk factors of the index. Stratified random sampling can be used to replicate the bond index.

Divide the population of index bonds into groups with similar risk factors (e.g. issuer, duration/maturity,

coupon rate, credit rating, call exposure, etc.). Each group is called a stratum or cell.

Select a sample from each cell proportional to the relative market weighting of the cell in the index.

A stratified sample will ensure that at least one issue in each cell is included in the sample.
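The cell-by-cell procedure can be sketched as follows. This is a simplified illustration, not an indexing tool: the bond list, the use of credit rating as the only stratifying factor, and the helper name `stratified_sample` are all assumptions, and each cell is weighted by its number of issues rather than by market value.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, total_size, rng):
    """Proportional stratified sample with at least one item per stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        # Allocation proportional to the stratum's share of the population,
        # rounded, but never below one so that every cell is represented.
        k = max(1, round(total_size * len(members) / len(items)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical index: 60 AAA issues, 30 BBB issues and 10 CCC issues.
bonds = (
    [(f"AAA-{i}", "AAA") for i in range(60)]
    + [(f"BBB-{i}", "BBB") for i in range(30)]
    + [(f"CCC-{i}", "CCC") for i in range(10)]
)

sample = stratified_sample(bonds, lambda bond: bond[1], 10, random.Random(1))
counts = {
    rating: sum(1 for _, r in sample if r == rating)
    for rating in ("AAA", "BBB", "CCC")
}
print(counts)
```

Because of rounding and the one-per-cell floor, the pooled sample can be slightly larger or smaller than `total_size` when the stratum shares do not divide evenly.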




Time-series and cross-sectional data.

Data come in many different shapes and sizes, and measure many different things at different times. Often

financial analysts are interested in particular types of data such as time-series data or cross-sectional data.

Time-series data are a set of observations collected at usually discrete and equally spaced time intervals. For example, the daily closing price of a certain stock recorded over the last six weeks is time-series data. Note that a time period that is too long or too short may lead to time-period bias. Refer to subject g for details.

Other examples of time-series would be staff numbers at a particular institution taken on a monthly basis in

order to assess staff turnover rates, weekly sales figures of ice-cream sold during a holiday period at a

seaside resort and the number of students registered for a particular course on a yearly basis. All of the

above would be used to forecast likely data patterns in the future.




Cross-sectional data are observations that come from different individuals or groups at a single point in time. For example, the closing prices of a group of 20 different tech stocks on December 15, 1986 would be cross-sectional data. Note that the underlying population should consist of members with similar characteristics. For example, suppose you are interested in how much companies spend on research and development (R&D). Firms in some industries, such as retail, spend little on R&D, while firms in industries such as technology spend heavily on it. Therefore, it's inappropriate to summarize R&D data across all companies. Rather, analysts should summarize R&D data by industry, and then analyze the data in each industry group.

Other examples of cross-sectional data would be an inventory of all ice creams in stock at a particular store, or a list of grades obtained by a class of students for a specific test.




2. The Central Limit Theorem.

The central limit theorem states that given a distribution with a mean $\mu$ and a finite variance $\sigma^2$, the sampling distribution of the mean $\bar{x}$ approaches a normal distribution with a mean $\mu$ and a variance $\sigma^2/N$ as N, the sample size, increases.

The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, the sampling distribution of $\bar{x}$ approaches a normal distribution.

If the original variable X has a normal distribution, then $\bar{x}$ will be normal regardless of the sample size.

If the original variable X does not have a normal distribution, then $\bar{x}$ will be approximately normal when the sample is large; $N \geq 30$ is the usual rule of thumb. This is called a distribution-free result: no matter what distribution X has, $\bar{x}$ will still be approximately normal for sufficiently large N.

Keep in mind that N is the sample size for each mean and not the number of samples. Remember that in a sampling distribution the number of samples is assumed to be infinite. The sample size is the number of scores in each sample; it is the number of scores that go into the computation of each mean.




Two things should be noted about the effect of increasing N:

1. The distributions become more and more normal.

2. The spread of the distributions decreases.

Based on the central limit theorem, when the sample size is large, you can:

1. Use the sample mean to infer the population mean.

2. Construct confidence intervals for the population mean based on the normal distribution.

Note that the central limit theorem does not require the underlying population to be normally distributed. Therefore, the central limit theorem can be applied to a population with any probability distribution that has a finite variance.
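Both effects of increasing N can be checked with a short simulation. The sketch below assumes an exponential population with mean 1 and variance 1, chosen only because it is clearly non-normal; the seed and the number of repetitions are likewise illustrative.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution (mean 1, variance 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

results = {}
for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(5_000)]
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(
        f"N={n:>3}  mean of means={results[n][0]:.3f}  "
        f"std dev of means={results[n][1]:.3f}  "
        f"theory sigma/sqrt(N)={1 / n ** 0.5:.3f}"
    )
```

The standard deviation of the sample means should track $\sigma/\sqrt{N}$ and shrink as N grows, while the mean of the sample means stays near the population mean of 1. Plotting `means` for each N would also show the skew of the exponential population fading as N increases.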




3. Standard Error of the Sample Mean.

The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. Standard errors are important because they reflect how much sampling fluctuation a statistic will show. The inferential statistics involved in the construction of confidence intervals and significance testing are based on standard errors. The standard error of a statistic depends on the sample size. In general, the larger the sample size, the smaller the standard error. The standard error of a statistic is usually designated by the Greek letter sigma ($\sigma$) with a subscript indicating the statistic.

The standard error of the mean is designated as $\sigma_m$. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is $\sigma_m = \sigma / \sqrt{N}$, where $\sigma$ is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon). This formula does not assume a normal distribution. However, many of the uses of the formula do assume a normal distribution. The formula shows that the larger the sample size, the smaller the standard error of the mean. More specifically, the size of the standard error of the mean is inversely proportional to the square root of the sample size.




Example 2

Suppose that the mean grade of students in a class is unknown, but a sample of 30 students is taken from

the class, and the mean from the sample is found to be 60%, with a standard deviation of 9%. Calculate the

standard error of the sample mean, and interpret your results.

Now, $\mu$ and $\sigma$ are unknown, but m is given as 60 and s is given as 9. Since n = 30, you can estimate the standard error of the sample mean as $s_m = 9/\sqrt{30} = 1.6432$.

This means that if you took all possible samples of size 30 from the class, the means of those samples would have an estimated standard deviation of about 1.6432 percentage points.

It is important to note that when you have $\sigma$, you must use it; but when you don't, you use its sample equivalent s.
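The arithmetic of Example 2 can be checked with a one-line function. The function name `standard_error` is an illustrative choice; the inputs 9 and 30 come from the example.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the sample mean: std_dev / sqrt(n)."""
    return std_dev / math.sqrt(n)

se = standard_error(9, 30)
print(round(se, 4))  # 1.6432

# Quadrupling the sample size halves the standard error.
se_larger_sample = standard_error(9, 120)
print(round(se_larger_sample, 4))
```

The second call illustrates the inverse square-root relationship: a sample four times as large is needed to cut the standard error in half.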


18/40

CFAQ

uantitativeAna

lysisE-Book4of8


4. Estimators.

Very often, there are a number of different estimators that can be used to estimate unknown population parameters. When faced with such a choice, it is desirable to know that the estimator chosen is the "best" under the circumstances, that is, it has more desirable properties than any of the other options available to us. There are three desirable properties of estimators:

1. Unbiasedness. An estimator is unbiased if its expected value (the mean of its sampling distribution) equals the parameter it is intended to estimate. For example, the sample mean is an unbiased estimator of the population mean, because the expected value of the sample mean is equal to the population mean.

2. Efficiency. An estimator is efficient if no other unbiased estimator of the same parameter has a sampling distribution with smaller variance. That is, in repeated samples, analysts expect the estimates from an efficient estimator to be more tightly grouped around the mean than estimates from other unbiased estimators. For example, the sample mean is an efficient estimator of the population mean, and the sample variance is an efficient estimator of the population variance.

3. Consistency. A consistent estimator is one for which the probability of accurate estimates (estimates close to the value of the population parameter) increases as sample size increases. In other words, a consistent estimator's sampling distribution becomes concentrated on the value of the parameter it is intended to estimate as the sample size approaches infinity. For example, as the sample size increases to infinity, the standard error of the sample mean declines to 0, and the sampling distribution concentrates around the population mean. Therefore, the sample mean is a consistent estimator of the population mean.




A single estimate of an unknown population parameter is called a point estimate; when it is calculated as a sample mean, it is a point estimate of the mean. The formula used to compute the point estimate is called an estimator. The specific value calculated from sample observations using an estimator is called an estimate. For example, the sample mean is an estimator of the population mean. Suppose two samples are taken from a population, and the sample means are 16 and 21 respectively. Then 16 and 21 are two estimates of the population mean. Note that an estimator will yield different estimates as repeated samples are taken from the same population.

A confidence interval is an interval for which one can assert with a given probability $1 - \alpha$, called the degree of confidence, that it will contain the parameter it is intended to estimate. This interval is often referred to as the $100(1 - \alpha)\%$ confidence interval for the parameter, where $\alpha$ is referred to as the level of significance. The end points of a confidence interval are called the lower and upper confidence limits.

For example, suppose that a 95% confidence interval for the population mean is 20 to 40. This means that:

One can be 95% confident that the population mean lies in the range of 20 to 40 (if intervals were built this way from many repeated samples, about 95% of them would contain the population mean);

"95%" is the degree of confidence; "5%" is the level of significance;

20 and 40 are the lower and upper confidence limits, respectively.




5. Confidence Intervals for the Population Mean.

Confidence intervals are typically constructed by using the following structure:

Confidence Interval = Point Estimate ± Reliability Factor x Standard Error

Point estimate is the value of a sample statistic of the population parameter.

Reliability factor is a number based on the sampling distribution of the point estimate and the degree of confidence ($1 - \alpha$).

Standard error refers to the standard error of the sample statistic that is used to produce the point estimate.

Whatever the distribution of the population, the sample mean is always the point estimate used to construct

the confidence intervals for the population mean. The reliability factor and the standard error, however,

may vary depending on three factors:

1. Distribution of population: normal or non-normal.

2. Population variance: known or unknown.

3. Sample size: large or small.




z-Statistic: a standard normal random variable

If a population is normally distributed with a known variance, the z-statistic is used as the reliability factor to construct confidence intervals for the population mean.

In practice, the population standard deviation is rarely known. However, learning how to compute a confidence interval when the standard deviation is known is an excellent introduction to how to compute a confidence interval when the standard deviation has to be estimated.

Three values are used to construct a confidence interval for $\mu$:

1. The sample mean (m);

2. The value of z (which depends on the level of confidence), and

3. The standard error of the mean ($\sigma_m$).

The confidence interval has m for its center and extends a distance equal to the product of z and $\sigma_m$ in both directions. Therefore, the formula for a confidence interval is:

$m - z \sigma_m \leq \mu \leq m + z \sigma_m$




For a $100(1 - \alpha)\%$ confidence interval for the population mean, the z-statistic to be used is $z_{\alpha/2}$. $z_{\alpha/2}$ denotes the point of the standard normal distribution such that $\alpha/2$ of the probability falls in the right-hand tail.

Effectively, what is happening is that the $100(1 - \alpha)\%$ of the area that makes up the confidence interval falls in the center of the graph, that is, symmetrically around the mean. This leaves $\alpha$ of the area in the two tails combined, or $\alpha/2$ of the area in each tail.

Commonly used reliability factors are as follows:

90% confidence intervals: $z_{0.05} = 1.645$. $\alpha$ is 10%, with 5% in each tail.

95% confidence intervals: $z_{0.025} = 1.96$. $\alpha$ is 5%, with 2.5% in each tail.

99% confidence intervals: $z_{0.005} = 2.575$. $\alpha$ is 1%, with 0.5% in each tail.
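These reliability factors can be reproduced without a z table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_reliability_factor` is an illustrative choice.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """Return z_(alpha/2) for a two-sided interval with the given confidence."""
    alpha = 1 - confidence
    # inv_cdf gives the point with the requested area to its LEFT, so the
    # right-hand tail beyond it holds alpha / 2 of the probability.
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%}: z = {z_reliability_factor(confidence):.3f}")
```

To three decimals the 99% factor is 2.576; printed tables often show it as 2.575 or 2.58, and the difference is immaterial in practice.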




Example

Assume that the standard deviation of SAT verbal scores in a school system is known to be 100. A

researcher wishes to estimate the mean SAT score and compute a 95% confidence interval from a random

sample of 10 scores.

The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780. Therefore, m = 530, N = 10, and $\sigma_m = 100/\sqrt{10} = 31.62$. The value of z for the 95% confidence interval is the number of standard deviations one must go from the mean (in both directions) to contain .95 of the area under the normal curve.

It turns out that one must go 1.96 standard deviations from the mean in both directions to contain .95 of the area. The value of 1.96 was found using a z table. Since each tail is to contain .025 of the area, you find the value of z below which 1 - 0.025 = 0.975 of the area lies. This value is 1.96.




All the components of the confidence interval are now known: m = 530, $\sigma_m = 31.62$, z = 1.96.

Lower limit = 530 - (1.96)(31.62) = 468.02

Upper limit = 530 + (1.96)(31.62) = 591.98

Therefore, $468.02 \leq \mu \leq 591.98$. This means that the experimenter can be 95% certain that the mean SAT in the school system is between 468 and 592. This also means that if the experimenter repeatedly took samples from the population and calculated a number of different 95% confidence intervals using the sample information, on average 95% of those intervals would contain $\mu$. Notice that this is a rather large range of scores. Naturally, if a larger sample size had been used, the range of scores would have been smaller.




The computation of the 99% confidence interval is exactly the same except that 2.576 (commonly rounded to 2.58) rather than 1.96 is used for z. The 99% confidence interval is: $448.54 \leq \mu \leq 611.46$. As it must be, the 99% confidence interval is even wider than the 95% confidence interval.

Summary of Computations

1. Compute $m = \sum X / N$.

2. Compute $\sigma_m = \sigma / \sqrt{N}$

3. Find z (1.96 for 95% interval; 2.58 for 99% interval)

4. Lower limit = $m - z \sigma_m$

5. Upper limit = $m + z \sigma_m$

6. Lower limit $\leq \mu \leq$ Upper limit

Assumptions:

1. Normal distribution

2. $\sigma$ is known

3. Scores are sampled randomly and are independent
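The summary of computations can be followed step by step in code. The sketch below reuses the SAT example (ten scores, $\sigma$ = 100 assumed known, z = 1.96); the variable names are illustrative.

```python
import math
from statistics import fmean

scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]
sigma = 100  # population standard deviation, assumed known as in the example
z = 1.96     # reliability factor for a 95% confidence interval

m = fmean(scores)                         # step 1: sample mean
sigma_m = sigma / math.sqrt(len(scores))  # step 2: standard error of the mean
lower = m - z * sigma_m                   # step 4: lower limit
upper = m + z * sigma_m                   # step 5: upper limit

print(f"m = {m:.0f}, sigma_m = {sigma_m:.2f}")
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

Swapping in a different `z` is all that is needed to obtain an interval at another degree of confidence.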




There are three other points worth mentioning here:

The point estimate will always lie exactly at the midway mark of the confidence interval. This is because it is the "best" estimate for $\mu$, and so the confidence interval expands out from it in both directions.

The higher the percentage of confidence, the wider the interval will be. This is because as the percentage is

increased, a wider interval is needed to give us a greater chance of capturing the unknown population value

within that interval.

The width of the confidence interval is always twice the part after the positive or negative sign, that is,

twice the reliability factor x standard error. The width is simply the upper limit minus the lower limit.

It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both $\mu$ and $\sigma$.




STUDENT'S T-DISTRIBUTION

When $\sigma$ is known, the formula $m - z \sigma_m \leq \mu \leq m + z \sigma_m$ is used for a confidence interval. When $\sigma$ is not known, $s_m = s/\sqrt{N}$ (N is the sample size) is used as an estimate of $\sigma_m$. Whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. The values of t are larger than the values of z, so confidence intervals when $\sigma$ is estimated are wider than confidence intervals when $\sigma$ is known. The formula for a confidence interval for $\mu$ when $\sigma$ is estimated is:

$m - t s_m \leq \mu \leq m + t s_m$

where m is the sample mean, $s_m$ is an estimate of $\sigma_m$, and t depends on the degrees of freedom and the level of confidence.




The t-distribution is a symmetrical probability distribution defined by a single parameter known as degrees of freedom (df). Each value for the number of degrees of freedom defines one distribution in this family of distributions. Like a standard normal distribution (i.e. a z-distribution), the t-distribution is symmetrical around its mean. Unlike a standard normal distribution, the t-distribution has the following characteristics.

It is an estimated standardized normal distribution. When n gets larger, t approximates z (s approaches $\sigma$).

The mean is 0, and the distribution is bell shaped.

There is not one t-distribution, but a family of t-distributions. All t-distributions have the same mean of 0. Standard deviations of these t-distributions differ according to the sample size, n.

The shape depends on degrees of freedom (n - 1). The t-distribution is less peaked than a standard normal distribution, and has fatter tails (i.e. more probability in the tails).

$t_{\alpha/2}$ is greater than $z_{\alpha/2}$ for a given level of significance, $\alpha$.

Its variance is $v/(v-2)$ (for $v > 2$), where $v = n - 1$. It is always bigger than 1. As v increases, the variance approaches 1.




The value of t can be determined from a t table. The degrees of freedom for t are equal to the degrees of freedom for the estimate of $\sigma_m$, which is equal to N - 1.




A portion of the t-table is presented below:

Level of significance ($\alpha$) for two-tailed test

df      0.20    0.10    0.05     0.02     0.01
1       3.078   6.314   12.706   31.821   63.657
2       1.886   2.920   4.303    6.965    9.925
...
29      1.311   1.699   2.045    2.462    2.756
30      1.310   1.697   2.042    2.457    2.750

Suppose the sample size (n) is 30, and the level of significance ($\alpha$) is 5%. Then df = n - 1 = 29 and $t_{\alpha/2} = t_{0.025} = 2.045$ (find the 29 df row, and then move to the 0.05 column).




Example

Assume a researcher is interested in estimating the mean reading speed (number of words per minute) of

high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken and

the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data,

m = 366.6667

$s_m$ = 60.9736

df = 6 - 1 = 5

t = 2.571

Therefore, the lower limit is $m - t s_m = 209.904$ and the upper limit is $m + t s_m = 523.430$.

Therefore, the 95% confidence interval is: $209.904 \leq \mu \leq 523.430$

Thus, the researcher can be 95% sure that the mean reading speed of high-school graduates is between 209.904 and 523.430.
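The reading-speed figures can be verified with the standard library. In the sketch below the t value of 2.571 is taken from the example (df = 5, two-tailed 5% significance) rather than computed, because the Python standard library has no t-distribution; the variable names are illustrative.

```python
import math
import statistics

speeds = [200, 240, 300, 410, 450, 600]
t = 2.571  # t-table value for df = 5 at the 95% level, as given in the example

m = statistics.fmean(speeds)
s = statistics.stdev(speeds)       # sample standard deviation (divides by N - 1)
s_m = s / math.sqrt(len(speeds))   # estimated standard error of the mean
df = len(speeds) - 1

lower = m - t * s_m
upper = m + t * s_m

print(f"m = {m:.4f}, s_m = {s_m:.4f}, df = {df}")
print(f"{lower:.3f} <= mu <= {upper:.3f}")
```

With only six observations the interval is wide, which reflects both the small sample and the larger reliability factor that t requires compared with z.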




Summary of Computations

1. Compute $m = \sum X / N$.

2. Compute s

3. Compute $s_m = s/\sqrt{N}$

4. Compute df = N - 1

5. Find t for these df using a t table

6. Lower limit = $m - t s_m$

7. Upper limit = $m + t s_m$

8. Lower limit $\leq \mu \leq$ Upper limit

Assumptions:

1. Normal distribution

2. Scores are sampled randomly and are independent




Discuss the issues surrounding selection of the appropriate sample size

It's all starting to become a little confusing. Which distribution do you use?

When a large sample (generally more than 30 observations) is used, a z table can always be used to construct the confidence interval. It does not matter if the population distribution is normal, or if the population variance is known or not. This is because the central limit theorem assures that when the sample is large, the distribution of the sample mean is approximately normal. However, when the population variance is unknown, the t-statistic is more conservative because it tends to be greater than the z-statistic, and therefore using the t-statistic will result in a wider confidence interval.

However, if there is only a small sample, a t table has to be used to construct the confidence interval when the population distribution is normal and the population variance is not known.




If the population distribution is not normal, there is no way to construct a confidence interval from a small sample using these methods (even if the population variance is known).

Therefore, all else equal, you should try to select a sample larger than 30. The larger the sample size, the more precise the confidence interval.

In general, at least one of the following is needed:

A normal distribution for the population.

A sample size that is greater than or equal to 30.

If one or both of the above occur, then a z-table or t-table is used, depending on whether $\sigma$ is known or unknown. If neither of the above occurs, then the question cannot be answered.




A summary of the situation is as follows:

If the population is normally distributed, and the population variance is known, use a z-score irrespective of sample size.

If the population is normally distributed, and the population variance is unknown, use a t-score irrespective of sample size.

If the population is not normally distributed, and the population variance is known, use a z-score only if n >= 30, otherwise it cannot be done.

If the population is not normally distributed, and the population variance is unknown, use a t-score only if n >= 30, otherwise it cannot be done.
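The four rules above reduce to a small decision function. The sketch below simply encodes this summary; the function name `reliability_factor_choice` and the use of `None` for "cannot be done" are illustrative choices.

```python
from typing import Optional

def reliability_factor_choice(
    population_normal: bool, variance_known: bool, n: int
) -> Optional[str]:
    """Return 'z', 't', or None (no interval can be built) per the summary."""
    # A non-normal population needs a large sample so that the
    # central limit theorem can be relied upon.
    if not population_normal and n < 30:
        return None
    # Known variance: z-score. Unknown variance: t-score.
    return "z" if variance_known else "t"

for normal in (True, False):
    for known in (True, False):
        for n in (10, 30):
            print(normal, known, n, reliability_factor_choice(normal, known, n))
```

Note that this follows the summary strictly; as mentioned earlier, with a large sample and unknown variance a z table is also acceptable, with t being the more conservative choice.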




6. Common biases in sampling methods.

As has already been mentioned repeatedly, if there are problems with the choice of sample, then the

conclusions that are drawn from a sample could be in error.

There are a number of different types of bias that can creep into samples. It is important to be aware of

them, and have the ability to comment on their possible appearance in the data where appropriate.

Data-snooping bias is the bias in the inference drawn as a result of prying into the empirical results of

others to guide your own analysis.

Finding seemingly significant but in fact spurious patterns in the data is a serious problem in financial

analysis. Although it afflicts all non-experimental sciences, data-snooping is particularly problematic for

financial analysis because of the large number of empirical studies performed on the same datasets. Given

enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any

dataset. In some cases, these spurious patterns are statistically small, almost unnoticeable in isolation. But

because small effects in financial calculations can often lead to very large differences in investment

performance, data-snooping biases can be surprisingly substantial.
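
To see why "enough attempts" is the problem, consider a rough calculation. It assumes (these assumptions are ours, not part of the text above) that each attempt is an independent test at the 5% significance level and that no real pattern exists. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(alpha: float, n_tests: int) -> float:
    """P(at least one test looks significant) when every null hypothesis is true."""
    # Each independent test avoids a false positive with probability (1 - alpha),
    # so all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 5, 20, 100):
    print(k, round(prob_at_least_one_false_positive(0.05, k), 3))
```

Under these assumptions, 20 attempts on data with no real pattern already give roughly a 64% chance of at least one "significant" result.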




For example, after examining the empirical evidence from 1986 to 2002, Professor Minard concludes that a

growth investment strategy produces superior investment performance. After reading about Professor Minard's study, Monica decides to conduct research on growth versus value investing based on the same

or related historical data used by Professor Minard. Monica's research is subject to data-snooping bias

because, among other things, the patterns in the data used by Professor Minard may be spurious.

The best way to avoid data-snooping bias is to examine new data. However, data-snooping bias is difficult

to avoid because investment analysis is typically based on historical or hypothesized data.

Data-snooping bias can easily lead to data-mining bias.

Data-mining is the practice of finding forecasting models by extensive searching through databases for

patterns or trading rules (i.e. repeatedly "drilling" in the same data until you find something). It has a very

specific definition: continually mixing and matching the elements of a database until one "discovers" two

or more data series that are highly correlated. Data-mining also refers more generically to any of a

number of practices in which data can be tortured into confessing anything.




Two signs may indicate the existence of data-mining in research findings about profitable trading

strategies:

1. Many of the variables actually used in the research are not reported. This may indicate

that the researchers were searching through many unreported variables.

2. There is no plausible economic theory available to explain why those strategies work.

To avoid data-mining, analysts should use out-of-sample data to test a potentially profitable trading rule.

That is, analysts should test the trading rule on a data set other than the one used to establish the rule.
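
The following sketch illustrates both the problem and the out-of-sample remedy. Everything in it is an assumption made for illustration: the "returns" and the 200 candidate "predictors" are independent random noise, the first 60 observations play the role of the data used to establish the rule, and the last 60 are the out-of-sample data.

```python
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n_obs, n_candidates = 120, 200
half = n_obs // 2

# Target "returns" and candidate "predictors" are all independent noise,
# so any relationship found between them is spurious by construction.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]

# Data-mining step: keep whichever candidate correlates best in the first half.
in_sample = [abs(pearson(c[:half], target[:half])) for c in candidates]
best = max(range(n_candidates), key=lambda i: in_sample[i])

# Out-of-sample check: re-measure the chosen candidate on the untouched second half.
out_of_sample = abs(pearson(candidates[best][half:], target[half:]))

print(f"best in-sample |correlation|: {in_sample[best]:.3f}")
print(f"same series out-of-sample:    {out_of_sample:.3f}")
```

The in-sample figure is the largest of 200 noise correlations, so it tends to look impressive. The out-of-sample figure is a single fresh measurement and will typically be much closer to zero.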

Sample selection bias occurs when data availability leads to certain assets being excluded from the

analysis. The discrete choice has become a popular tool for assessing the value of non-market goods.

Surveys used in these studies frequently suffer from large non-response which can lead to significant bias

in parameter estimates and in the estimate of mean




Survivorship bias is the most common type of sample selection bias. It occurs when studies are conducted on

databases that have eliminated all companies that have ceased to exist (often due to bankruptcy). The findings

from such studies most likely will be upwardly biased, since the surviving companies will look better than

those that no longer exist. For example, many mutual fund

databases provide historical data about only those funds that are currently in existence. As a result, funds that

have ceased to exist due to closure or merger do not appear in these databases. Generally, funds that have

ceased to exist have lower returns relative to the surviving funds. Therefore, the analysis of a mutual fund

database with survivorship bias will overestimate the average mutual fund return because the database only

includes the better-performing funds. Return data on stocks listed on an exchange is also

subject to survivorship bias: it is difficult to collect information on delisted companies, and these companies often have poor performance.
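
A small simulation shows the direction of the bias. The numbers are hypothetical: 1,000 funds with normally distributed annual returns (mean 6%, standard deviation 15%), and an assumed rule that any fund returning less than -10% is closed and dropped from the database.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical annual returns for 1,000 funds (mean 6%, standard deviation 15%).
all_funds = [rng.gauss(0.06, 0.15) for _ in range(1000)]

# Assumed closure rule: funds returning less than -10% are shut down or merged away.
CLOSURE_CUTOFF = -0.10
survivors = [r for r in all_funds if r >= CLOSURE_CUTOFF]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


print(f"average return, all funds:       {mean(all_funds):.4f}")
print(f"average return, survivors only:  {mean(survivors):.4f}")
print(f"funds dropped from the database: {len(all_funds) - len(survivors)}")
```

Because only the worst performers are removed, the survivors-only average can never fall below the all-funds average. An analyst who sees only the survivors overstates the typical fund return.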

Look-ahead bias exists when studies assume that fundamental information is available when it is not. For

example, researchers often assume a person had annual earnings data in January; in reality the data might not

be available until March. This usually biases results upwards.
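
One way to guard against look-ahead bias in a backtest is to filter on the date a figure was published, not on the period it describes. The sketch below uses made-up companies, dates, and field names purely for illustration.

```python
from datetime import date

# Hypothetical earnings records: fiscal period end versus the date the figure was published.
earnings = [
    {"company": "A", "period_end": date(2018, 12, 31), "released": date(2019, 3, 15), "eps": 2.10},
    {"company": "B", "period_end": date(2018, 12, 31), "released": date(2019, 1, 25), "eps": 1.45},
    {"company": "C", "period_end": date(2018, 12, 31), "released": date(2019, 2, 28), "eps": 0.80},
]


def known_as_of(records, decision_date):
    """Keep only records an investor could actually have seen on decision_date."""
    return [r for r in records if r["released"] <= decision_date]


# A backtest that trades on 31 January 2019 may use company B's figure only.
usable = known_as_of(earnings, date(2019, 1, 31))
print([r["company"] for r in usable])
```

Filtering on `period_end` instead would let the backtest use all three figures in January, even though two of them had not yet been published.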

Time period bias occurs when a test design is based on a time period that may make the results time-period specific. Even the worst performers have months or even years in which they look wonderful. After all, stopped

clocks are right twice a day. To eliminate strategies that have just been lucky, research must encompass many

years. However, if the time period is too long, the fundamental economic structure may have changed during

the time frame, resulting in two data sets that reflect

different relationships.
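
A simple check for time-period sensitivity is to compare results across sub-periods. The sketch below uses six hypothetical annual excess returns and a helper function of our own naming; it is an illustration, not a formal test for structural change.

```python
def subperiod_means(returns, n_periods):
    """Split returns into n_periods consecutive blocks and average each block."""
    size = len(returns) // n_periods
    if size == 0:
        raise ValueError("not enough observations for that many sub-periods")
    # Any leftover observations at the end are ignored to keep blocks equal in length.
    blocks = [returns[i * size:(i + 1) * size] for i in range(n_periods)]
    return [sum(block) / len(block) for block in blocks]


# Hypothetical annual excess returns of a strategy: strong early, weak later.
excess_returns = [0.10, 0.12, 0.08, -0.05, -0.03, -0.04]

print(subperiod_means(excess_returns, 1))  # one full-period average
print(subperiod_means(excess_returns, 2))  # first half versus second half
```

Here the full-period average is positive, yet it hides a first half that looks excellent and a second half that loses money. A conclusion drawn from either half alone would be specific to that period.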




For FREE Resources

https://www.educorporatebridge.com/freebies.php

Corporate Bridge Blog

Finance News, Articles, Interview Tips etc

https://www.educorporatebridge.com/blog

For Online Finance Courses

For any other enquiry / information

Email [email protected]

https://www.educoporatebridge.com

Disclaimer: Please refer to the updated CFA Level 1 curriculum for further information.