CSI5388: Functional Elements of Statistics for Machine Learning, Part I


Upload: roger-pope

Post on 29-Dec-2015


TRANSCRIPT

Page 1: CSI5388: Functional Elements of Statistics for Machine Learning, Part I

Page 2: Contents of the Lecture

Part I (this set of lecture notes):
• Definitions and Preliminaries
• Hypothesis Testing: Parametric Approaches

Part II (the next set of lecture notes):
• Hypothesis Testing: Non-Parametric Approaches
• Power of a Test
• Statistical Tests for Comparing Multiple Classifiers

Page 3: Definitions and Preliminaries I

A Random Variable is a function that assigns a unique numerical value to each possible outcome of a random experiment under fixed conditions.

If X takes on N values x_1, x_2, ..., x_N, such that each x_i ∈ R, then we can define the mean of X, its variance, and its standard deviation.
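The formulas on this slide did not survive in the transcript. The following is a minimal sketch, assuming the usual population definitions: mean = (1/N) Σ x_i, variance = (1/N) Σ (x_i - mean)², and standard deviation = sqrt(variance). The function name and the example values are hypothetical.

```python
import math


def population_stats(values):
    """Return (mean, variance, standard deviation) of `values`.

    Assumption: `values` lists all N values taken by X, so the variance
    divides by N (not N - 1).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, variance, math.sqrt(variance)


# Hypothetical values x_1..x_N, chosen only for illustration.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance, std_dev = population_stats(xs)
print(mean, variance, std_dev)
```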

Page 4: Definitions and Preliminaries II

• Sample Variance
• Sample Standard Deviation
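The formulas for these two quantities are missing from the transcript. A small sketch, assuming the conventional definitions, in which the sample variance divides the sum of squared deviations by N - 1 and the sample standard deviation is its square root. The function name and data are hypothetical.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: squared deviations from the sample mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical sample, chosen only for illustration.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(xs)
s = math.sqrt(s2)

# Cross-check against the standard library, which uses the same N - 1 convention.
print(s2, statistics.variance(xs))
print(s, statistics.stdev(xs))
```

Dividing by N - 1 rather than N makes the sample variance slightly larger than the population-style variance computed on the same numbers.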

Page 5: Hypothesis Testing

• Generalities
• Sampling Distributions
• Procedure
• One- versus Two-tailed tests
• Parametric approaches

Page 6: Generalities

Purpose: If we assume a given sampling distribution, we want to establish whether or not a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data.

Approaches to Hypothesis Testing: There are two different approaches to hypothesis testing: Parametric and Non-Parametric.

Page 7: Sampling Distributions

Definition: The sampling distribution of a statistic (for example, the mean, the median, or any other description/summary of a data set) is the distribution of values obtained for that statistic over all possible samples of the same size from a given population.

Note: Since the populations under study are usually infinite or, at least, very large, the true sampling distribution is usually unknown. Therefore, rather than being determined exactly, it has to be estimated. Nonetheless, we can do so quite well, especially when considering the mean of the data.
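One way to make the definition concrete is to approximate a sampling distribution by simulation. The sketch below is only an illustration: the population (10,000 Gaussian values), the sample size of 25, and the helper name are all assumptions, and drawing 2,000 random samples stands in for "all possible samples".

```python
import random
import statistics


def empirical_sampling_distribution(population, sample_size, statistic, n_samples, rng):
    """Approximate the sampling distribution of `statistic` by drawing
    `n_samples` random samples (without replacement) of `sample_size`."""
    return [statistic(rng.sample(population, sample_size)) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical finite population with mean near 50 and standard deviation near 10.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

means = empirical_sampling_distribution(population, 25, statistics.mean, 2_000, rng)
medians = empirical_sampling_distribution(population, 25, statistics.median, 2_000, rng)

print("mean:  ", statistics.mean(means), statistics.stdev(means))
print("median:", statistics.mean(medians), statistics.stdev(medians))
```

Both empirical distributions centre near the population mean, and the spread of the sample means comes out close to 10/sqrt(25) = 2, which is the kind of regularity that makes the mean convenient to work with.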

Page 8: Procedure I

Idea: If we assume a given sampling distribution, we want to establish whether or not a sample result is representative of that sampling distribution. This is interesting because it helps us decide whether the results we obtained in an experiment can generalize to future data.

Example: If a sample mean we obtain on a particular data sample is representative of the sampling distribution, then we can conclude that our data sample is representative of the whole population. If not, it means that the values in our sample are unrepresentative. (Perhaps this sample contained data that were particularly easy or particularly difficult to classify.)

Page 9: Procedure II

1. State your research hypothesis.
2. Formulate a null hypothesis stating the opposite of your research hypothesis. In particular, the null hypothesis regards the relationship between the sampling statistic of the basic population and the sample result you obtained from your specific set of data.
3. Collect your specific data and compute the statistic's sample result on it.
4. Calculate the probability of obtaining a sample result at least as extreme as the one you obtained if the sample emanated from the data set that gave you the original sample statistic (that is, assuming the null hypothesis holds).
5. If this probability is low, reject the null hypothesis, and state that the sample you considered does not emanate from the data set that gave you the original sample statistic.

Page 10: One- and Two-Tailed Tests

If H0 is expressed as an equality, then there are two ways to reject H0. Either the statistic computed from your sample at hand is lower than the sampling statistic, or it is higher. If you are only concerned about either lower or higher statistics, then you should perform a one-tailed test. If you are simultaneously concerned about the two ways in which H0 can be rejected, then you should perform a two-tailed test.
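The difference between the two kinds of test can be sketched with a standard normal statistic. The function name and the value z = 1.8 are hypothetical; `statistics.NormalDist` from the Python standard library supplies the normal CDF.

```python
from statistics import NormalDist


def tail_probabilities(z):
    """Return (lower-tail, upper-tail, two-tailed) probabilities for a
    standard normal statistic z."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    lower = std_normal.cdf(z)          # P(Z <= z): concern about low values only
    upper = 1.0 - std_normal.cdf(z)    # P(Z >= z): concern about high values only
    two_tailed = 2.0 * min(lower, upper)  # both directions count against H0
    return lower, upper, two_tailed


# Hypothetical statistic, chosen only for illustration.
lower, upper, two_tailed = tail_probabilities(1.8)
print(upper, two_tailed)
```

With z = 1.8 the upper-tail probability is about 0.036 while the two-tailed probability is about 0.072, so the same data would lead to rejecting H0 at the .05 level in a one-tailed test but not in a two-tailed one.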

Page 11: Parametric Approaches to Hypothesis Testing

The classical approach to hypothesis testing is parametric. This means that, in order to be applied, this approach makes a number of assumptions regarding the distribution of the population and the available sample.

Non-parametric approaches, discussed later, do not make these strong assumptions, although they do make some assumptions as well, as will be discussed there.

Page 12: Why are Hypothesis Tests often applied to means?

Hypothesis tests are often applied to means. The reason is that, unlike for other statistics, the standard deviation of the mean is known and simple to calculate.

Having access to this standard deviation is essential: the probability that the sample under consideration emanates from the population represented by the original sampling statistic is linked to it, so without a standard deviation, hypothesis testing could not be performed.

Page 13: Why is the standard deviation of the mean easy to calculate?

Because of the important Central Limit Theorem, which states that no matter how your original population is distributed, if you use large enough samples, then the sampling distribution of the mean of these samples approaches a normal distribution. If the mean of the original population is μ and its standard deviation is σ, then the mean of the sampling distribution is μ and its standard deviation is σ/sqrt(N), where N is the sample size.
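The σ/sqrt(N) claim can be checked empirically. This is a sketch under stated assumptions: the parent population is exponential with rate 1 (so μ = 1 and σ = 1, and clearly not normal), the sample size N = 40 and the 5,000 repetitions are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 40                  # size of each sample (hypothetical choice)

# Draw 5,000 samples of size N from an exponential population and keep their means.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(N))
    for _ in range(5_000)
]

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)
predicted_sd = 1.0 / math.sqrt(N)  # sigma / sqrt(N) with sigma = 1

print(observed_mean, observed_sd, predicted_sd)
```

The observed mean of the sample means should land close to μ = 1 and their standard deviation close to 1/sqrt(40), even though the parent population is strongly skewed.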

Page 14: When is the sampling distribution of the mean Normal?

The sample size necessary for the sampling distribution of the mean to approach normal depends on the distribution of the parent population.

• If the parent population is normal, then the sampling distribution of the mean is also normal.
• If the parent population is not normal, but symmetrical and uni-modal, then the sampling distribution of the mean will be approximately normal, even for small sample sizes.
• If the population is very skewed, then sample sizes of at least 30 will be required for the sampling distribution of the mean to be approximately normal.

Page 15: How are hypothesis tests set up? t-tests

Hypothesis tests are used to find out whether a sample mean comes from a sampling distribution with a specified mean.

We will consider:
• One-sample t-tests
  - μ, σ known
  - μ, σ unknown
• Two-sample t-tests
  - Two matched samples
  - Two independent samples

Page 16: One-sample t-test, σ known

If σ is known, we can use the central limit theorem to obtain the sampling distribution of this population's mean (its mean is μ and its standard deviation is σ/sqrt(N)).

Let X be the mean of our data sample; we compute

z = (X - μ)/(σ/sqrt(N)) (1)

Using the z-table, we find the probability of obtaining a z at least as large as the value computed. We output this probability as is if we are solely interested in a one-tailed test, and double it before outputting it if we are interested in a two-tailed test.

If this output probability is smaller than .05, we would reject H0 at the .05 level of significance. Otherwise, we would state that we have no evidence to conclude that H0 does not hold.
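Equation (1) and the table lookup can be sketched as follows. The function name and the numbers (a classifier with mean accuracy 0.83 over 50 runs, tested against μ = 0.80 with known σ = 0.10) are hypothetical, and `statistics.NormalDist` replaces the printed z-table.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu, sigma, n, two_tailed=True):
    """Return (z, p) for equation (1): z = (X - mu) / (sigma / sqrt(N)).

    The one-tailed p-value is the probability of a z at least as large in
    magnitude, in the direction observed; the two-tailed value doubles it.
    """
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    tail = 1.0 - NormalDist().cdf(abs(z))
    p = 2.0 * tail if two_tailed else tail
    return z, p


# Hypothetical numbers, chosen only for illustration.
z, p = one_sample_z_test(sample_mean=0.83, mu=0.80, sigma=0.10, n=50)
_, p_one_tailed = one_sample_z_test(0.83, 0.80, 0.10, 50, two_tailed=False)
print(z, p, p_one_tailed)
if p < 0.05:
    print("reject H0 at the .05 level")
else:
    print("no evidence to conclude that H0 does not hold")
```

Here z is about 2.12 and the two-tailed probability about 0.034, so H0 would be rejected at the .05 level.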

Page 17: What are the meaning and purpose of z?

Normal distributions can all be easily mapped into a single one, using a specific transformation.

This means that, in our hypothesis tests, we can use the same information about the sampling distribution over and over (if we assume that our population is normally distributed), no matter what the mean and variance of our actual population are.

Any observation can be changed into a standard score, z, with respect to mean = 0 and standard deviation = 1, as follows:

Z = (X - mean)/sd

Page 18: One-sample t-test, σ unknown

In most situations, σ, the standard deviation of the population, is unknown. In this case, we replace σ by s, the sample standard deviation, in equation (1), yielding

t = (X - μ)/(s/sqrt(N)) (2)

Because s is likely to under-estimate σ and, thus, return a t-value larger than z would have been had σ been known, it is inappropriate to use the distribution of z to accept or reject the null hypothesis.

Instead, we use the Student's t distribution, which corrects for this problem, and compare t to the t-table with N-1 degrees of freedom. We then proceed as we did for z on the slide about σ known, above.
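A sketch of equation (2) in code. The standard library has no t-table, so this example integrates the Student's t density numerically with Simpson's rule to get a two-tailed probability; that numerical shortcut, the function names, and the accuracy values are all assumptions made for illustration.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coeff = math.exp(log_coeff) / math.sqrt(df * math.pi)
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed probability P(|T| >= |t|), via Simpson's rule on [0, |t|].

    `steps` must be even for Simpson's rule.
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2.0 * total * h / 3.0  # area between -|t| and +|t|
    return 1.0 - central_area


def one_sample_t_test(sample, mu):
    """Return (t, degrees of freedom, two-tailed p) for equation (2)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (N - 1 divisor)
    t = (mean - mu) / (s / math.sqrt(n))
    return t, n - 1, t_two_tailed_p(t, n - 1)


# Hypothetical accuracies from 10 runs, tested against H0: mu = 0.80.
accuracies = [0.82, 0.85, 0.79, 0.88, 0.84, 0.81, 0.86, 0.83, 0.87, 0.85]
t, df, p = one_sample_t_test(accuracies, mu=0.80)
print(t, df, p)
```

As a sanity check on the integration, a t of 2.262 with 9 degrees of freedom gives a two-tailed probability very close to .05, matching the usual t-table entry.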

Page 19: What are the meaning and purpose of t?

t follows the same principle as z, except for the fact that t should be used when the standard deviation is unknown.

t, however, represents a family of curves rather than a single curve. The shape of the t distribution changes from sample size to sample size.

As the sample size grows larger and larger, t looks more and more like a normal distribution.
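The "family of curves" can be seen by evaluating the t density for several degrees of freedom and comparing it with the standard normal density. The function name and the chosen degrees of freedom are hypothetical; the density formula is the standard one for Student's t.

```python
import math
from statistics import NormalDist


def t_density(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return (math.exp(log_coeff) / math.sqrt(df * math.pi)
            * (1 + x * x / df) ** (-(df + 1) / 2))


normal = NormalDist()  # standard normal, for comparison
normal_peak = normal.pdf(0.0)

for df in (1, 5, 30, 1000):
    # Height at the centre and in the tail (x = 3) for each member of the family.
    print(df, t_density(0.0, df), t_density(3.0, df))
print("normal", normal_peak, normal.pdf(3.0))
```

For small degrees of freedom the curve is lower at the centre and heavier in the tails than the normal curve; as the degrees of freedom grow, both values move toward the normal ones.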

Page 20: Assumption of the t-test with σ unknown

Please note that one assumption is made in the use of the t-test: we assume that the sample was drawn from a normally distributed population.

This is required because the derivation of t by Student was based on the assumption that the sample mean and the sample variance are independent, an assumption that is true when the population is normally distributed.

In practice, however, the assumption about the distribution from which the sample was drawn can be lifted whenever the sample size is sufficiently large to produce a normal sampling distribution of the mean. In general, n = 25 or 30 (number of cases in a sample) is sufficiently large. Often, it can be smaller than that.

Page 21: Two-sample t-tests, matched samples

Given two matched populations, we want to test whether the difference in means between these two populations is significant or not. We do so by looking at the mean, D, and the standard deviation, SD, of the differences between the matched pairs, and comparing D to a mean of 0.

We can then apply the t-test as we did above, in the case where σ was unknown.

This time, we have

t = (D - 0)/(SD/sqrt(n)) (3)

We use the t-table as before, with n-1 degrees of freedom, and the same assumptions about the normality of the distribution.
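Equation (3) can be sketched as below, assuming D is the mean of the pairwise differences and SD their sample standard deviation. The function name and the two lists of accuracies (two classifiers evaluated on the same five folds) are hypothetical.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for equation (3) on matched samples."""
    if len(a) != len(b):
        raise ValueError("matched samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_mean = statistics.fmean(diffs)  # D: mean of the paired differences
    s_d = statistics.stdev(diffs)     # SD: their sample standard deviation
    t = (d_mean - 0.0) / (s_d / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies of two classifiers on the same 5 folds.
clf_a = [0.81, 0.84, 0.79, 0.88, 0.83]
clf_b = [0.78, 0.80, 0.78, 0.83, 0.80]
t, df = paired_t_statistic(clf_a, clf_b)
print(t, df)
```

Here t is about 4.82 with 4 degrees of freedom, which exceeds the two-tailed .05 table value of 2.776, so the difference would be declared significant at that level.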

Page 22: Two-sample t-tests, independent samples

This time, we are interested in comparing two populations with different means and variances. The two populations are completely independent.

We can, again, apply the t-test, with the same conditions applying, using the formula:

t = (X1 - X2)/sqrt((s1²/n1) + (s2²/n2))
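A sketch of this formula, reading X1 and X2 as the two sample means, s1² and s2² as the sample variances, and n1 and n2 as the sample sizes. The function name and data are hypothetical. The slide does not state the degrees of freedom for this test, so the example computes only the statistic.

```python
import math
import statistics


def independent_t_statistic(a, b):
    """t = (X1 - X2) / sqrt(s1^2/n1 + s2^2/n2) for two independent samples."""
    n1, n2 = len(a), len(b)
    var1 = statistics.variance(a)  # s1^2, with the N - 1 divisor
    var2 = statistics.variance(b)  # s2^2, with the N - 1 divisor
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Hypothetical accuracies on two unrelated sets of runs (sizes may differ).
group_1 = [0.81, 0.84, 0.79, 0.88, 0.83]
group_2 = [0.75, 0.80, 0.78, 0.74, 0.79, 0.76]
t_stat = independent_t_statistic(group_1, group_2)
print(t_stat)
```

Unlike the matched case, the two samples do not need the same length, and swapping their order only flips the sign of t.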

Page 23: Confidence Intervals

Sample means represent point estimates of the mean parameter. Here, we are interested in interval estimates, which tell us how large or small the true value of μ could be without causing us to reject H0, given that we ran a t-test on the mean of our sample.

To calculate these intervals, we simply take the equations presented on the previous slides and express them in terms of μ, as a function of t.

We then substitute for t the two-tailed value we are interested in from the t-table. This value can be taken as positive or negative, meaning that we obtain two values for μ: μ_upper and μ_lower. These give us the limits of the confidence interval.

The confidence level attached to the value of t chosen (for example, 95%) is the proportion of intervals built in this way, over repeated samples, that would contain μ. For a given sample, the wider the interval, the higher the confidence level that goes with it. Conversely, the narrower the interval, the lower the confidence level.
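Solving equation (2) for μ gives μ = X ± t·s/sqrt(N), which the sketch below evaluates. The function name and the data are hypothetical, and the critical value 2.262 is the two-tailed .05 entry for 9 degrees of freedom from a standard t-table, hard-coded here as an assumption.

```python
import math
import statistics


def confidence_interval(sample, t_critical):
    """Return (mu_lower, mu_upper) = X -/+ t_critical * s / sqrt(N)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (N - 1 divisor)
    margin = t_critical * s / math.sqrt(n)
    return mean - margin, mean + margin


# Two-tailed .05 critical value for N - 1 = 9 degrees of freedom (t-table).
T_CRITICAL_95_DF9 = 2.262

# Hypothetical accuracies from 10 runs.
accuracies = [0.82, 0.85, 0.79, 0.88, 0.84, 0.81, 0.86, 0.83, 0.87, 0.85]
mu_lower, mu_upper = confidence_interval(accuracies, T_CRITICAL_95_DF9)
print(mu_lower, mu_upper)
```

The interval comes out at roughly 0.820 to 0.860. A hypothesized mean of 0.80 falls outside it, which agrees with a two-tailed t-test on the same data rejecting H0: μ = 0.80 at the .05 level.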