
Prof. Dr. S. K. Bhattacharjee Department of Statistics

University of Rajshahi

Statistical Inference

• Statistical inference is the process of making judgments about an unknown population based on a sample.

• An important aspect of statistical inference is using estimates to approximate the value of an unknown population parameter.

• Another type of inference involves choosing between two opposing views or statements about the population; this process is called hypothesis testing.

Statistical Estimation

An estimator is a statistic (a function of the sample data) that provides an estimate of a population parameter.

• Point Estimation
• Interval Estimation

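Point Estimation

A point estimator is a statistic that yields a single numerical estimate of a population parameter.

• The sample mean, $\bar{x}$, is a point estimator of the population mean, μ.
• The sample proportion, p, is a point estimator of the population proportion, π.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad p = \frac{\text{number of successes in the sample}}{n}.$$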

Properties of a Good Estimator (Principles of Parameter Estimation)

• Unbiased – The expected value of the estimator is equal to the population parameter.
• Consistent – As the sample size n increases, the estimator converges to the population parameter.
• Efficient – Has the smallest variance among the estimators considered.
• Minimum Mean-Squared Error – The mean-squared error (variance plus squared bias) of the estimator is as low as possible.
• Sufficient – Contains all the information about the parameter that is available in a sample of size n.

Unbiased Estimator

An unbiased estimator is a statistic whose expected value is equal to the population parameter being estimated:

$$E[\theta^*]_n = \theta_0 \quad \text{for any } n.$$

Examples:

• The sample mean, $\bar{x}$, is an unbiased estimator of the population mean, μ.
• The sample variance, $s^2$ (computed with divisor n − 1), is an unbiased estimator of the population variance, $\sigma^2$.
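The unbiasedness of the sample variance can be checked by simulation. The sketch below is illustrative only: the normal population, the sample size, the number of repetitions, and the function name `average_variance_estimates` are all assumptions chosen for the demonstration.

```python
import random

def average_variance_estimates(n=5, reps=20000, sigma=2.0, seed=0):
    """Average two variance estimators over many simulated normal samples."""
    rng = random.Random(seed)
    total_unbiased = 0.0  # estimator with divisor n - 1
    total_biased = 0.0    # estimator with divisor n
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)
        total_unbiased += ss / (n - 1)
        total_biased += ss / n
    return total_unbiased / reps, total_biased / reps

# True variance is sigma**2 = 4.0 in this hypothetical setup.
unbiased_avg, biased_avg = average_variance_estimates()
print(f"divisor n-1: {unbiased_avg:.3f}   divisor n: {biased_avg:.3f}")
```

On average the divisor n − 1 version should sit near the true variance, while the divisor n version is systematically too small by the factor (n − 1)/n.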

Consistent Estimators

A statistic is a consistent estimator of a parameter if the probability that it will be close to the parameter's true value approaches 1 with increasing sample size.

The standard error of a consistent estimator becomes smaller as the sample size gets larger.

The sample mean and the sample proportion are consistent estimators: from the formulas for their standard errors,

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \sigma_{p} = \sqrt{\frac{\pi(1-\pi)}{n}},$$

the standard errors become smaller as n gets bigger.

Mathematically, a sequence of estimators $\{t_n;\ n \ge 0\}$ is a consistent estimator for parameter θ if and only if, for all ϵ > 0, no matter how small, we have

$$\lim_{n \to \infty} P\big(|t_n - \theta| < \epsilon\big) = 1.$$

An estimator's distribution (like that of any other nontrivial statistic) typically becomes narrower and narrower, and more and more normal-like, as larger and larger samples are considered. If we take for granted that the variance of the estimator tends to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to θ0 as the sample size grows without limit.

In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converges in probability to θ0.
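Convergence in probability can be illustrated numerically. The following sketch estimates P(|x̄ − μ| < ε) for the sample mean at several sample sizes; the normal population, the value of ε, and the helper name `prob_close` are assumptions made for the illustration.

```python
import random

def prob_close(n, eps=0.1, mu=5.0, sigma=1.0, reps=1000, seed=1):
    """Estimate P(|sample mean - mu| < eps) for samples of size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(x_bar - mu) < eps:
            hits += 1
    return hits / reps

# The estimated probability should climb towards 1 as n grows.
probs = {n: prob_close(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:4d}:  P(|x_bar - mu| < 0.1) is about {p:.3f}")
```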

Relative Efficiency

A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both

• the sample mean and
• the sample median

are unbiased estimators of the distribution mean (when it exists). Which one should we choose?

Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value θ0 than estimates generated by the other one. One way to do that is to select the estimator with the lower variance.

This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators $\theta^*_1$ and $\theta^*_2$ of the same parameter θ, one defines the efficiency of $\theta^*_2$ with respect to $\theta^*_1$ (for a given sample size n) as the ratio of their variances:

$$\text{Relative efficiency}\,(\theta^*_2 \text{ with respect to } \theta^*_1)_n = \frac{\mathrm{Var}(\theta^*_1)_n}{\mathrm{Var}(\theta^*_2)_n}$$

Efficient Estimator

An efficient estimator has a low variance. When the variance is judged relative to other estimators, this is called relative efficiency; in the absolute sense, an efficient estimator is one whose variance is the minimum attainable.

Efficiency thus measures the reliability of an estimator: of two estimators compared at the same sample size, the more efficient one has the smaller standard error.

The median is an unbiased estimator of μ when the population is normally distributed, but for large samples its standard error is about 1.25 times that of the sample mean, so the sample mean is a more efficient estimator than the median.

Under regularity conditions, the Maximum Likelihood Estimator is asymptotically the most efficient estimator: for large samples its variance approaches the lowest value attainable by unbiased estimators.
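The factor of roughly 1.25 for the median can be checked by simulation. This is a minimal sketch under assumed settings (standard normal population, samples of size 101, 4000 repetitions); the function name `standard_errors` is hypothetical.

```python
import random
import statistics

def standard_errors(n=101, reps=4000, seed=2):
    """Simulated standard errors of the sample mean and the sample median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return statistics.pstdev(means), statistics.pstdev(medians)

se_mean, se_median = standard_errors()
ratio = se_median / se_mean  # theory for large n: sqrt(pi / 2), about 1.25
print(f"SE(mean) = {se_mean:.4f}, SE(median) = {se_median:.4f}, ratio = {ratio:.3f}")
```

In terms of the relative efficiency defined earlier, a ratio of standard errors near 1.25 corresponds to a variance ratio near 0.64 for the median with respect to the mean.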

Minimum Mean-Squared Error Estimator

The practitioner is not particularly keen on unbiasedness. What is really important is that, on average, the estimate θ* be close to the true value θ0. So the practitioner will tend to favour estimators such that the mean-square error

$$E\big[(\theta^* - \theta_0)^2\big]$$

is as low as possible, whether θ* is biased or not. Such an estimator is called a minimum mean-square-error estimator. Given two estimators:

• $\theta^*_1$, which is unbiased but has a large variance,
• $\theta^*_2$, which is somewhat biased but has a small variance,

$\theta^*_2$ might prove a better estimator than $\theta^*_1$ in practice.
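The trade-off between bias and variance follows from a standard decomposition of the mean-square error, sketched here in the notation of this section (θ* the estimator, θ0 the true value). The cross term vanishes because $E\big[\theta^* - E[\theta^*]\big] = 0$.

```latex
\begin{align*}
E\big[(\theta^* - \theta_0)^2\big]
  &= E\big[(\theta^* - E[\theta^*])^2\big] + \big(E[\theta^*] - \theta_0\big)^2 \\
  &= \mathrm{Var}(\theta^*) + \mathrm{Bias}(\theta^*)^2 .
\end{align*}
```

A slightly biased estimator can therefore have the lower mean-square error whenever its reduction in variance exceeds its squared bias.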


Sufficient Estimator

We have shown that $\bar{Y}$ and $S^2$ are unbiased estimators of μ and $\sigma^2$.

Are we losing any information about our target parameters by relying on these statistics?

Statistics that summarize all the information in the sample about the target parameters are said to have the property of sufficiency; they are called sufficient statistics.

“Good” estimators are (or can be made to be) functions of any sufficient statistic.

• Let $Y_1, Y_2, \ldots, Y_n$ denote a random sample from a probability distribution with unknown parameter θ. Then the statistic u is said to be sufficient for θ if the conditional distribution of $Y_1, Y_2, \ldots, Y_n$ given u does not depend on θ.

• Let u be a statistic based on the random sample $Y_1, Y_2, \ldots, Y_n$. Then u is a sufficient statistic for the estimation of a parameter θ if and only if the likelihood $L(y_1, y_2, \ldots, y_n \mid \theta)$ can be factored into two nonnegative functions

$$L(y_1, y_2, \ldots, y_n \mid \theta) = g(u, \theta)\, h(y_1, y_2, \ldots, y_n),$$

where $g(u, \theta)$ is a function only of u and θ, and $h(y_1, y_2, \ldots, y_n)$ is not a function of θ.

Example : Sufficient Estimator
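A standard illustration of the factorization criterion, assuming (this model is an assumption, not taken from the slides) that $Y_1, \ldots, Y_n$ are independent Bernoulli(p) observations and $u = \sum y_i$ is the number of successes:

```latex
\begin{align*}
L(y_1, \ldots, y_n \mid p)
  &= \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i} \\
  &= \underbrace{p^{u} (1-p)^{n-u}}_{g(u,\,p)} \times
     \underbrace{1}_{h(y_1, \ldots, y_n)},
  \qquad u = \sum_{i=1}^{n} y_i .
\end{align*}
```

Since h does not involve p, the number of successes u is sufficient for p under this assumed model.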

Methods of Point Estimation

• Classical Approach
• Bayesian Approach

Classical Approach:

• Method of Moments
• Method of Maximum Likelihood
• Method of Least Squares

Method of Moments

i) Sample moments should provide good estimates of the corresponding population moments.

ii) Because the population moments are functions of population parameters, we can use i) to get these parameters

Formal Definition:

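Choose as estimates those values of the parameters that are solutions of the equations $\mu'_k = m'_k$, for $k = 1, 2, \ldots, t$, where t is the number of parameters to be estimated. Here $\mu'_k = E[Y^k]$ is the k-th population moment and $m'_k = \frac{1}{n}\sum_{i=1}^{n} Y_i^k$ is the corresponding sample moment.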

Example

A random sample $Y_1, Y_2, \ldots, Y_n$ is selected from a population in which $Y_i$ possesses a uniform density function over the interval $(0, \theta)$, where θ is unknown. Use the method of moments to estimate θ.

Solution

The value of $\mu'_1$ for a uniform random variable is

$$\mu'_1 = \mu = \frac{\theta}{2}.$$

The corresponding first sample moment is

$$m'_1 = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.$$

From which:

$$\mu'_1 = \frac{\theta}{2} = \bar{Y}.$$

Thus,

$$\hat{\theta} = 2\bar{Y}.$$
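The result of this example can be tried on simulated data. In the sketch below, the true value of θ, the sample size, and the function name `mom_uniform_theta` are hypothetical choices for the demonstration.

```python
import random

def mom_uniform_theta(sample):
    """Method-of-moments estimate for Uniform(0, theta): twice the sample mean."""
    return 2.0 * sum(sample) / len(sample)

rng = random.Random(3)
true_theta = 10.0  # hypothetical value, used only to generate the data
sample = [rng.uniform(0.0, true_theta) for _ in range(5000)]
theta_hat = mom_uniform_theta(sample)
print(f"method-of-moments estimate of theta: {theta_hat:.3f}")
```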

Method of Maximum Likelihood

The likelihood and log-likelihood functions are the basis for deriving estimators for parameters, given data. While the shapes of these two functions are different, they attain their maximum at the same parameter value. The parameter value at this maximum is defined as the Maximum Likelihood Estimate (MLE): the value that is “most likely” relative to the other values, given the data.

Using calculus, one can take the first partial derivative of the likelihood or log-likelihood function with respect to the parameter(s), set it to zero, and solve for the parameter(s). The solution gives the MLE(s), provided it is a maximum rather than a minimum or saddle point.

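If x is a continuous random variable with pdf

$$f(x;\ \theta_1, \theta_2, \ldots, \theta_k),$$

where $\theta_1, \theta_2, \ldots, \theta_k$ are k unknown constant parameters which need to be estimated, conduct an experiment and obtain N independent observations, $x_1, x_2, \ldots, x_N$. Then the likelihood function is given by the following product:

$$L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_N) = \prod_{i=1}^{N} f(x_i;\ \theta_1, \ldots, \theta_k).$$

The logarithmic likelihood function is given by:

$$\Lambda = \ln L = \sum_{i=1}^{N} \ln f(x_i;\ \theta_1, \ldots, \theta_k).$$

The maximum likelihood estimators (MLE) of $\theta_1, \ldots, \theta_k$ are obtained by maximizing L or Λ.

By maximizing Λ, which is much easier to work with than L, the maximum likelihood estimators (MLE) of $\theta_1, \ldots, \theta_k$ are the simultaneous solutions of k equations such that:

$$\frac{\partial \Lambda}{\partial \theta_j} = 0, \qquad j = 1, 2, \ldots, k.$$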
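As an illustration of this recipe, the sketch below assumes an exponential model with rate λ (an assumption chosen because the score equation has a simple closed form). Setting the derivative of the log-likelihood, N/λ − Σx, to zero gives λ̂ = N/Σx, and a grid search over the log-likelihood serves as a numerical check. The function names and the simulated data are hypothetical.

```python
import math
import random

def exp_log_likelihood(rate, data):
    """Log-likelihood of an Exponential(rate) model: N*ln(rate) - rate*sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)

def exp_mle(data):
    """Closed-form MLE from solving N/rate - sum(x) = 0."""
    return len(data) / sum(data)

rng = random.Random(4)
data = [rng.expovariate(2.0) for _ in range(500)]  # hypothetical true rate 2.0
rate_hat = exp_mle(data)

# Numerical check: the best rate on a fine grid lies next to the closed form.
grid = [0.01 * k for k in range(1, 1001)]
best_on_grid = max(grid, key=lambda r: exp_log_likelihood(r, data))
print(f"closed-form MLE: {rate_hat:.4f}, best grid value: {best_on_grid:.2f}")
```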

Properties of Maximum Likelihood Estimators

For “large” samples (“asymptotically”), MLEs are optimal:

1. MLEs are asymptotically normally distributed.
2. MLEs are asymptotically “minimum variance.”
3. MLEs are asymptotically unbiased (MLEs are often biased, but the bias → 0 as n → ∞).
4. MLEs are consistent.

In this asymptotic sense, the Maximum Likelihood Estimator is the most efficient estimator among all the unbiased ones.

Maximum likelihood estimation represents the backbone of statistical estimation.

Example

Suppose the data come from a distribution with unknown parameter p. Find the MLE of p.

The likelihood is

and the loglikelihood is

Taking derivatives and solving, we find
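As an illustration of these steps, assume (hypothetically, since the distribution in this example is not stated) that $X_1, \ldots, X_n$ are independent Bernoulli(p) observations. A sketch of the derivation under that assumption:

```latex
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}
      = p^{\sum x_i}\,(1-p)^{\,n-\sum x_i}, \\
\ln L(p) &= \Big(\sum x_i\Big)\ln p + \Big(n - \sum x_i\Big)\ln(1-p), \\
\frac{d \ln L}{dp} &= \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.
\end{align*}
```

Under this assumed model the MLE is simply the sample proportion of successes.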

Example

Suppose the data come from a distribution with an unknown parameter. Find the MLE of that parameter.

The likelihood is

and the loglikelihood is

Maximizing this equation,

Method of Least Squares

A statistical technique to determine the line of best fit for a model. The least squares method fits an equation with certain parameters to observed data.

This method is extensively used in regression analysis and estimation.

Ordinary least squares: a straight line is fitted through a number of points so as to minimize the sum of the squares of the vertical distances (hence the name "least squares") from the points to this line of best fit.
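The ordinary least squares fit of a straight line y = a + bx has the well-known closed form b = Sxy/Sxx and a = ȳ − b·x̄. The sketch below applies it to a small made-up data set (hypothetical values, not the data of the slides); the function name `fit_line` is likewise an assumption.

```python
def fit_line(xs, ys):
    """Return intercept a and slope b minimizing sum((y - a - b*x)**2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"a = {a:.3f}, b = {b:.3f}, sum of residuals = {sum(residuals):.2e}")
```

A useful sanity check is that the residuals of a least squares line with an intercept sum to zero, up to rounding.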

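Define the distance of the data point $(x_i, y_i)$ from the line $y = a + bx$, denoted by $u_i$, as follows:

$$u_i = y_i - (a + b x_i).$$

The least squares estimates of a and b are the values that minimize $\sum_i u_i^2$.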


Example: Method of Least Squares

To illustrate the computations of b and a, refer to the following data. All the sums required are computed and shown here:

Interval Estimation

A point estimate of the parameter is not sufficient on its own. It is necessary to analyse how confident we can be about this particular estimate. One way of doing so is to define confidence intervals. If we have estimated μ, we want to know whether the “true” parameter is close to our estimate. In other words, we want to find an interval that satisfies the following relation:

P{L < μ < U} ≥ 1 − α

That is, the probability that the “true” parameter is in the interval (L, U) is at least 1 − α. The actual realisation of this interval, (L, U), is called a 100(1 − α)% confidence interval; the limits of the interval are called the lower and upper confidence limits, and 1 − α is called the confidence level.

Example: If the population variance σ² is known and we estimate the population mean, then

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \ \text{ is normal } N(0, 1).$$

We can find from the table that the probability that Z is more than 1 is equal to 0.1587. The probability that Z is less than −1 is again 0.1587. These values come from the tables of the standard normal distribution.
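The tabulated normal probabilities, and the confidence interval x̄ ± z·σ/√n that follows from the standardized statistic Z, can be computed with Python's standard library. The sample mean, σ, and n below are hypothetical numbers, and `z_interval` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

upper_tail = 1 - NormalDist().cdf(1.0)  # P(Z > 1), the table value quoted above
lower, upper = z_interval(x_bar=50.0, sigma=10.0, n=25)  # hypothetical numbers
print(f"P(Z > 1) = {upper_tail:.4f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```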

Interval Estimation, Credible Interval, and Prediction Interval

Confidence intervals are one method of interval estimation, and the most widely used in Classical statistics. An analogous concept in Bayesian statistics is the credible interval, while an alternative available in both Classical and Bayesian methods is the prediction interval, which, rather than estimating parameters, estimates the outcome of future samples.

An interval estimate of the population mean can be expressed as the probability that the mean lies between two values.

Interval estimation, “Confidence Interval”:
– uses a range of numbers within which the parameter is believed to fall (lower bound, upper bound)
– e.g. (10, 20)

Interval Estimation for the mean of a Normal Distribution

Confidence Interval

The simplest and most commonly used formula for a binomial confidence interval relies on approximating the binomial distribution with a normal distribution. This approximation is justified by the central limit theorem. The formula is

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

where $\hat{p}$ is the proportion of successes in a Bernoulli trial process estimated from the statistical sample, $z_{1-\alpha/2}$ is the 1 − α/2 quantile of a standard normal distribution, α is the error level and n is the sample size. For example, for a 95% confidence level the error (α) is 5%, so 1 − α/2 = 0.975 and $z_{1-\alpha/2} = 1.96$.
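A minimal sketch of this normal-approximation interval, using hypothetical counts (40 successes in 100 trials); the function name `wald_interval` is an assumption made for the illustration.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = wald_interval(successes=40, n=100)  # hypothetical counts
print(f"95% interval for the proportion: ({low:.3f}, {high:.3f})")
```

Because it rests on the normal approximation, this interval tends to be unreliable when n is small or the estimated proportion is close to 0 or 1.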

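Exponential Distribution

The 100(1 − α)% exact confidence interval for this estimate is given by[2]

$$\frac{1}{\hat{\lambda}}\,\frac{2n}{\chi^2_{\alpha/2,\,2n}} < \frac{1}{\lambda} < \frac{1}{\hat{\lambda}}\,\frac{2n}{\chi^2_{1-\alpha/2,\,2n}},$$

which is also equal to:

$$\frac{2n}{\hat{\lambda}\,\chi^2_{\alpha/2,\,2n}} < \frac{1}{\lambda} < \frac{2n}{\hat{\lambda}\,\chi^2_{1-\alpha/2,\,2n}},$$

where $\hat{\lambda}$ is the MLE, λ is the true value of the parameter, n is the sample size, and $\chi^2_{p,\nu}$ is the 100(1 − p) percentile of the chi-squared distribution with ν degrees of freedom.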

Bayesian Estimation

Bayesian statistics views every unknown as a random quantity. In simple cases, Bayesian estimation is a little more complicated than computing the Maximum Likelihood Estimate. Suppose we have data from a distribution with an unknown parameter θ. Our goal is to estimate θ. The first step in Bayesian statistics is to select a prior distribution, π(θ), intended to represent prior information about θ. Often, you don't have any available. In this case, the prior should be relatively diffuse. For example, if we are trying to guess the average height (in feet) of students at RU, we may know enough to realize that most students are between 5 and 6 feet tall, and therefore the mean should be between 5 and 6 feet, but we may not want to be more specific than that. We wouldn't, for example, want to specify a prior concentrated tightly around 5.6 feet. Even though 5.6 feet may be a good guess, such a prior places almost all its mass between 5.599995 and 5.600005 feet, indicating we are almost sure, before seeing any data, that the mean height is in this range. I'm personally not that sure, so I might choose a much more diffuse prior, such as setting θ ~ Uniform(5, 6), indicating that I'm sure the mean height is between 5 and 6 feet but every value in there seems about as likely as any other.

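Prior and Posterior Distribution

The tool for guessing at the parameter's value using both prior knowledge of the parameter and the data is called the posterior distribution, which is defined as the conditional distribution of the parameter given the data; formally,

$$\pi(\theta \mid x) = \frac{L(\theta \mid x)\,\pi(\theta)}{\int L(\theta \mid x)\,\pi(\theta)\,d\theta},$$

where $L(\theta \mid x)$ is the likelihood function and π(θ) is the prior.

The posterior is a distribution over θ and has all the usual properties of a distribution. In particular,

$$\int \pi(\theta \mid x)\,d\theta = 1.$$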


1. Although not guaranteed, in almost all practical situations the posterior distribution provides a more refined guess of θ than the prior. We are combining our prior information with the information contained in the data to make better guesses about θ.

2. If we observe a large amount of data, the posterior distribution is determined almost exclusively by the data, and tends to place more and more mass near the true value of θ. Thus, we don't have to be too precise about specifying our prior distribution in advance. Any errors will tend to wash out as we observe more data.

Properties of Posterior Mean

The Bayes estimate of a parameter is the posterior mean. Usually the posterior distribution will have some common distributional form (such as Gamma, Normal, Beta, etc.). Some things to remember about the posterior mean:

• The data only enter the equation for the posterior in terms of the likelihood function. Therefore, the parameters of the posterior distribution, and hence the posterior mean, are functions of the sufficient statistics.

• Often the posterior mean has lower MSE than the MLE for portions of the parameter space, so it's a worthwhile estimator to consider and compare to the MLE.

• The posterior mean is consistent, asymptotically unbiased (meaning the bias tends to 0 as the sample size increases), and the asymptotic efficiency of the MLE compared to the posterior mean is 1. Actually, for large n the MLE and posterior mean are very similar estimators, as we will see in the examples.

Example: Geometric

Suppose we wished to use a general Beta(α, β) prior. We would like a formula for the posterior in terms of α and β. We proceed as before, finding the prior density to be

$$\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,p^{\alpha-1}(1-p)^{\beta-1}.$$

The likelihood is unchanged, so the product of the prior and likelihood simplifies to

The prior parameters α and β are treated as fixed constants. Thus the Gamma functions in front may be considered part of the normalizing constant C, leaving the kernel

This is the kernel of a Beta distribution, with posterior mean

Example : Binomial

Example: Poisson

Let $X_1, \ldots, X_n$ be independent Poisson(λ) observations. Suppose you have a Gamma prior on λ with parameters α and β. Compute the posterior distribution of λ.

As stated above, our first goal is to compute and simplify the product of the likelihood and prior. If the data are $x_1, \ldots, x_n$, then the likelihood is

$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$$

and the prior density is

The posterior distribution

which simplifies to

All the $x_i$, α, and β are constants since λ is the only thing random in this expression. The terms that involve λ are

Hence the posterior distribution is

distribution.

The Bayes estimate is the posterior mean. The posterior mean of a

distribution is

Notice that

The only terms that get large as n increases are $\sum x_i$ and n. Thus, for large n, the posterior mean is approximately $\bar{x}$, the MLE.
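The convergence of the posterior mean to the MLE can be seen numerically. The sketch below assumes a Gamma prior in the shape-rate parameterization (an assumption; the slides do not show which parameterization is used), for which the posterior is Gamma(α + Σx, β + n) with mean (α + Σx)/(β + n). The data and function names are hypothetical.

```python
def gamma_poisson_posterior(alpha, beta, data):
    """Posterior (shape, rate) for a Gamma(shape=alpha, rate=beta) prior
    combined with independent Poisson counts (assumed parameterization)."""
    return alpha + sum(data), beta + len(data)

def posterior_mean(alpha, beta, data):
    """Mean of the Gamma posterior: shape divided by rate."""
    shape, rate = gamma_poisson_posterior(alpha, beta, data)
    return shape / rate

data = [3, 5, 4, 6, 2, 4, 5, 3]  # hypothetical counts
mle = sum(data) / len(data)      # the sample mean is the Poisson MLE
small_n = posterior_mean(2.0, 1.0, data)        # n = 8
large_n = posterior_mean(2.0, 1.0, data * 100)  # same counts repeated, n = 800
print(f"MLE = {mle:.4f}, posterior mean (n=8) = {small_n:.4f}, "
      f"posterior mean (n=800) = {large_n:.4f}")
```

With only eight observations the prior still pulls the estimate away from the sample mean; with eight hundred the two nearly coincide.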

Example: Normal

Bayesian Interval Estimation

Prediction

Predictive Distribution : Binomial-Beta

Predictive Density : Normal-Normal
