Vorlesung Generalized Linear Regressionmodels Antonia Rom


Page 1: Vorlesung  Generalized Linear  Regressionmodels

Vorlesung Generalized Linear Regressionmodels

Antonia Rom

Page 2: Vorlesung  Generalized Linear  Regressionmodels


Chapter 4 - Modeling of Binary Data

- Introduction
  - What is important in modeling?
  - Problems, obstacles

- 4.1 Maximum Likelihood Estimation
  - What is ML estimation?
  - Single binary response
  - Grouped data
  - Asymptotic properties
  - Existence of ML estimates
  - Estimation conditioned on predictor values

Page 3: Vorlesung  Generalized Linear  Regressionmodels


Introduction

A generalized linear model consists of:

- a probability distribution from the exponential family

- a linear predictor η = Xβ

- a link function g, the inverse of the response function h, so that E(Y) = h(η) and g(E(Y)) = η

Binary regression model

h is a fully specified function.

In this chapter the logit model is used. In this case h is the logistic distribution function

$$\pi(x) = P(y = 1 \mid x) = h(\eta) = h(x^\top \beta)$$

(linear predictor $\eta = x^\top \beta$, response function $h$, response probability $\pi(x)$)
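Written out, the logistic distribution function and the resulting model are

$$h(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}, \qquad \pi(x) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}.$$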

Page 4: Vorlesung  Generalized Linear  Regressionmodels


The link function is the inverse of the response function, g = h⁻¹.

It determines the functional form of the response probabilities.

The linear predictor determines which variables are included and in what form they affect the response. The unknown parameters β can be estimated by maximum likelihood.

Maximum likelihood estimation requires an iterative algorithm.

-> Linear predictors can contain polynomial terms of continuous variables, dummy variables, and interaction effects, as in the sketch below.
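A minimal sketch in Python with statsmodels: the data frame and its columns (`y`, `age`, `income`, `region`) are invented for illustration, not taken from the lecture.

```python
# Fit a logit model with a polynomial term, dummy coding, and an interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(20, 70, n),
    "income": rng.uniform(1, 5, n),
    "region": rng.choice(["north", "south"], n),
})
eta = -4 + 0.08 * df["age"] + 0.3 * df["income"]   # true linear predictor
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # logistic response

# I(age**2): polynomial term, C(region): dummy coding, age:income: interaction
model = smf.logit("y ~ age + I(age**2) + C(region) + age:income", data=df)
result = model.fit()   # iterative (Newton-type) maximization of the log-likelihood
print(result.summary())
```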

Care should be taken when specifying the constituents of the model, such as the linear predictor!

Introduction

Page 5: Vorlesung  Generalized Linear  Regressionmodels


• Discrepancy between data and model: does the fit of the model support the inferences drawn from it?

• Relevance of variables and form of the linear predictor: which variables should be included, and how?

• Explanatory power of the covariates

• Prognostic power of the model

• Choice of link function: which link function fits the data well and has a simple interpretation?

These aspects are not independent: the model should provide an appropriate approximation with a simple predictor, and the specification determines the goodness of fit. The choice of linear predictor aims at finding an adequate form for the covariates and at reducing the variable set, while the explanatory value aims at quantifying the effect of the covariates within the model.

The first section is about estimation: maximum likelihood estimation!!

Introduction

Page 6: Vorlesung  Generalized Linear  Regressionmodels


Maximum-Likelihood Estimation

The basic principle is to construct the likelihood of the unknown parameters for the sample data: which parameter values (e.g. mean, variance) make the observed sample most likely? The distribution has to be known!

The likelihood represents the joint probability or probability density of the observed data, considered as a function of the unknown parameters.

What does this mean in practice?

Example:

Two MP3 players, exactly the same, shuffle mode only, no display!

One with 5 songs, one with 20 songs.

Each MP3 player contains your favorite song.

Unfortunately you mixed them up. So you take one, turn it on, and your favorite song is played.

If you had to bet, which one would you choose? The one with 5 songs!!! (On that player the observed event has probability 1/5; on the other, only 1/20. Choosing the 5-song player maximizes the likelihood of what you just heard.)

Page 7: Vorlesung  Generalized Linear  Regressionmodels


An event A has happened. One wants to draw inferences about an underlying quantity B (e.g. a particular parameter). To do so, one considers the conditional probability of A for every possible estimate $\hat b_i$ of B, assuming $\hat b_i$ is true. The value of $\hat b_i$ for which $P(A \mid \hat b_i)$ is maximal is the best estimate of b.

The conditional probability $P(A \mid \hat b_i)$ is evaluated for the given event A. $P(A \mid \hat b_i)$ is also called $L(\hat b_i)$, the likelihood of $\hat b_i$.

The ML estimator is the value for which the likelihood is maximal, hence the name maximum likelihood.

Maximum-Likelihood Estimation

Page 8: Vorlesung  Generalized Linear  Regressionmodels


If there are too many candidates $\hat b_i$, it is not possible to calculate every $L(\hat b_i)$. In this case a likelihood function is built and the maximum is found by solving $L'(\hat b) = 0$.

Fact (1): The likelihood $L(x)$ is not the probability that the event x happens, but the conditional probability of the already observed event y, given x:

$L(x) = P(y \mid x)$

Fact (2): The likelihoods do not sum to 1.
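A tiny numeric illustration of "build the likelihood function and maximize it", on synthetic coin-flip data assumed only for this sketch:

```python
# Synthetic Bernoulli data: 7 successes in 10 trials (made up for illustration).
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(p):
    # log L(p) = sum_i [ y_i * log p + (1 - y_i) * log(1 - p) ]
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Too many candidate values to tabulate by hand, so scan a fine grid;
# analytically, solving L'(p) = 0 gives the sample mean.
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax([log_likelihood(p) for p in grid])]
print(p_hat, y.mean())   # both ~0.7
```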

Maximum-Likelihood Estimation

Page 9: Vorlesung  Generalized Linear  Regressionmodels


The likelihood for the response is given by
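(in standard notation, with $\pi_i = P(y_i = 1 \mid x_i) = h(x_i^\top \beta)$)

$$L(\beta) = \prod_{i=1}^{n} \pi_i^{\,y_i}\,(1 - \pi_i)^{\,1 - y_i}$$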

Observations are considered independent. The maximum likelihood estimate of β is the value $\hat\beta$ that maximizes the likelihood.

L values can get very small, so the log-likelihood $l(\beta) = \log L(\beta)$ is used instead.

The value $\hat\beta$ can be obtained by solving the system of equations $\partial l(\beta) / \partial \beta = 0$.

The derivatives form the so-called score function $s(\beta)$.
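For the logit model the score function takes the standard form

$$s(\beta) = \frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i\,\bigl(y_i - \pi_i\bigr).$$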

(The system is solved iteratively.)

MLE – Definition: Single binary response

Page 10: Vorlesung  Generalized Linear  Regressionmodels


MLE – Definition: grouped data – binomially distributed responses

Several independent binary responses are observed at each distinct predictor value.

π is assumed to depend on x only; the mean is assumed to be the same for all binary observations at that value.

The model has the form $\tilde y_i \sim B(n_i, \pi_i)$ with $\pi_i = h(x_i^\top \beta)$, where $\tilde y_i$ counts the successes among the $n_i$ binary observations at $x_i$, for $i = 1, \dots, g$ distinct predictor values.

For the collection of binary variables the likelihood has the binomial form

$$L_{bin}(\beta) = \prod_{i=1}^{g} \binom{n_i}{\tilde y_i}\, \pi_i^{\,\tilde y_i}\,(1 - \pi_i)^{\,n_i - \tilde y_i}.$$

The likelihood for the number of successes, $L_{bin}(\beta)$, and the binary-observation likelihood $L(\beta)$ differ only in the binomial factor $\binom{n_i}{\tilde y_i}$, which is irrelevant during maximization because it does not depend on β. Therefore the log-likelihood is, up to that constant,

$$l(\beta) = \sum_{i=1}^{g} \bigl[\tilde y_i \log \pi_i + (n_i - \tilde y_i) \log(1 - \pi_i)\bigr].$$

The score function of the logit model is:
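(in the grouped-data notation above)

$$s(\beta) = \sum_{i=1}^{g} x_i\,\bigl(\tilde y_i - n_i \pi_i\bigr)$$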

Page 11: Vorlesung  Generalized Linear  Regressionmodels


MLE – Definition: Asymptotic Properties

The MLE has some favorable properties.

The ML estimator exists and is unique asymptotically.

It is consistent and asymptotically normally distributed.

It is asymptotically efficient.

Consistency

The likelihood is a smooth function that behaves in a nice way, and its maximum is attained at a unique point.

As the two functions $l_n$ (the sample log-likelihood) and $l$ (its limit) get closer, the points at which their maxima are attained should also get closer, which means exactly that $\|\hat\beta_n - \beta\| \to 0$.

Page 12: Vorlesung  Generalized Linear  Regressionmodels


MLE – Definition: Asymptotic Properties

Asymptotic normality:

The estimator not only converges to the unknown parameter, it also converges fast enough: $\sqrt{n}\,(\hat\beta - \beta)$ has a limiting normal distribution.

In ML theory the asymptotic variance of the estimator is determined by the information matrix, also called the Fisher matrix, $F(\beta)$; asymptotically, $\hat\beta \sim N(\beta, F(\beta)^{-1})$.

For binary data (logit model):

$$F(\beta) = \sum_{i=1}^{n} x_i x_i^\top\, \pi_i (1 - \pi_i)$$

For grouped data:

$$F(\beta) = \sum_{i=1}^{g} n_i\, x_i x_i^\top\, \pi_i (1 - \pi_i)$$
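As a practical aside, continuing the hypothetical statsmodels sketch from the linear-predictor slide, the fitted `result` object exposes an estimate of the inverse Fisher matrix directly:

```python
import numpy as np

# `result` is the fitted logit model from the earlier sketch (assumption).
cov = result.cov_params()      # estimated asymptotic covariance, F(beta)^{-1}
se = np.sqrt(np.diag(cov))     # asymptotic standard errors of beta-hat
print(se)
```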

Page 13: Vorlesung  Generalized Linear  Regressionmodels


MLE – Definition: Existence of maximum Likelihood Estimates

For a finite sample size it may happen that ML estimates do not exist.

ML estimates do not exist when the data set shows complete separation.

ML estimates may not exist when the data set shows quasi-complete separation.

ML estimates do exist when the data set shows overlap.

ML estimates do exist when the data set shows linear dependency.

In short, ML estimates exist if there is no hyperplane that separates the 0 and the 1 responses (see the sketch below).
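A small sketch of complete separation on toy data, assumed only for illustration: y equals 1 exactly when x > 0, so a point on the x-axis separates the responses and the ML estimate does not exist.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = (x > 0).astype(float)      # complete separation at x = 0
X = sm.add_constant(x)

# Depending on the statsmodels version, fitting either raises a
# perfect-separation error or warns while the slope estimate diverges
# (the log-likelihood keeps increasing as the slope grows without bound).
try:
    print(sm.Logit(y, X).fit(disp=0).params)
except Exception as err:
    print(type(err).__name__, ":", err)
```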

Page 14: Vorlesung  Generalized Linear  Regressionmodels


MLE – Definition: Estimation conditioned on Predictor Values

Sometimes samples are drawn conditional on the response y.

In such stratified samples one observes x values given y=1 and x values given y=0.

A common case is case-control studies in biomedicine, where y=1 (cases) and y=0 (controls).

(This is called choice-based sampling in econometrics.)

Let us consider the simplest case of a binary predictor, with y ∈ {0,1} and x ∈ {0,1}, and with

$$\gamma = \frac{P(y=1 \mid x=1)\,/\,P(y=0 \mid x=1)}{P(y=1 \mid x=0)\,/\,P(y=0 \mid x=0)}$$

the odds ratio, which captures the association between y and x.

This parameter of association is the same under both sampling schemes; in the original logit model it equals $e^{\beta}$, with β the coefficient to be estimated.

Page 15: Vorlesung  Generalized Linear  Regressionmodels


This approach might be motivated by the specific structure of the logit model.

We go back to Chapter 2.2.2, the derivation of the binary logit model, where the predictors are assumed to be normally distributed. Let $f(x \mid y = r)$ denote the density of x given y = r, and let $p_r = P(y = r)$ denote the marginal probability, for r ∈ {0, 1}.

From Bayes' theorem it follows that

$$P(y=1 \mid x) = \frac{p_1\, f(x \mid y=1)}{p_1\, f(x \mid y=1) + p_0\, f(x \mid y=0)}.$$

Therefore

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log \frac{p_1}{p_0} + \log \frac{f(x \mid y=1)}{f(x \mid y=0)}$$

holds.

This shows that a logit model holds if the log density ratio has a linear form containing $x^\top \beta$, and only the intercept depends on the marginal probabilities.

The important point is that the marginals determine only the intercept!

MLE – Definition: Estimation conditioned on Predictor Values

Page 16: Vorlesung  Generalized Linear  Regressionmodels


The likelihood for given y differs from the likelihood given the predictors.

By using Bayes' theorem, $f(x \mid y)\,P(y) = P(y \mid x)\,f(x)$, one obtains for the log-likelihood conditional on y a decomposition into three parts:

- a term equivalent to the conditional log-likelihood of the logit model,

- the marginal distribution of x (which can be maximized by the empirical distribution),

- the marginal distribution of y (which is fixed by the sampling design).

MLE – Definition: Estimation conditioned on Predictor Values

Page 17: Vorlesung  Generalized Linear  Regressionmodels


General binary model: link function and linear predictor. Care should be taken when specifying these constituents!

Maximum Likelihood Estimation

The basic principle is to construct the likelihood of the unknown parameters for the sample data!

MLE can cope with difficult and complicated linear predictors (interactions, dummy variables, etc.); it is an iterative algorithm.

Properties of MLE

It is consistent and asymptotically normally distributed.

It is asymptotically efficient (Fisher matrix).

Maximum likelihood estimates might not exist; they do exist when the data set has overlap or linear dependency.

Depending on the sampling design, ML estimation can also be conditional on the response y.

Summary

Page 18: Vorlesung  Generalized Linear  Regressionmodels


Thank you for your attention!

Page 19: Vorlesung  Generalized Linear  Regressionmodels


Note the subtle difference: we were interested in the probability function because it gives the probabilities of occurrence of realizations for given parameters θ. With the likelihood function, we take the sample as given and are interested in the unknown parameter θ that makes the realization of the given sample "most likely"!


Page 22: Vorlesung  Generalized Linear  Regressionmodels


MLE - Example

Page 23: Vorlesung  Generalized Linear  Regressionmodels


Suppose Bryant's points scored in a game are Y = M + ε, where M is your prediction and ε is a random error. You can now calculate the probability of Bryant scoring the amounts he actually scored.

Basic principle of MLE: construct the likelihood of the unknown parameters for the sample data!!!

Let f(ε) denote the density function for ε. (Recall that the density function is like a probability function, and that the density for a normal variable is a bell curve with its maximum at ε=0.)

Given the prediction M and the density function, you can compute the probability of Bryant scoring any particular point total Y. This is given by the formula f(Y-M) = f(ε).

- For example, if you believe that M=32, then the probability that Bryant scores 35 is given by f(35-32) = f(3).
- If σ=6, for example, then examination of the normal table reveals f(3) = .08.

Assume that Bryant's scoring in one game is independent of what he scored in the prior game.
- Recall that the probability of two independent events occurring is just the product of the probabilities that each occurs.
- It follows that the probability, or likelihood, of Bryant scoring exactly 33, 22, 25, 40, and 30 points is just the product of the probabilities of his getting each of these scores.

Given any prediction M, you can write the likelihood score as:

$$L = f(33-M) \cdot f(22-M) \cdot f(25-M) \cdot f(40-M) \cdot f(30-M).$$
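A numeric sketch of this example (σ = 6 taken from the next slide's text; scipy's normal density stands in for the table values):

```python
import numpy as np
from scipy.stats import norm

scores = np.array([33, 22, 25, 40, 30])

def likelihood(M, sigma=6.0):
    # L(M) = f(33-M) * f(22-M) * f(25-M) * f(40-M) * f(30-M)
    return np.prod(norm.pdf(scores - M, scale=sigma))

grid = np.linspace(10, 50, 401)
M_hat = grid[np.argmax([likelihood(M) for M in grid])]
print(M_hat, scores.mean())   # ML estimate coincides with the sample mean, 30
```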

MLE - Example

Page 24: Vorlesung  Generalized Linear  Regressionmodels


You want to find the "maximum likelihood estimator" (MLE) of M! This is the value of M that maximizes L.

- Intuitively, you know that the MLE of M would not be 15 or 50 or some number far from his typical scoring output. It is almost impossible that a player who is predicted to score 15 points per game would actually score 33, 22, 25, 40, and 30. In fact, if M = 15 and σ = 6, then
L = f(33-15) · f(22-15) · f(25-15) · f(40-15) · f(30-15) = f(18) · f(7) · f(10) · f(25) · f(15) < .0000001

But 32 might be a good candidate to be the MLE. Someone predicted to score 32 points per game has a reasonable chance of scoring 33, 22, 25, 40, and 30.

- In this case, L = f(1) · f(-10) · f(-7) · f(8) · f(-2) ≈ .00005
- It turns out that the MLE of M is given by the mean of the realized values of Y. That is, M = 30, and L = .00014.
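As a one-line check that the mean maximizes L under normal errors:

$$\frac{d}{dM} \log L = \sum_{i} \frac{Y_i - M}{\sigma^2} = 0 \;\Longrightarrow\; \hat M = \bar Y = \frac{33 + 22 + 25 + 40 + 30}{5} = 30.$$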

MLE - Example
