
Page 1: Review Course Statistics Probability Theory - Statistical

Review Course Statistics
Probability Theory - Statistical Inference - Matrix Algebra

Prof. Dr. Christian Conrad

Heidelberg University

Winter term 2012/13

Christian Conrad (Heidelberg University) Winter term 2012/13 1 / 88

Page 2: Review Course Statistics Probability Theory - Statistical

Review Course Statistics

Christian Conrad
Email: [email protected]

Tue 09.10. – Thu 11.11.12, 09.00 – 12.00 and 14.00 – 16.00, HEU I

Slides: http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13), Password: econometrics12_13

Christian Conrad (Heidelberg University) Winter term 2012/13 2 / 88

Page 3: Review Course Statistics Probability Theory - Statistical

Econometrics

Lecture: Christian Conrad
Tue, 9.00-12.00, Bergheimer Str. 58, Hörsaal

Office hours: Mo 11.00-12.00, Bergheimer Str. 58, 01.019a
Email: [email protected]

Tutorial: Matthias Hartmann
Mon, 14.00-16.00
Theory: Bergheimer Str. 58, 00.010
STATA: Bergheimer Str. 58, 99.005-6

Wed, 14.00-16.00
Theory: Grabengasse 3-5, NUni HS 10
STATA: Bergheimer Str. 58, 99.005-6

Lecture notes, problem sets . . .
http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13)

Christian Conrad (Heidelberg University) Winter term 2012/13 3 / 88

Page 4: Review Course Statistics Probability Theory - Statistical

Review Course Statistics

Contents

1. Review of Probability Theory

2. Review of Statistics

3. Matrix Algebra

Christian Conrad (Heidelberg University) Winter term 2012/13 4 / 88

Page 5: Review Course Statistics Probability Theory - Statistical

Review Course Statistics

Literature

Stock, J. H. and M. W. Watson, Introduction to Econometrics, 3rd edition, Pearson, 2012.

Review of Probability Theory: Chapter 2 and Chapter 17.2, Appendix 17.1
Review of Statistics: Chapter 3 and Chapter 17.2
Matrix Algebra: Appendix 18.1

Christian Conrad (Heidelberg University) Winter term 2012/13 5 / 88

Page 6: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory

1 Review of Probability Theory
1.1 Random variables and probability distributions
1.2 Expected values, mean, and variance
1.3 Two random variables
1.4 Random sampling and the distribution of the sample average
1.5 Large-sample approximations to sampling distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 6 / 88

Page 7: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Random experiment

Our starting point is a random experiment with

mutually exclusive potential outcomes ω and

the set of all possible outcomes Ω, called the sample space.

An event A is a collection of outcomes and hence a subset of Ω.

The probability P(A) of an event is the proportion of the time the event occurs in the long run.

Example: Tossing a die once

Describe the sample space and the events A: "the outcome is an odd number" and B: "the outcome is an even number". What is P(A)?

Christian Conrad (Heidelberg University) Winter term 2012/13 7 / 88

Page 8: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Random variables

A random variable X is a numerical summary of a random outcome, i.e. the random variable assigns a real number X(ω) = x to each outcome ω ∈ Ω. x is called the realization. X can be either discrete or continuous:

a discrete random variable takes only a discrete set of values, like 0, 1, 2, . . .
a continuous random variable takes on a continuum of possible values

Christian Conrad (Heidelberg University) Winter term 2012/13 8 / 88

Page 9: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Cumulative distribution function (cdf)

The cumulative distribution function F is the probability that the random variable is less than or equal to a particular value x:

F_X(x) = P(X ≤ x) = P({ω : X(ω) ≤ x})

Properties of the cdf:

1. F_X is nondecreasing in x.

2. F_X is right-continuous, that means lim_{x→x₀, x>x₀} F_X(x) = F_X(x₀).

3. lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.

4. P(a < X ≤ b) = F_X(b) − F_X(a).

Christian Conrad (Heidelberg University) Winter term 2012/13 9 / 88
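
As a small illustration (not part of the slides), the cdf properties can be checked numerically for the standard normal distribution with SciPy; the interval bounds a and b below are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

# Illustration: check the cdf properties for the standard normal cdf F = Phi.
F = norm.cdf

a, b = -1.0, 2.0
print(F(b) - F(a))                    # property 4: P(a < X <= b) = F(b) - F(a)

print(F(-10), F(10))                  # limits: ~0 as x -> -inf, ~1 as x -> +inf

x = np.linspace(-5, 5, 1001)
print(np.all(np.diff(F(x)) >= 0))     # nondecreasing on a grid: True
```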

Page 10: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Probability distribution
The probability distribution of a discrete random variable is the list of all possible values, x₁, x₂, . . ., of the random variable and the probability that each value will occur. It takes the form

P(X = x_i) = p_i, i = 1, 2, . . .

where 0 ≤ p_i ≤ 1 and ∑_i p_i = 1, so that

F_X(x_i) = P(X ≤ x_i) = ∑_{x_t ≤ x_i} P(X = x_t)

is a step function.

Example: A random variable is Bernoulli distributed if the outcome is binary with

X = 1 with probability p
X = 0 with probability 1 − p.

Christian Conrad (Heidelberg University) Winter term 2012/13 10 / 88
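
A minimal sketch of a discrete probability distribution and its step-function cdf, using an assumed Bernoulli success probability p = 0.3:

```python
import numpy as np

# Sketch (assumed values): pmf and step-function cdf of a Bernoulli(p) variable.
p = 0.3
values = np.array([0.0, 1.0])
probs = np.array([1 - p, p])        # P(X = 0), P(X = 1)

def cdf(x):
    """F_X(x) = sum of P(X = x_t) over all x_t <= x (a step function)."""
    return probs[values <= x].sum()

print(probs.sum())                               # 1.0: probabilities sum to one
print(cdf(-0.5), cdf(0.0), cdf(0.7), cdf(1.0))   # 0.0, 0.7, 0.7, 1.0
```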

Page 11: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Probability density function (pdf)
For a continuous random variable the probabilities are represented by the probability density function (pdf) f_X(x), such that the area under the pdf between any two points a and b (where a < b) is the probability that the random variable falls between these two points:

P(a < X ≤ b) = ∫_a^b f_X(x) dx

and

F_X(x) = P(X ≤ x) = ∫_{−∞}^x f_X(u) du.

A function f_X(x) is a pdf if and only if f_X(x) ≥ 0 for all x and ∫_{−∞}^{∞} f_X(x) dx = 1.

Christian Conrad (Heidelberg University) Winter term 2012/13 11 / 88

Page 12: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 12 / 88

Page 13: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Example: A continuous random variable X with density

f_X(x) = (1 / (√(2π) σ_X)) exp( −(x − µ_X)² / (2σ_X²) )

with parameters µ_X and σ_X > 0 is said to be normally distributed. We use the notation

X ∼ N(µ_X, σ_X²).

If µ_X = 0 and σ_X² = 1, the random variable is said to be standard normally distributed. In this case we denote the pdf and cdf by φ(x) and Φ(x).

Christian Conrad (Heidelberg University) Winter term 2012/13 13 / 88

Page 14: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 14 / 88

Page 15: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 15 / 88

Page 16: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 16 / 88

Page 17: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.1 Random variables and probability distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 17 / 88

Page 18: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Expected value

The expected value (or mean) of a random variable X, denoted by µ_X = E(X), is given by

E[X] = ∑_i x_i P(X = x_i)

if X is discrete and

E[X] = ∫_{−∞}^{∞} x · f_X(x) dx

if X is continuous. It is the “long-run average value of the random variable over many repeated trials”.

Christian Conrad (Heidelberg University) Winter term 2012/13 18 / 88

Page 19: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Variance

The variance of a random variable X, denoted by σ_X² = Var(X), is given by

Var[X] = E[(X − µ_X)²] = ∑_i (x_i − µ_X)² P(X = x_i)

if X is discrete and

Var[X] = ∫_{−∞}^{∞} (x − µ_X)² · f_X(x) dx

if X is continuous. It is a measure of the dispersion or the “spread” of a probability distribution. The square root of the variance is called the standard deviation and is denoted by σ_X.

The variance can be written as Var[X] = E[X²] − (E[X])².

Christian Conrad (Heidelberg University) Winter term 2012/13 19 / 88
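
A quick numerical sketch (with an assumed small discrete distribution) showing that the two variance formulas above give the same number:

```python
import numpy as np

# Sketch (assumed values): check Var[X] = E[X^2] - (E[X])^2 for a discrete X.
x = np.array([1.0, 2.0, 5.0])
p = np.array([0.2, 0.5, 0.3])       # probabilities, sum to 1

mean = np.sum(x * p)                             # E[X]
var_def = np.sum((x - mean) ** 2 * p)            # E[(X - mu_X)^2]
var_alt = np.sum(x ** 2 * p) - mean ** 2         # E[X^2] - (E[X])^2

print(mean, var_def, var_alt)                    # the two variance formulas agree
```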

Page 20: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Similarly, for any function g, we define the expectation E[g(X)] as

E[g(X)] = ∑_i g(x_i) P(X = x_i)

if X is discrete and

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx

if X is continuous.

Jensen’s Inequality: If g(X) is a convex function, then

g(E[X]) ≤ E[g(X)].

In particular (Expectation Inequality):

|E[X]| ≤ E[|X|]

Christian Conrad (Heidelberg University) Winter term 2012/13 20 / 88

Page 21: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Cauchy-Schwarz Inequality:

|E[XY]| ≤ √(E[X²] E[Y²])

(for the proof see Appendix 17.2)

Higher-order moments:

For r = 1, 2, . . ., we define the r-th moment of X as

E(X^r)

and the r-th central moment of X as

E[(X − E(X))^r].

Remark: If E[X^r] < ∞, then all the raw moments of order less than r also exist.

Christian Conrad (Heidelberg University) Winter term 2012/13 21 / 88

Page 22: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Skewness:

Skewness = E[(X − µ_X)³] / σ_X³

The skewness describes how much a distribution deviates from symmetry. For a symmetric distribution, Skewness = 0. The distribution has a long right (left) tail if Skewness > 0 (Skewness < 0).

Kurtosis:

Kurtosis = E[(X − µ_X)⁴] / σ_X⁴

The kurtosis of a distribution is a measure of how much mass is in its tails. The greater the kurtosis of a distribution, the more likely are outliers. The kurtosis of a normally distributed random variable is 3. A distribution with kurtosis exceeding 3 is called leptokurtic or heavy-tailed.

Christian Conrad (Heidelberg University) Winter term 2012/13 22 / 88
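
As an illustration under assumed simulated data, sample skewness and kurtosis can be computed with SciPy; normal data should give values near 0 and 3, while a chi-square(3) sample is right-skewed and leptokurtic.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Sketch: sample skewness and kurtosis of simulated data.
rng = np.random.default_rng(0)
x_norm = rng.normal(size=100_000)
x_chi2 = rng.chisquare(df=3, size=100_000)

# fisher=False returns the kurtosis itself (normal -> 3), not the excess kurtosis.
print(skew(x_norm), kurtosis(x_norm, fisher=False))   # ~0, ~3
print(skew(x_chi2), kurtosis(x_chi2, fisher=False))   # > 0, > 3
```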

Page 23: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Christian Conrad (Heidelberg University) Winter term 2012/13 23 / 88

Page 24: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Example: Calculate the mean and variance of a Bernoulli distributed random variable (a worked sketch follows below).

Example (E2.1): Let Y denote the number of “heads” that occur when two coins are tossed.

1. Derive the probability distribution of Y.
2. Derive the cumulative probability distribution of Y.
3. Derive the mean and variance of Y.

Example: Consider the discrete random variable X and the function g(X) = a + bX with a, b ∈ R. Derive E[g(X)] and Var[g(X)].

Example (E2.8): The random variable Y has a mean of 1 and a variance of 4. Let Z = (1/2)(Y − 1). Show that µ_Z = 0 and σ_Z² = 1.

Christian Conrad (Heidelberg University) Winter term 2012/13 24 / 88
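
A sketch of the first calculation above (the Bernoulli case), using only the definitions of E[X] and Var[X] for a discrete random variable:

```latex
% Worked sketch for X ~ Bernoulli(p):
\begin{align*}
  E[X]   &= 1 \cdot p + 0 \cdot (1-p) = p \\
  E[X^2] &= 1^2 \cdot p + 0^2 \cdot (1-p) = p \\
  \operatorname{Var}[X] &= E[X^2] - (E[X])^2 = p - p^2 = p(1-p)
\end{align*}
```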

Page 25: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.2 Expected values, mean, and variance

Example (E17.5): Suppose that W is a random variable with E[W⁴] < ∞. Show that E[W²] < ∞. [Hint: Calculate the variance of W².]

Example (E2.21): X is a random variable with moments E[X], E[X²], E[X³], and so forth.

1. Show that E[(X − µ_X)³] = E[X³] − 3 E[X²] E[X] + 2 (E[X])³.

2. Show that E[(X − µ_X)⁴] = E[X⁴] − 4 E[X] E[X³] + 6 (E[X])² E[X²] − 3 (E[X])⁴.

Christian Conrad (Heidelberg University) Winter term 2012/13 25 / 88

Page 26: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Joint and marginal distributions

The joint cdf of the random variables X and Y is given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y).

The joint pdf f_{X,Y}(x, y) of X and Y is given by

f_{X,Y}(x, y) = P(X = x, Y = y)

if X and Y are discrete and by

f_{X,Y}(x, y) = ∂² F_{X,Y}(x, y) / (∂x ∂y)

if X and Y are continuous.

Christian Conrad (Heidelberg University) Winter term 2012/13 26 / 88

Page 27: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Christian Conrad (Heidelberg University) Winter term 2012/13 27 / 88

Page 28: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Suppose X and Y are discrete with outcomes x₁, x₂, . . . , x_l and y₁, y₂, . . . , y_k. Then the marginal probability distribution of Y is given by

P(Y = y_j) = ∑_{i=1}^{l} P(X = x_i, Y = y_j) for j = 1, . . . , k.

If X and Y are continuous, the marginal density function of Y is given by

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.

Christian Conrad (Heidelberg University) Winter term 2012/13 28 / 88

Page 29: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Conditional distributions/density

The conditional distribution/density of Y given X = x is given by

P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)

if X and Y are discrete and

f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x)

if X and Y are continuous.

Christian Conrad (Heidelberg University) Winter term 2012/13 29 / 88

Page 30: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Christian Conrad (Heidelberg University) Winter term 2012/13 30 / 88

Page 31: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Conditional expectation

The conditional expectation of Y given X = x is given by

E(Y | X = x) = ∑_{j=1}^{k} y_j P(Y = y_j | X = x)

if X and Y are discrete and

E(Y | X = x) = ∫_{−∞}^{∞} y · f_{Y|X=x}(y) dy

if X and Y are continuous.

Conditional variance

Var[Y | X = x] = E[(Y − E(Y | X = x))² | X = x]

Christian Conrad (Heidelberg University) Winter term 2012/13 31 / 88

Page 32: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Christian Conrad (Heidelberg University) Winter term 2012/13 32 / 88

Page 33: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

The law of iterated expectations

simple law of iterated expectations:

E(Y) = E(E(Y|X))

extended law of iterated expectations:

E(Y|X) = E(E(Y|X, Z)|X)

in general: let x and w be random vectors with x = f(w) for some function f. Then

E[Y|x] = E[E[Y|w]|x].

finally:

E(g(X) Y | X) = g(X) E(Y|X)

(for details see Wooldridge, Appendix 2A)

Christian Conrad (Heidelberg University) Winter term 2012/13 33 / 88
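
A Monte Carlo sketch of the simple law of iterated expectations, for an assumed illustrative model Y = 2X + U with X and U independent (so E[Y|X] = 2X):

```python
import numpy as np

# Sketch: E[Y] = E[E[Y|X]] checked by simulation for Y = 2X + U.
rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(loc=1.0, scale=1.0, size=n)
u = rng.normal(size=n)
y = 2 * x + u

print(y.mean())          # sample analogue of E[Y], ~2.0
print((2 * x).mean())    # sample analogue of E[E[Y|X]] = 2 E[X], also ~2.0
```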

Page 34: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Independence

Two random variables are independent if knowing the value of one of the variables provides no information about the other. X and Y are independently distributed if, for all values x and y,

P(Y = y | X = x) = P(Y = y)

if X and Y are discrete and

f_{Y|X=x}(y) = f_Y(y)

if X and Y are continuous. Alternatively, we can say that X and Y are independently distributed if the joint distribution equals the product of the marginal distributions, i.e.

P(X = x, Y = y) = P(X = x) P(Y = y)

f_{X,Y}(x, y) = f_X(x) f_Y(y)

Christian Conrad (Heidelberg University) Winter term 2012/13 34 / 88

Page 35: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Covariance and correlation

The covariance between X and Y is

σ_{XY} = Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)].

The correlation between X and Y is

ρ_{XY} = Corr(X, Y) = σ_{XY} / (σ_X σ_Y).

Christian Conrad (Heidelberg University) Winter term 2012/13 35 / 88

Page 36: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Properties of Covariance and Correlation

−1 ≤ Corr(X, Y) ≤ 1; it is a measure of linear dependence, free of units of measurement.

Cov(a + bX, c + dY) = bd Cov(X, Y)

Cov(X, Y) = E[XY] − E[X] E[Y]

X, Y are statistically independent ⇒ Cov(X, Y) = Corr(X, Y) = 0.

Cov(X, Y) ≠ 0 or Corr(X, Y) ≠ 0 ⇒ X, Y are statistically dependent.

Cov(X, Y) = Corr(X, Y) = 0 does not imply that X, Y are statistically independent.

Christian Conrad (Heidelberg University) Winter term 2012/13 36 / 88

Page 37: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Christian Conrad (Heidelberg University) Winter term 2012/13 37 / 88

Page 38: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Correlation and conditional mean

If E[Y|X] = µ_Y, i.e. the conditional mean of Y does not depend on X, then Cov(X, Y) = 0 and Corr(X, Y) = 0.

However, Cov(X, Y) = 0 does not imply that the conditional mean of Y does not depend on X.

(see Example E2.23)

Christian Conrad (Heidelberg University) Winter term 2012/13 38 / 88

Page 39: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

More on expectation, variance, and covariance

Consider the random variables X, Y, and Z.

E(aX + bY) = a E(X) + b E(Y)

Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)

If X and Y are independent, then

Var(X + Y) = Var(X) + Var(Y)

since Cov(X, Y) = 0. Finally,

Cov(a + bX + cY, Z) = b σ_{XZ} + c σ_{YZ}

(for Proofs see Appendix 2.1)

Christian Conrad (Heidelberg University) Winter term 2012/13 39 / 88

Page 40: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Example E2.6:

The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2008.

1. Compute E[Y].
2. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1 − E[Y].
3. Calculate E[Y | X = 1] and E[Y | X = 0].
4. Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
5. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
6. Are educational achievement and employment status independent? Explain.

Christian Conrad (Heidelberg University) Winter term 2012/13 40 / 88

Page 41: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Christian Conrad (Heidelberg University) Winter term 2012/13 41 / 88

Page 42: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Example E2.19: Consider two random variables X and Y. Suppose that Y takes on k values y₁, . . . , y_k and that X takes on l values x₁, . . . , x_l.

1. Show that P(Y = y_j) = ∑_{i=1}^{l} P(Y = y_j | X = x_i) P(X = x_i). [Hint: Use the definition of P(Y = y_j | X = x_i).]
2. Use your answer to 1. to verify the equation E[Y] = ∑_{i=1}^{l} E[Y | X = x_i] P(X = x_i).
3. Suppose that X and Y are independent. Show that σ_{XY} = 0 and Corr(X, Y) = 0.

Christian Conrad (Heidelberg University) Winter term 2012/13 42 / 88

Page 43: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Example E2.20: Consider three random variables X, Y, and Z. Suppose that Y takes on k values y₁, . . . , y_k, that X takes on l values x₁, . . . , x_l, and that Z takes on m values z₁, . . . , z_m. The joint probability distribution of X, Y, Z is P(X = x, Y = y, Z = z), and the conditional probability distribution of Y given X and Z is

P(Y = y | X = x, Z = z) = P(X = x, Y = y, Z = z) / P(X = x, Z = z).

1. Explain how the marginal probability that Y = y can be calculated from the joint probability distribution. [Hint: This is a generalization of the equation P(Y = y) = ∑_{i=1}^{l} P(X = x_i, Y = y).]
2. Show that E[Y] = E[E[Y | X, Z]]. [Hint: This is a generalization of the equations E[Y] = ∑_{i=1}^{l} E[Y | X = x_i] P(X = x_i) and E[Y] = E[E[Y | X]].]

Christian Conrad (Heidelberg University) Winter term 2012/13 43 / 88

Page 44: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Example E2.23: This exercise provides an example of a pair of random variables X and Y for which the conditional mean of Y given X depends on X but Corr(X, Y) = 0. Let X and Z be two independently distributed standard normal random variables, and let Y = X² + Z.

1. Show that E[Y | X] = X².
2. Show that µ_Y = 1.
3. Show that E[XY] = 0. [Hint: Use the fact that the odd moments of a standard normal random variable are all zero.]
4. Show that Cov(X, Y) = 0 and thus Corr(X, Y) = 0.

Christian Conrad (Heidelberg University) Winter term 2012/13 44 / 88
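
A simulation sketch of the setting in Example E2.23: the sample correlation of X and Y is close to zero even though E[Y|X] = X² clearly depends on X.

```python
import numpy as np

# Sketch: Y = X^2 + Z with X, Z independent standard normal.
rng = np.random.default_rng(2)
n = 1_000_000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x ** 2 + z

print(np.corrcoef(x, y)[0, 1])   # ~0: no linear dependence
print(y.mean())                  # ~1: mu_Y = E[X^2] + E[Z] = 1
```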

Page 45: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.3 Two random variables

Example E2.26: Suppose that Y₁, Y₂, . . . , Y_n are random variables with a common mean µ_Y, a common variance σ_Y², and the same correlation ρ (so that the correlation between Y_i and Y_j is equal to ρ for all pairs i and j where i ≠ j).

1. Show that Cov(Y_i, Y_j) = ρ σ_Y² for i ≠ j.

2. Suppose that n = 2. Show that E[Ȳ] = µ_Y and Var[Ȳ] = (1/2) σ_Y² + (1/2) ρ σ_Y².

3. For n ≥ 2, show that E[Ȳ] = µ_Y and Var[Ȳ] = (1/n) σ_Y² + ((n − 1)/n) ρ σ_Y².

4. When n is very large, show that Var[Ȳ] ≈ ρ σ_Y².

Christian Conrad (Heidelberg University) Winter term 2012/13 45 / 88

Page 46: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.4 Random sampling and the distribution of the sample average

Random sampling

n objects, denoted by X₁, X₂, . . . , X_n, are randomly drawn from a population. That is, the X_i are random variables, independently and identically distributed (i.i.d.).

The sampling distribution of the sample average

The sample average is defined as X̄_n = (1/n) ∑_{i=1}^{n} X_i and is a random variable itself, i.e. it has a pdf, called the sampling distribution. The mean and the variance of X̄_n are given by

E[X̄_n] = µ_X

and

Var[X̄_n] = (1/n) σ_X².

If each X_i is normally distributed, i.e. X_i ∼ N(µ_X, σ_X²), then X̄_n ∼ N(µ_X, σ_X²/n), since the sum of normally distributed random variables is again normally distributed.

Christian Conrad (Heidelberg University) Winter term 2012/13 46 / 88

Page 47: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

We have seen that we can derive the exact sampling distribution of X̄_n if each of the X_i is normally distributed. Since this result holds for any value of n, the sampling distribution is called the finite-sample distribution of X̄_n.

In practice, we often do not know the distribution of the X_i. Nevertheless, in this situation we will be able to make statements about the asymptotic distribution of X̄_n. That is, we will provide an approximation to the sampling distribution which becomes exact in the limit, i.e. for n → ∞.

Christian Conrad (Heidelberg University) Winter term 2012/13 47 / 88

Page 48: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Convergence in probability and the law of large numbers

Convergence in Probability
Let z₁, z₂, . . . , z_n, . . . be a sequence of random variables. The sequence z_n is said to converge in probability to a constant c if, for any ε > 0,

lim_{n→∞} P(|z_n − c| ≥ ε) = 0.

That is, the probability that z_n is in the range c − ε to c + ε tends to 1 as n → ∞. Notation:

z_n →ᵖ c or plim z_n = c

A sequence of random vectors (matrices) converges in probability if each element converges in probability.

Useful result: if z_n →ᵖ c and y_n →ᵖ d, then

z_n + y_n →ᵖ c + d and z_n y_n →ᵖ cd

Christian Conrad (Heidelberg University) Winter term 2012/13 48 / 88

Page 49: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Law of Large Numbers (LLN)
If X₁, . . . , X_n are independently and identically distributed (i.i.d.) with E[X_i] = µ_X and Var[X_i] < ∞, then

X̄_n = (1/n) ∑_{i=1}^{n} X_i →ᵖ µ_X.

The LLN says that the sample average X̄_n converges in probability to the population mean.

We can prove the LLN by using Chebychev’s inequality (see Appendix 17.2): if Y is a random variable and c is any constant, then

P(|Y − c| ≥ ε) ≤ E[(Y − c)²] / ε²

for any positive constant ε.

Christian Conrad (Heidelberg University) Winter term 2012/13 49 / 88

Page 50: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 50 / 88

Page 51: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Convergence in distribution and the central limit theorem

Convergence in Distribution
Let S_n be a sequence of random variables and F_n the cdf of S_n. We say that S_n converges in distribution to a random variable S if the cdf F_n of S_n converges to the cdf F of S at every continuity point of F. We call F the asymptotic distribution of S_n. Notation:

S_n →ᵈ S or simply S_n →ᵈ F

Central Limit Theorem (CLT)
Let X_i be i.i.d. with E[X_i] = µ_X and Var[X_i] = σ_X² < ∞. Then

√n (X̄_n − µ_X) / σ_X →ᵈ N(0, 1).

The central limit theorem states that the distribution of the standardized sample average becomes arbitrarily well approximated by the standard normal distribution as n → ∞.

Christian Conrad (Heidelberg University) Winter term 2012/13 51 / 88
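
A simulation sketch of the CLT, using i.i.d. exponential draws (a clearly non-normal distribution) as an assumed example; a tail probability of the standardized sample average is compared with the standard normal one.

```python
import numpy as np
from scipy.stats import norm

# Sketch: standardized sample averages of Exp(1) draws are approximately N(0,1).
rng = np.random.default_rng(3)
n, reps = 500, 50_000
mu, sigma = 1.0, 1.0                              # mean and std. dev. of Exp(1)

x = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma    # standardized sample averages

print(np.mean(t > 1.96))        # ~0.025
print(1 - norm.cdf(1.96))       # 0.0250 for N(0, 1)
```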

Page 52: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 52 / 88

Page 53: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Christian Conrad (Heidelberg University) Winter term 2012/13 53 / 88

Page 54: Review Course Statistics Probability Theory - Statistical

1 Review of Probability Theory1.5 Large-sample approximations to the sampling distributions

Slutsky’s Theorem

If z_n →ᵖ c and S_n →ᵈ S, then

1. z_n + S_n →ᵈ c + S
2. z_n S_n →ᵈ cS
3. S_n / z_n →ᵈ S/c if c ≠ 0

Continuous Mapping Theorem
If g is a continuous function, then

1. if z_n →ᵖ c, then g(z_n) →ᵖ g(c)
2. if S_n →ᵈ S, then g(S_n) →ᵈ g(S)

Christian Conrad (Heidelberg University) Winter term 2012/13 54 / 88

Page 55: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics

2 Review of Statistics
2.1 Estimation of the population mean
2.2 Hypothesis tests concerning the population mean
2.3 Confidence intervals for the population mean

Christian Conrad (Heidelberg University) Winter term 2012/13 55 / 88

Page 56: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics

Statistical tools help us answer questions about unknown characteristics of distributions in populations of interest.
E.g. what is the mean of the distribution of earnings of recent college graduates?

The key insight of statistics is that one can learn about a population distribution by selecting a random sample from that population.
E.g. rather than survey the entire U.S. population, we might survey, say, 1000 members of the population, selected by random sampling.

Most of the interesting questions in economics involve relationships between two or more variables or comparisons between different populations.
E.g. is there a gap between the mean earnings for male and female recent college graduates?

Christian Conrad (Heidelberg University) Winter term 2012/13 56 / 88

Page 57: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics

Three types of statistical methods are used in econometrics:

estimation: computing a “best guess” numerical value for an unknowncharacteristic of a population distribution.

hypothesis testing: formulating a specific hypothesis about thepopulation, then using sample evidence to decide whether it is true.

confidence intervals: using a set of data to estimate an interval or rangefor an unknown population characteristic.

Christian Conrad (Heidelberg University) Winter term 2012/13 57 / 88

Page 58: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Estimator and estimate

Let X₁, . . . , X_n be a sequence of i.i.d. random variables and ϑ an unknown characteristic of the distribution of X_i.

A function ϑ̂_n = ϑ̂(X₁, . . . , X_n) is called an estimator of ϑ.

The realized value ϑ̂(x₁, . . . , x_n) of an estimator ϑ̂(X₁, . . . , X_n) is called the estimate of ϑ based on the sample x₁, . . . , x_n.

While ϑ̂(X₁, . . . , X_n) is a random variable, ϑ̂(x₁, . . . , x_n) is a nonrandom number.

Christian Conrad (Heidelberg University) Winter term 2012/13 58 / 88

Page 59: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Properties of an estimator
An estimator ϑ̂_n of an unknown parameter ϑ is

unbiased if

E(ϑ̂_n) = ϑ

asymptotically unbiased if

lim_{n→∞} E(ϑ̂_n) = ϑ

consistent if ϑ̂_n converges in probability to ϑ, that is, for all ε > 0 we have

lim_{n→∞} P(|ϑ̂_n − ϑ| ≥ ε) = 0

Let ϑ̃_n be another estimator of ϑ and suppose that both ϑ̂_n and ϑ̃_n are unbiased. Then ϑ̂_n is said to be more efficient than ϑ̃_n if Var[ϑ̂_n] < Var[ϑ̃_n].

Christian Conrad (Heidelberg University) Winter term 2012/13 59 / 88

Page 60: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Mean square error
The mean square error (MSE) of an estimator is defined as

MSE(ϑ̂_n) = E[(ϑ̂_n − ϑ)²] = (E[ϑ̂_n] − ϑ)² + Var[ϑ̂_n].

Hence, for unbiased estimators: MSE(ϑ̂_n) = Var(ϑ̂_n).

If the estimator is asymptotically unbiased and its variance goes to zero, i.e.

lim_{n→∞} E[ϑ̂_n] − ϑ = 0 and lim_{n→∞} Var[ϑ̂_n] = 0,

ϑ̂_n is said to converge in mean square to ϑ, which implies that ϑ̂_n is consistent.¹ Why?

¹A sequence of random variables z_n converges in mean square to a constant c if lim_{n→∞} E[(z_n − c)²] = 0. Convergence in mean square implies convergence in probability.

Christian Conrad (Heidelberg University) Winter term 2012/13 60 / 88

Page 61: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Properties of X̄

Example: What are the properties of the following three estimators of µ_X = E(X)?

a) ϑ̂_a = (1/n) ∑_{i=1}^{n} X_i

b) ϑ̂_b = X₁

c) ϑ̂_c = ϑ̂_a + 1/n

Christian Conrad (Heidelberg University) Winter term 2012/13 61 / 88
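
A Monte Carlo sketch comparing the three estimators on the previous slide, under an assumed population X ∼ N(µ, 1) with µ = 2 and sample size n = 50:

```python
import numpy as np

# Sketch: bias and variance of the three estimators of mu_X by simulation.
rng = np.random.default_rng(4)
mu, n, reps = 2.0, 50, 20_000

x = rng.normal(loc=mu, scale=1.0, size=(reps, n))
est_a = x.mean(axis=1)           # sample average: unbiased, variance 1/n
est_b = x[:, 0]                  # first observation: unbiased, variance does not shrink
est_c = est_a + 1.0 / n          # biased (bias 1/n), but asymptotically unbiased

for name, est in [("a", est_a), ("b", est_b), ("c", est_c)]:
    print(name, "mean:", round(est.mean(), 3), "variance:", round(est.var(), 4))
```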

Page 62: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Efficiency of X̄

Let µ̂_X be an estimator of µ_X that is a linear function of X₁, . . . , X_n, that is, µ̂_X = ∑_{i=1}^{n} a_i X_i, where a₁, . . . , a_n are nonrandom constants. If µ̂_X is unbiased, then Var(X̄) < Var(µ̂_X) unless µ̂_X = X̄. Thus X̄ is the Best Linear Unbiased Estimator (BLUE); that is, X̄ is the most efficient estimator of µ_X among all unbiased estimators that are linear in the X_i. (Proof?)

Another motivation: for which choice of m is

∑_{i=1}^{n} (X_i − m)²

minimized, i.e. what is the best predictor of X_i in a mean square error sense? Again, this is X̄! (see Appendix 3.2)

Christian Conrad (Heidelberg University) Winter term 2012/13 62 / 88

Page 63: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Estimating the variance and covariance

The variance σ_X² can be estimated by the sample variance:

s_X² = (1/(n − 1)) ∑_{i=1}^{n} (X_i − X̄)²

The sample variance is an unbiased estimator of the population variance (this is to be shown in E3.18). The sample variance is also consistent (see Appendix 3.3). Is s_X = √(s_X²) a consistent estimator of σ_X?

Similarly, we can show that the sample covariance

s_{XY} = (1/(n − 1)) ∑_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)

is unbiased and consistent (see E3.20).

Christian Conrad (Heidelberg University) Winter term 2012/13 63 / 88
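
A small sketch with simulated data: the unbiased sample variance and covariance use the 1/(n − 1) factor, which in NumPy corresponds to ddof=1 (np.cov uses it by default).

```python
import numpy as np

# Sketch: sample variance and covariance with the 1/(n-1) normalization.
rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
n = x.size

s2_x = np.var(x, ddof=1)              # sample variance with 1/(n-1)
s_xy = np.cov(x, y)[0, 1]             # sample covariance (1/(n-1) by default)

print(s2_x, np.sum((x - x.mean()) ** 2) / (n - 1))             # same number
print(s_xy, np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)) # same number
```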

Page 64: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Example E3.2: Let Y be a Bernoulli random variable with success probability P(Y = 1) = p, and let Y₁, . . . , Y_n be i.i.d. draws from this distribution. Let p̂ be the fraction of successes (1s) in this sample.

1. Show that p̂ = Ȳ.
2. Show that p̂ is an unbiased estimator of p.
3. Show that Var(p̂) = p(1 − p)/n.

Example E3.18: This exercise shows that the sample variance is an unbiased estimator of the population variance when Y₁, . . . , Y_n are i.i.d. with mean µ_Y and variance σ_Y².

1. Use the equation Var(aX + bY) = a² σ_X² + 2ab σ_{XY} + b² σ_Y² to show that E[(Y_i − Ȳ)²] = Var(Y_i) − 2 Cov(Y_i, Ȳ) + Var(Ȳ).
2. Use the equation Cov(a + bX + cV, Y) = b σ_{XY} + c σ_{VY} to show that Cov(Ȳ, Y_i) = σ_Y²/n.
3. Use the results in 1. and 2. to show that E[s_Y²] = σ_Y².

Christian Conrad (Heidelberg University) Winter term 2012/13 64 / 88

Page 65: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.1 Estimation of the population mean

Example E3.19:

1. Ȳ is an unbiased estimator of µ_Y. Is Ȳ² an unbiased estimator of µ_Y²?

2. Ȳ is a consistent estimator of µ_Y. Is Ȳ² a consistent estimator of µ_Y²?

Christian Conrad (Heidelberg University) Winter term 2012/13 65 / 88

Page 66: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

We need to introduce some more distributions

Gauss statistic
Consider the sequence X₁, . . . , X_n of random variables with X_i i.i.d. ∼ N(µ_X, σ_X²). X̄_n = (1/n) ∑_{i=1}^{n} X_i is an estimator of µ_X with

X̄_n ∼ N(µ_X, σ_X²/n) and Z = (X̄_n − µ_X) / (σ_X/√n) ∼ N(0, 1).

Z is called the Gauss statistic.

χ²-distribution
Consider the sequence X₁, . . . , X_n of random variables with X_i i.i.d. ∼ N(0, 1). Then

Y = ∑_{i=1}^{n} X_i² ∼ χ²(n).

We say that Y is χ²-distributed with n degrees of freedom. What are the mean and variance of Y?
Later on we will use that (n − 1) s_X²/σ_X² ∼ χ²(n − 1). (To get some intuition see E2.24.)

Christian Conrad (Heidelberg University) Winter term 2012/13 66 / 88

Page 67: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

t-distribution
Consider the random variables X ∼ N(0, 1) and Y ∼ χ²(n), where X and Y are independent. Then the distribution of the random variable

T = X / √(Y/n)

is called the t-distribution with n degrees of freedom.

Christian Conrad (Heidelberg University) Winter term 2012/13 67 / 88

Page 68: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

Null and alternative hypothesis

The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to a second hypothesis, called the alternative hypothesis, which holds if the null does not.

We are interested in testing the hypothesis that the population mean µ_X takes on a specific value, µ_{X,0}:

H₀ : µ_X = µ_{X,0}

The most general alternative hypothesis is

H₁ : µ_X ≠ µ_{X,0}.

Because under the alternative µ_X can be either less than or greater than µ_{X,0}, it is called a two-sided alternative.

Christian Conrad (Heidelberg University) Winter term 2012/13 68 / 88

Page 69: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

The problem facing the statistician is to use the evidence in a randomly selected sample of data to decide whether to “accept” the null hypothesis or to reject it in favor of the alternative hypothesis.

In any given sample, the sample average X̄ will rarely be exactly equal to the hypothesized value µ_{X,0}. Differences between X̄ and µ_{X,0} can arise because the true mean in fact does not equal µ_{X,0} (the null hypothesis is false) or because the mean equals µ_{X,0} (the null hypothesis is true) but X̄ differs from µ_{X,0} because of random sampling.

When we undertake a statistical test, we can make two types of mistakes: we can incorrectly reject the null hypothesis when it is true (type I error), or we can fail to reject the null hypothesis when it is false (type II error).

Christian Conrad (Heidelberg University) Winter term 2012/13 69 / 88

Page 70: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

How to proceed:

Specify the null and alternative hypothesis

Prespecify the probability of making a type I error, i.e. the significance level α:

α = P(rejecting H₀ | H₀ is true).

Typically, we choose α = 0.05.

Derive a test statistic T = T(X₁, . . . , X_n) and its distribution under H₀.

Determine a critical value such that the null hypothesis will be rejected if the test statistic exceeds this value. The set of values of the test statistic for which the test rejects the null hypothesis is the rejection region (R), and the set of values of the test statistic for which it does not reject the null hypothesis is the acceptance region (A).

The null hypothesis is rejected if t^act = T(x₁, . . . , x_n) ∈ R.

Christian Conrad (Heidelberg University) Winter term 2012/13 70 / 88

Page 71: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

The t-Statistic (two-sided test)

X₁, . . . , X_n are i.i.d. N(µ_X, σ_X²).

Case 1: µ_X unknown, σ_X² known.

1. H₀ : µ_X = µ_{X,0} against H₁ : µ_X ≠ µ_{X,0}
2. e.g. α = 5%
3. If H₀ is true, then

T = √n (X̄ − µ_{X,0}) / σ_X ∼ N(0, 1)

4. Determine the critical value:

α = P(|T| > z_{1−α/2})

5. Reject H₀ ⇔ |t^act| > z_{1−α/2}. (For α = 0.05 the critical value is z₀.₉₇₅ = 1.96.)

Christian Conrad (Heidelberg University) Winter term 2012/13 71 / 88

Page 72: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

Case 2: µ_X unknown, σ_X² unknown.

What happens if we replace σ_X by s_X in the test statistic?

√n (X̄ − µ_{X,0}) / s_X = [ √n (X̄ − µ_{X,0}) / σ_X ] / √[ ((n − 1) s_X²/σ_X²) / (n − 1) ] ∼ t(n − 1)

since the numerator is N(0, 1), the denominator is the square root of a χ²(n − 1) random variable divided by its degrees of freedom, and X̄ and s_X² are independently distributed.

Christian Conrad (Heidelberg University) Winter term 2012/13 72 / 88

Page 73: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

Now: X₁, . . . , X_n are i.i.d. with E[X_i⁴] < ∞, but the distribution is unknown.

Show that in this case:

T = √n (X̄ − µ_{X,0}) / s_X →ᵈ N(0, 1)

Thus, as long as the sample size is large, the distribution of the test statistic is well approximated by the standard normal distribution.

Christian Conrad (Heidelberg University) Winter term 2012/13 73 / 88

Page 74: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

The p-Value

The p-value is the probability of drawing a statistic at least as adverse to the null hypothesis as the one you actually computed in your sample, assuming the null hypothesis is correct:

p-value = P( |X̄ − µ_{X,0}| / SE(X̄) > |X̄^act − µ_{X,0}| / SE(X̄) ) = P(|T| > |t^act|) = 2(1 − Φ(|t^act|))

with SE(X̄) = s_X/√n being the standard error of X̄.

For a prespecified α, reject the null hypothesis if p < α; otherwise do not reject.

Christian Conrad (Heidelberg University) Winter term 2012/13 74 / 88
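
A sketch of the large-sample two-sided test, with an assumed simulated sample and an assumed hypothesized value µ_{X,0} = 2; the p-value uses the normal approximation from the previous slides.

```python
import numpy as np
from scipy.stats import norm

# Sketch: two-sided test of H0: mu_X = mu_0 via the t-statistic and its p-value.
rng = np.random.default_rng(7)
x = rng.normal(loc=2.1, scale=0.5, size=200)   # hypothetical sample
mu_0 = 2.0

se = x.std(ddof=1) / np.sqrt(x.size)           # SE(X_bar) = s_X / sqrt(n)
t_act = (x.mean() - mu_0) / se                 # realized test statistic
p_value = 2 * (1 - norm.cdf(abs(t_act)))       # two-sided p-value

print(t_act, p_value, p_value < 0.05)          # reject H0 at the 5% level if True
```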

Page 75: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

Christian Conrad (Heidelberg University) Winter term 2012/13 75 / 88

Page 76: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.2 Hypothesis tests concerning the population mean

Sometimes we are interested in a one-sided alternative hypothesis, which can be written as

H₁ : µ_X > µ_{X,0}

or

H₁ : µ_X < µ_{X,0}.

For H₁ : µ_X > µ_{X,0} the p-value is given by

p-value = 1 − Φ(t^act),

and for H₁ : µ_X < µ_{X,0} it is Φ(t^act).

The N(0, 1) critical value for a one-sided test with α = 0.05 is 1.64 or −1.64, respectively.

Christian Conrad (Heidelberg University) Winter term 2012/13 76 / 88

Page 77: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.3 Confidence intervals for the population mean

Because of random sampling error, it is impossible to learn the exact value of the population mean of X using only the information in a sample. However, it is possible to use data from a random sample to construct a set of values that contains the true population mean µ_X with a certain prespecified probability. Such a set is called a confidence interval, and the prespecified coverage probability (e.g. 95%, corresponding to α = 0.05) is called the confidence level.
E.g., for α = 0.05 the 95% confidence interval for the mean corresponds to the set of values for which the null hypothesis cannot be rejected:

[X̄ − 1.96 SE(X̄); X̄ + 1.96 SE(X̄)]

Christian Conrad (Heidelberg University) Winter term 2012/13 77 / 88

Page 78: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.3 Confidence intervals for the population mean

Example:

Assume that the body height X of the participants of a statistics lecture is normally distributed with σ_X = 10. The average height in a random sample of size n = 25 equals X̄ = 183 cm. Test the hypothesis H₀ : µ_X = 190 against the alternative H₁ : µ_X ≠ 190 at the 5% significance level. What is the p-value of the test?

Now, assume that the standard deviation is unknown. However, you know that s_X = 10 cm.

Finally, drop the assumption that X is normally distributed. Now, n = 50 and s_X = 10 cm. How would you proceed?

Christian Conrad (Heidelberg University) Winter term 2012/13 78 / 88

Page 79: Review Course Statistics Probability Theory - Statistical

2 Review of Statistics2.3 Confidence intervals for the population mean

Example E2.24: Suppose Y_i is distributed i.i.d. N(0, σ_Y²) for i = 1, 2, . . . , n.

1. Show that E[Y_i²/σ_Y²] = 1.

2. Show that W = (1/σ_Y²) ∑_{i=1}^{n} Y_i² is distributed χ²(n).

3. Show that E[W] = n. [Hint: Use your answer to 1.]

4. Show that

V = Y₁ / √( (∑_{i=2}^{n} Y_i²) / (n − 1) )

is distributed t(n − 1).

Christian Conrad (Heidelberg University) Winter term 2012/13 79 / 88

Page 80: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra

3 Matrix Algebra
3.1 Basic principles
3.2 Multivariate statistics

Christian Conrad (Heidelberg University) Winter term 2012/13 80 / 88

Page 81: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Basic principles

A matrix A is an n × K rectangular array of numbers, written as

A = [ a_{1,1} a_{1,2} · · · a_{1,K} ]
    [ a_{2,1} a_{2,2} · · · a_{2,K} ]
    [   ⋮       ⋮      ⋱     ⋮     ]
    [ a_{n,1} a_{n,2} · · · a_{n,K} ]
  = (a_{i,j})_{i=1,...,n; j=1,...,K}.

The transpose of a matrix, denoted by A′, is obtained by flipping the matrix on its diagonal; the (i, j) element of A′ is a_{j,i}, so that A′ is K × n:

A′ = (a_{j,i})_{i=1,...,K; j=1,...,n}.

Example:

A = [ 1  2  3 ]
    [ 0 −6  7 ]

A′ = [ 1  0 ]
     [ 2 −6 ]
     [ 3  7 ]

Christian Conrad (Heidelberg University) Winter term 2012/13 81 / 88

Page 82: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Special matrices

A matrix A is

square if n = K.

symmetric if A = A′, which requires a_{i,j} = a_{j,i}.

diagonal if the off-diagonal elements are all zero, so that a_{i,j} = 0 if i ≠ j.

upper (lower) triangular if all elements below (above) the diagonal equal zero.

An important diagonal matrix is the identity matrix, which has ones on the diagonal and zeros elsewhere. The K × K identity matrix is denoted by

I_K = [ 1 0 · · · 0 ]
      [ 0 1 · · · 0 ]
      [ ⋮ ⋮  ⋱   ⋮ ]
      [ 0 0 · · · 1 ].

Christian Conrad (Heidelberg University) Winter term 2012/13 82 / 88

Page 83: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Basic operations

Matrix addition

A = (a_{i,j})_{i=1,...,n; j=1,...,K}, B = (b_{i,j})_{i=1,...,n; j=1,...,K}.

C = A + B = (c_{i,j})_{i=1,...,n; j=1,...,K} = (a_{i,j} + b_{i,j})_{i=1,...,n; j=1,...,K}.

Example:

[ 1 3 1 ]   [ 0 0 5 ]   [ 1+0 3+0 1+5 ]   [ 1 3 6 ]
[ 1 0 0 ] + [ 7 5 0 ] = [ 1+7 0+5 0+0 ] = [ 8 5 0 ]

Scalar multiplication

A = (a_{i,j})_{i=1,...,n; j=1,...,K}, λ ∈ R.

λ · A = (λ · a_{i,j})_{i=1,...,n; j=1,...,K}.

Example:

4 · [ 1  2  3 ] = [ 4   8  12 ]
    [ 0 −6  7 ]   [ 0 −24  28 ]

Christian Conrad (Heidelberg University) Winter term 2012/13 83 / 88

Page 84: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Matrix multiplication

A = (a_{i,j})_{i=1,...,l; j=1,...,n}, B = (b_{i,j})_{i=1,...,n; j=1,...,K}.

C = A · B = (c_{i,j})_{i=1,...,l; j=1,...,K} with c_{i,j} = ∑_{k=1}^{n} a_{i,k} · b_{k,j}.

Example:

[  1 0 2 ]   [ 3 1 ]   [ 5 1 ]
[ −1 3 1 ] × [ 2 1 ] = [ 4 2 ]
             [ 1 0 ]

with

5 = 1 · 3 + 0 · 2 + 2 · 1;

1 = 1 · 1 + 0 · 1 + 2 · 0;

4 = −1 · 3 + 3 · 2 + 1 · 1;

2 = −1 · 1 + 3 · 1 + 1 · 0.

Christian Conrad (Heidelberg University) Winter term 2012/13 84 / 88
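
The matrix product example above can be reproduced with NumPy; this is only a numerical check of the c_{i,j} = ∑_k a_{i,k} b_{k,j} rule.

```python
import numpy as np

# Sketch: verify the matrix product example entry by entry.
A = np.array([[1, 0, 2],
              [-1, 3, 1]])
B = np.array([[3, 1],
              [2, 1],
              [1, 0]])

C = A @ B                         # matrix product, shape (2, 2)
print(C)                          # [[5 1], [4 2]]

# c_{i,j} = sum_k a_{i,k} * b_{k,j}, e.g. the (0, 0) entry:
print(np.sum(A[0, :] * B[:, 0]))  # 5
```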

Page 85: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Properties

Matrix addition
i) A + B = B + A
ii) (A + B) + C = A + (B + C)

Matrix multiplication
i) (A · B) · C = A · (B · C)
ii) in general, A · B ≠ B · A

Example:

[ 1 2 ]   [ 0 1 ]   [ 0 1 ]        [ 0 1 ]   [ 1 2 ]   [ 3 4 ]
[ 3 4 ] · [ 0 0 ] = [ 0 3 ],       [ 0 0 ] · [ 3 4 ] = [ 0 0 ].

iii) A · (B + C) = A · B + A · C and (B + C) · A = B · A + C · A

iv) Multiplication with the identity matrix: for any n × K matrix M,

M · I_K = I_n · M = M.

v) A matrix A is called idempotent if A · A = A.

vi) Transpose of a product: (A · B · C)′ = C′ · B′ · A′

Christian Conrad (Heidelberg University) Winter term 2012/13 85 / 88

Page 86: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Quadratic form

Consider a symmetric matrix A ∈ R^{K×K} and a vector x ∈ R^{K×1}. The expression

x′Ax = ∑_{i=1}^{K} ∑_{j=1}^{K} x_i a_{i,j} x_j

is called a quadratic form.

Definitions:

A is positive definite if x′Ax > 0 for all x ≠ 0.
A is negative definite if x′Ax < 0 for all x ≠ 0.

Problem: Show that the matrix

A = [  2 −1  0 ]
    [ −1  2 −1 ]
    [  0 −1  2 ]

is positive definite.

Christian Conrad (Heidelberg University) Winter term 2012/13 86 / 88

Page 87: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Rank and inverse of a matrix
The rank of the n × K matrix (K ≤ n)

A = (a₁, . . . , a_K)

is the number of linearly independent columns a_j and is written as rank(A). A has full rank if rank(A) = K.

Properties:

A square k × k matrix A is said to be nonsingular if it has full rank, i.e. rank(A) = k. This means that there is no k × 1 vector c ≠ 0 such that Ac = 0.

If a square k × k matrix A is nonsingular, then there exists a unique k × k matrix A⁻¹, called the inverse of A, which satisfies

AA⁻¹ = A⁻¹A = I_k.

If A is positive or negative definite, then A is nonsingular.

Problem: Compute the inverse of the matrix

A = [ 8 2 ]
    [ 2 1 ].

Christian Conrad (Heidelberg University) Winter term 2012/13 87 / 88
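
As a numerical sketch of the two problems above: a symmetric matrix is positive definite exactly when all its eigenvalues are positive, and a nonsingular matrix has an inverse satisfying AA⁻¹ = I.

```python
import numpy as np

# Sketch: check positive definiteness via eigenvalues and compute an inverse.
A1 = np.array([[2.0, -1.0, 0.0],
               [-1.0, 2.0, -1.0],
               [0.0, -1.0, 2.0]])
print(np.linalg.eigvalsh(A1))                 # all > 0, so A1 is positive definite

A2 = np.array([[8.0, 2.0],
               [2.0, 1.0]])
A2_inv = np.linalg.inv(A2)
print(A2_inv)                                 # [[ 0.25 -0.5 ], [-0.5  2.0 ]]
print(np.allclose(A2 @ A2_inv, np.eye(2)))    # True: A A^{-1} = I
```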

Page 88: Review Course Statistics Probability Theory - Statistical

3 Matrix Algebra3.1 Basic principles

Trace of a matrix

The trace of a k × k square matrix A is defined to be the sum of the elements on its main diagonal, i.e.

tr(A) = ∑_{i=1}^{k} a_{i,i}.

Properties for square matrices A and B and real λ:

tr(λA) = λ tr(A);

tr(A′) = tr(A);

tr(A + B) = tr(A) + tr(B);

tr(I_k) = k;

if A is an n × K matrix and B is a K × n matrix, then

tr(AB) = tr(BA).

Christian Conrad (Heidelberg University) Winter term 2012/13 88 / 88
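
A quick numerical sketch of the last property, with assumed random matrices of conformable dimensions:

```python
import numpy as np

# Sketch: verify tr(AB) = tr(BA) for an n x K matrix A and a K x n matrix B.
rng = np.random.default_rng(6)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 4))

print(np.trace(A @ B), np.trace(B @ A))                # equal up to rounding
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))    # True
```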