1
Bayesian Essentials and Bayesian Regression
2
Distribution Theory 101
Marginal and Conditional Distributions:
$$p_Y(y) = \int p_{X,Y}(x,y)\,dx = \int p_{Y|X}(y|x)\,p_X(x)\,dx$$

$$p_X(x) = \int p_{X,Y}(x,y)\,dy$$

Example: $p_{X,Y}(x,y) = 2$ on $0 < y < x < 1$.

$$p_X(x) = \int_0^x 2\,dy = 2y\Big|_0^x = 2x$$

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = \frac{2}{2x} = \frac{1}{x}, \quad \text{so } Y \mid X = x \sim \text{Uniform}(0, x)$$
3
Simulating from Joint
$$p_{X,Y}(x,y) = p_{Y|X}(y|x)\,p_X(x)$$

To draw from the joint (a sketch in R follows below):
i. draw from the marginal on X
ii. conditioning on this draw, draw from the conditional of Y|X
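A minimal R sketch of this composition approach, using the joint density p(x,y) = 2 on 0 < y < x < 1 from the previous slide (the inverse-CDF step for X is an added detail, not from the slides):

R = 10000
u = runif(R)
x = sqrt(u)                     # inverse CDF of p_X(x) = 2x: F_X(x) = x^2
y = runif(R, min = 0, max = x)  # y | x ~ Uniform(0, x)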
4
The Goal of Inference

Make inferences about unknown quantities using available information.

Inference: make probability statements about unknowns.

Unknowns:
- parameters, functions of parameters, states or latent variables
- "future" outcomes, outcomes conditional on an action

Information:
- data-based
- non data-based: theories of behavior; "subjective views" that there is an underlying structure, or that parameters are finite or in some range
5
Data Aspects of Marketing Problems
Ex: Conjoint Survey. 500 respondents rank, rate, or choose among product configurations. Small amount of information per respondent; the response variable is discrete.

Ex: Retail Scanning Data. Very large number of products; large number of geographical units (markets, zones, stores); limited variation in some marketing-mix variables. Must make plausible predictions for decision making!
6
The likelihood principle
LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

Implies analysis is done conditional on the data:

$$\ell(\theta) \propto p(D \mid \theta)$$

Note: any function proportional to the data density can be called the likelihood.
7
Bayes theorem:

$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\,p(\theta)}{p(D)} \propto p(D \mid \theta)\,p(\theta)$$

Posterior $\propto$ "Likelihood" $\times$ Prior

Modern Bayesian computing: simulation methods for generating draws from the posterior distribution p(θ|D).
8
Summarizing the posterior
Output from Bayesian inference: a high-dimensional distribution $p(\theta \mid D)$.

Summarize this object via simulation: marginal distributions of the $\theta_i$ and of functions $h(\theta)$; don't just compute $E[\theta \mid D]$ and $\text{Var}(\theta \mid D)$.
9
Prediction
$$p(\tilde D \mid D) = \int p(\tilde D \mid \theta)\,p(\theta \mid D)\,d\theta \qquad (\neq p(\tilde D \mid \hat\theta)\,!!!)$$

See D, compute $p(\tilde D \mid D)$: the "Predictive Distribution."

Assumes $p(\tilde D, D \mid \theta) = p(\tilde D \mid \theta)\,p(D \mid \theta)$.
10
Decision theory
Loss: $L(a, \theta)$, where $a$ = action and $\theta$ = state of nature.

Bayesian decision theory:

$$\min_a \bar L(a), \qquad \bar L(a) = E_{\theta|D}\!\left[L(a, \theta)\right] = \int L(a, \theta)\,p(\theta \mid D)\,d\theta$$

The estimation problem is a special case:

$$\min_{\hat\theta} \bar L(\hat\theta); \qquad \text{typically } L(\hat\theta, \theta) = (\hat\theta - \theta)'A(\hat\theta - \theta)$$
11
Sampling properties of Bayes estimators
The risk of an estimator:

$$r(\theta) = E_{D|\theta}\!\left[L(\hat\theta(D), \theta)\right] = \int L(\hat\theta(D), \theta)\,p(D \mid \theta)\,dD$$

An estimator is admissible if there exists no other estimator with lower risk for all values of θ. The Bayes estimator minimizes expected (average) risk, which implies it is admissible:

$$E_\theta\!\left[r(\theta)\right] = E_\theta\,E_{D|\theta}\!\left[L(\hat\theta, \theta)\right] = E_D\,E_{\theta|D}\!\left[L(\hat\theta, \theta)\right]$$

The Bayes estimator does the best for every D. Therefore, it must work at least as well as any other estimator.
12
Bayes Inference: Summary
Bayesian Inference delivers an integrated approach to:
- Inference, including "estimation" and "testing"
- Prediction, with a full accounting for uncertainty
- Decision, with likelihood and loss (these are distinct!)
Bayesian Inference is conditional on available info.
The right answer to the right question.
Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?
13
Bayes/Classical Estimators
Asymptotics: as $n \to \infty$ with $p$ fixed, the prior washes out (it is locally uniform!). Bayes is consistent unless you have a dogmatic prior. The posterior approaches the sampling distribution of the MLE:

$$p(\theta \mid D) \approx N\!\left(\hat\theta_{MLE},\ H^{-1}\right)$$

where H is the information matrix.
14
Benefits/Costs of Bayes Inf
Benefits:
- finite sample answer to the right question
- full accounting for uncertainty
- integrated approach to inference/decision

"Costs":
- computational (true any more? < classical!!)
- prior (cost or benefit?), esp. with many parms (hierarchical/non-parametric problems)
15
Bayesian Computations
Before simulation methods, Bayesians used posterior expectations of various functions as summaries of the posterior:

$$E\!\left[h(\theta) \mid D\right] = \int h(\theta)\,p(\theta \mid D)\,d\theta = \int h(\theta)\,\frac{p(D \mid \theta)\,p(\theta)}{p(D)}\,d\theta$$

$$\text{note: } p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta$$

If p(θ|D) is in a convenient form (e.g. normal), then I might be able to compute this for some h. Via iid simulation, I can compute it for all h (see the sketch below).
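A minimal sketch of the simulation idea: average h over iid posterior draws. The Beta(3, 7) posterior and the log-odds function are illustrative stand-ins, not from the slides:

draws = rbeta(100000, 3, 7)                 # stand-in iid posterior draws
h = function(theta) log(theta/(1 - theta))  # any function h of interest, e.g. log-odds
mean(h(draws))                              # Monte Carlo estimate of E[h(theta)|D]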
16
Conjugate Families
Models with convenient analytic properties almost invariably come from conjugate families.

Why do I care now?
- conjugate models are used as building blocks
- they build intuition about the workings of Bayesian inference

Definition: A prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior.

Basically, conjugate priors are like the posterior from some imaginary dataset analyzed with a diffuse prior.
17
Beta-Binomial model
$$y_i \sim \text{Bernoulli}(\theta)$$

$$p(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{y}(1-\theta)^{n-y}, \qquad \text{where } y = \sum_{i=1}^n y_i$$

$$p(\theta \mid y) = \ ?$$

Need a prior!
18
Beta distribution:

$$\text{Beta}(\alpha, \beta): \quad p(\theta) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1}, \qquad E[\theta] = \alpha/(\alpha + \beta)$$
[Figure: Beta densities for (a, b) = (2, 4), (3, 3), (4, 2)]
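A short R sketch that reproduces the figure (layout details are assumptions):

theta = seq(0.001, 0.999, length.out = 500)
plot(theta, dbeta(theta, 2, 4), type = "l", xlab = "theta", ylab = "density")
lines(theta, dbeta(theta, 3, 3), lty = 2)
lines(theta, dbeta(theta, 4, 2), lty = 3)
legend("topright", c("a=2, b=4", "a=3, b=3", "a=4, b=2"), lty = 1:3)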
19
Posterior
$$p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta) \propto \theta^{y}(1-\theta)^{n-y}\,\theta^{\alpha - 1}(1-\theta)^{\beta - 1} = \theta^{\alpha + y - 1}(1-\theta)^{\beta + n - y - 1}$$

$$\theta \mid D \sim \text{Beta}(\alpha + y,\ \beta + n - y)$$
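A minimal sketch of the conjugate update in R; the prior (a = b = 2) and the data (y = 7 successes in n = 10 trials) are illustrative values, not from the slides:

a = 2; b = 2      # illustrative Beta prior
y = 7; n = 10     # illustrative data
draws = rbeta(10000, a + y, b + n - y)   # theta | D ~ Beta(a+y, b+n-y)
quantile(draws, c(0.025, 0.5, 0.975))    # posterior summary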
20
Prediction
$$\Pr(\tilde y = 1 \mid y) = \int \Pr(\tilde y = 1 \mid \theta)\,p(\theta \mid y)\,d\theta = \int_0^1 \theta\,p(\theta \mid y)\,d\theta = E[\theta \mid y]$$
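For the Beta posterior of the previous slide, this expectation is available in closed form:

$$\Pr(\tilde y = 1 \mid y) = E[\theta \mid y] = \frac{\alpha + y}{\alpha + \beta + n}$$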
21
Regression model
$$y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim \text{Normal}(0, \sigma^2)$$

$$y \mid X, \beta, \sigma^2 \sim N(X\beta,\ \sigma^2 I)$$

$$p(y \mid \beta, \sigma^2) \propto \prod_i (\sigma^2)^{-1/2} \exp\!\left(-\frac{1}{2\sigma^2}(y_i - x_i'\beta)^2\right)$$
Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?
22
Regression model
$$p(x, y) = p(x \mid \psi)\,p(y \mid x, \beta, \sigma^2)$$

If ψ is a priori independent of (β, σ²), then

$$p(\psi, \beta, \sigma^2 \mid y, X) \propto p(\psi)\prod_i p(x_i \mid \psi) \ \times\ p(\beta, \sigma^2)\prod_i p(y_i \mid x_i, \beta, \sigma^2)$$

and the problem factors into two separate analyses. This rules out x = f(β)!! Simultaneous systems are not written this way!
23
Conjugate Prior
$$\varepsilon \sim N(0, \sigma^2 I_n), \qquad \text{Jacobian: } \left|\frac{\partial \varepsilon}{\partial y}\right| = 1$$

$$p(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right)$$

What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

The quadratic form suggests a normal prior.

Let's complete the square on β, or rewrite by projecting y on X (the column space of X).
24
Geometry of regression
[Figure: y projected onto the span of x₁ and x₂; the fitted vector is $\hat y = \hat\beta_1 x_1 + \hat\beta_2 x_2$ and the residual is $e = y - \hat y$]

$$\text{"Least Squares": } \min_\beta\ e'e, \qquad e'\hat y = 0$$
25
Traditional regression
$$\hat\beta = (X'X)^{-1}X'y$$

$$e'\hat y = \hat y'e = 0 \quad\Longleftrightarrow\quad (X\hat\beta)'(y - X\hat\beta) = 0$$
No one ever computes a matrix inverse directly.
Two numerically stable methods:
QR decomposition of X
Cholesky root of X’X and compute inverse using root
Non-Bayesians have to worry about singularity or near-singularity of X'X. We don't! More on this later.
26
Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R.

The Cholesky root is the generalization of the square root, applied to positive definite matrices.

As Bayesians with proper priors, we don't ever have to worry about singular matrices!

$$\Sigma = U'U, \qquad \Sigma \text{ positive definite symmetric}, \qquad u_{ii} > 0$$

U is upper triangular with positive diagonal elements. U⁻¹ is easy to compute by recursively solving TU = I for T: backsolve() in R (see the sketch below).
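A small sketch of these two operations in R (the Σ values are illustrative):

Sigma = matrix(c(2, 0.5, 0.5, 1), 2, 2)
U = chol(Sigma)                  # upper triangular root: Sigma = U'U
Uinv = backsolve(U, diag(2))     # recursively solves U %*% T = I, so T = U^-1
Sigmainv = Uinv %*% t(Uinv)      # (U'U)^-1 = U^-1 (U^-1)'
max(abs(Sigmainv - chol2inv(U))) # agrees with chol2inv, ~ 0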
27
Cholesky Roots

Cholesky roots can be used to simulate from the multivariate normal distribution:

$$\Sigma = U'U; \quad z \sim N(0, I)\ \Rightarrow\ y = U'z \sim N(0, \Sigma), \quad \text{since } E[U'zz'U] = U'IU = \Sigma$$
To simulate a matrix of draws from MVN (each row is a separate draw) in R,
Y = matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)  # each row ~ N(0, Sigma)
Y = t(t(Y) + mu)                                  # shift every row by the mean mu
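A quick check of the simulator above (mu and Sigma are illustrative):

mu = c(1, 2)
Sigma = matrix(c(1, 0.5, 0.5, 2), 2, 2)
n = 5000; k = 2
Y = matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)
Y = t(t(Y) + mu)
colMeans(Y)   # approximately mu
var(Y)        # approximately Sigma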
28
Regression with R
data.txt:
UNIT Y X1 X2
A 1 0.23815 0.43730
A 2 0.55508 0.47938
A 3 3.03399 -2.17571
A 4 -1.49488 1.66929
B 10 -1.74019 0.35368
B 9 1.40533 -1.26120
B 8 0.15628 -0.27751
B 7 -0.93869 -0.04410
B 6 -3.06566 0.14486
df=read.table("data.txt",header=TRUE)
myreg = function(y, X) {
  #
  # purpose: compute lsq regression
  #
  # arguments:
  #   y -- vector of dep var
  #   X -- array of indep vars
  #
  # output:
  #   list containing lsq coef and std errors
  #
  XpXinv = chol2inv(chol(crossprod(X)))
  bhat = XpXinv %*% crossprod(X, y)
  res = as.vector(y - X %*% bhat)
  ssq = as.numeric(res %*% res / (nrow(X) - ncol(X)))
  se = sqrt(diag(ssq * XpXinv))
  list(b = bhat, std_errors = se)
}
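A usage sketch with the data.txt above; the intercept column is an assumption (the slide does not show the call):

df = read.table("data.txt", header = TRUE)
X = cbind(1, df$X1, df$X2)   # add an intercept
out = myreg(df$Y, X)
cbind(out$b, out$std_errors)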
29
Regression likelihood
$$(y - X\beta)'(y - X\beta) = (y - X\hat\beta)'(y - X\hat\beta) + (X\hat\beta - X\beta)'(X\hat\beta - X\beta) + 2(y - X\hat\beta)'(X\hat\beta - X\beta)$$

The cross term is zero, so

$$(y - X\beta)'(y - X\beta) = \nu s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta)$$

where

$$\nu s^2 = SSE = (y - X\hat\beta)'(y - X\hat\beta), \qquad \nu = n - k$$

Therefore,

$$p(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-k/2}\exp\!\left(-\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta)}{2\sigma^2}\right) \times (\sigma^2)^{-\nu/2}\exp\!\left(-\frac{\nu s^2}{2\sigma^2}\right)$$
30
Regression likelihood
$$p(y \mid X, \beta, \sigma^2) = \text{normal} \times\ ?$$

The "?" term is a density for σ² of the form $(\sigma^2)^{-\nu/2}\,e^{-\nu s^2/(2\sigma^2)}$.

This is called an inverted gamma distribution. It can also be related to the inverse of a chi-squared distribution.

Note: the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ².
31
Bayesian Regression
Prior:

$$p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\,p(\sigma^2)$$

$$p(\beta \mid \sigma^2) \propto (\sigma^2)^{-k/2}\exp\!\left(-\frac{1}{2\sigma^2}(\beta - \bar\beta)'A(\beta - \bar\beta)\right)$$

$$p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)}\exp\!\left(-\frac{\nu_0 s_0^2}{2\sigma^2}\right)$$

Inverted Chi-Square:

$$\sigma^2 \sim \frac{\nu_0 s_0^2}{\chi^2_{\nu_0}}$$

Interpretation: as if from another dataset.

Draw from the prior? (a sketch follows below)
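A minimal sketch of drawing (β, σ²) from this prior; the hyperparameter values and k = 2 are illustrative:

k = 2
betabar = rep(0, k); A = 0.01 * diag(k)   # beta | sigmasq ~ N(betabar, sigmasq * A^-1)
nu0 = 3; ssq0 = 1                         # sigmasq ~ nu0*ssq0 / chisq_nu0
sigmasq = nu0 * ssq0 / rchisq(1, nu0)
beta = betabar + sqrt(sigmasq) * backsolve(chol(A), rnorm(k))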
32
Posterior
$$p(\beta, \sigma^2 \mid D) \propto \ell(\beta, \sigma^2)\,p(\beta \mid \sigma^2)\,p(\sigma^2)$$

$$\propto (\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right)$$

$$\times\ (\sigma^2)^{-k/2}\exp\!\left(-\frac{1}{2\sigma^2}(\beta - \bar\beta)'A(\beta - \bar\beta)\right)$$

$$\times\ (\sigma^2)^{-(\nu_0/2 + 1)}\exp\!\left(-\frac{\nu_0 s_0^2}{2\sigma^2}\right)$$
33
Combining quadratic forms
$$(y - X\beta)'(y - X\beta) + (\beta - \bar\beta)'A(\beta - \bar\beta) = (y - X\beta)'(y - X\beta) + (U\beta - U\bar\beta)'(U\beta - U\bar\beta), \qquad A = U'U$$

$$= (v - W\beta)'(v - W\beta), \qquad v = \binom{y}{U\bar\beta}, \quad W = \binom{X}{U}$$

$$(v - W\beta)'(v - W\beta) = ns^2 + (\beta - \tilde\beta)'W'W(\beta - \tilde\beta)$$

$$\tilde\beta = (W'W)^{-1}W'v = (X'X + A)^{-1}(X'X\hat\beta + A\bar\beta)$$

$$ns^2 = (v - W\tilde\beta)'(v - W\tilde\beta) = (y - X\tilde\beta)'(y - X\tilde\beta) + (\tilde\beta - \bar\beta)'A(\tilde\beta - \bar\beta)$$
34
Posterior
$$p(\beta, \sigma^2 \mid D) \propto (\sigma^2)^{-k/2}\exp\!\left(-\frac{1}{2\sigma^2}(\beta - \tilde\beta)'(X'X + A)(\beta - \tilde\beta)\right) \times (\sigma^2)^{-((n + \nu_0)/2 + 1)}\exp\!\left(-\frac{\nu_0 s_0^2 + ns^2}{2\sigma^2}\right)$$

$$\beta \mid \sigma^2, D \sim N\!\left(\tilde\beta,\ \sigma^2(X'X + A)^{-1}\right)$$

$$\sigma^2 \mid D \sim \frac{\nu_1 s_1^2}{\chi^2_{\nu_1}}, \qquad \text{with } \nu_1 = \nu_0 + n, \quad s_1^2 = \frac{\nu_0 s_0^2 + ns^2}{\nu_0 + n}$$
35
IID Simulations
Scheme: $[y \mid X, \beta, \sigma^2]\ [\beta \mid \sigma^2]\ [\sigma^2]$

1) Draw $[\sigma^2 \mid y, X]$
2) Draw $[\beta \mid \sigma^2, y, X]$
3) Repeat
36
IID Simulator, cont.
1) Draw $\sigma^2 \mid y, X$:

$$\sigma^2 = \frac{\nu_1 s_1^2}{\chi^2_{\nu_1}}$$

2) Draw $\beta \mid \sigma^2, y, X \sim N\!\left(\tilde\beta,\ \sigma^2(X'X + A)^{-1}\right)$, with

$$\tilde\beta = (X'X + A)^{-1}(X'X\hat\beta + A\bar\beta)$$

note: if $z \sim N(0, I)$ and $U'U = X'X + A$, then $\tilde\beta + \sigma U^{-1}z \sim N\!\left(\tilde\beta,\ \sigma^2(X'X + A)^{-1}\right)$
37
Bayes Estimator
$$E[\beta \mid D] = E_{\sigma^2|D}\!\left[E[\beta \mid \sigma^2, D]\right] = E_{\sigma^2|D}\!\left[\tilde\beta\right] = \tilde\beta$$
The Bayes Estimator is the posterior mean of β.
Marginal on β is a multivariate student t.
Who cares?
38
Shrinkage and Conjugate Priors
$$\tilde\beta = (X'X + A)^{-1}(X'X\hat\beta + A\bar\beta) \quad \text{shrinks } \hat\beta \text{ toward } \bar\beta$$

The Bayes estimator is the posterior mean of β. This is a "shrinkage" estimator.

$$\tilde\beta \to \hat\beta \text{ as } n \to \infty \quad (\text{Why? } X'X \text{ is of order } n)$$

$$\text{Var}(\beta \mid \sigma^2) = \sigma^2(X'X + A)^{-1} \quad \text{vs. } \sigma^2 A^{-1} \text{ or } \sigma^2(X'X)^{-1}$$

Is this reasonable?
39
Assessing Prior Hyperparameters
The hyperparameters $\bar\beta, A, \nu_0, s_0^2$ determine the prior location and spread for both the coefficients and the error variance.

It has become customary to assess a "diffuse" prior:

$$\bar\beta = 0, \qquad \text{"small" value of } A: \ A = .01 I_k \ \Rightarrow\ \text{Var}(\beta \mid \sigma^2 = 1) = 100 I_k$$

$$\nu_0 \text{ "small", e.g. } \nu_0 = 3; \qquad s_0^2 = 1$$

Setting $s_0^2 = 1$ can be problematic; var(y) might be a better choice.
40
Improper or “non-informative” priors
Classic "non-informative" prior (improper):

$$p(\beta, \sigma^2) \propto p(\beta)\,p(\sigma^2) \propto \frac{1}{\sigma^2}$$

Is this "non-informative"?

Of course not: it says that σ² is large with high prior "probability."
Is this wise computationally?
No, I have to worry about singularity in X’X
Is this a good procedure?
No, it is not admissible. Shrinkage is good!
41
runireg

runireg = function(Data, Prior, Mcmc) {
  #
  # purpose:
  #   draw from posterior for a univariate regression model
  #   with natural conjugate prior
  #
  # arguments:
  #   Data  -- list of data: y, X
  #   Prior -- list of prior hyperparameters:
  #            betabar, A  (prior mean, prior precision)
  #            nu, ssq     (prior on sigmasq)
  #   Mcmc  -- list of MCMC parms:
  #            R     (number of draws)
  #            keep  (thinning parameter)
  #
  # output:
  #   list of beta, sigmasq draws
  #   beta is the k x 1 vector of coefficients
  #
  # model:  Y = X beta + e,  var(e_i) = sigmasq
  # priors: beta | sigmasq ~ N(betabar, sigmasq * A^-1)
  #         sigmasq ~ (nu*ssq)/chisq_nu
42
runireg (continued)

# (unpacking of y, X, n, k and the prior elements from the
#  argument lists is omitted on the slides)
RA=chol(A)
W=rbind(X,RA)
z=c(y,as.vector(RA%*%betabar))
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W=R'R ; (W'W)^-1 = IR IR' -- this is UL decomp
btilde=crossprod(t(IR))%*%crossprod(W,z)
res=z-W%*%btilde
s=t(res)%*%res
#
# first draw Sigma
#
sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
#
# now draw beta given Sigma
#
beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
list(beta=beta,sigmasq=sigmasq)
}
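A usage sketch, not from the slides: since the draws are iid, call runireg repeatedly and stack the results. This assumes the list unpacking omitted on the slide, and uses simulated data with illustrative prior settings:

set.seed(66)
n = 200; k = 2
X = cbind(1, runif(n))
y = X %*% c(1, 2) + 0.5 * rnorm(n)
R = 2000
betadraw = matrix(0, R, k); sigmasqdraw = double(R)
for (r in 1:R) {
  out = runireg(Data = list(y = y, X = X),
                Prior = list(betabar = rep(0, k), A = 0.01*diag(k),
                             nu = 3, ssq = var(as.vector(y))),
                Mcmc = list(R = 1, keep = 1))
  betadraw[r, ] = out$beta
  sigmasqdraw[r] = out$sigmasq
}
plot(betadraw[, 2], type = "l")   # trace plots as on the next two slides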
43
[Figure: trace plot of out$betadraw vs. draw number; 2,000 draws, values roughly between 1.0 and 2.5]
44
[Figure: trace plot of out$sigmasqdraw vs. draw number; 2,000 draws, values roughly between 0.20 and 0.35]
45
Multivariate Regression
$$y_c = X\beta_c + \varepsilon_c, \qquad c = 1, \ldots, m$$

Stacking the m regressions:

$$Y = XB + E$$

$$Y = [y_1, \ldots, y_c, \ldots, y_m], \qquad B = [\beta_1, \ldots, \beta_c, \ldots, \beta_m], \qquad E = [\varepsilon_1, \ldots, \varepsilon_c, \ldots, \varepsilon_m]$$

$$\varepsilon_{row} \sim \text{iid } N(0, \Sigma)$$
46
Multivariate regression likelihood
$$p(Y \mid X, B, \Sigma) \propto |\Sigma|^{-n/2}\exp\!\left(-\frac{1}{2}\sum_{r=1}^{n}(y_r - B'x_r)'\Sigma^{-1}(y_r - B'x_r)\right)$$

$$= |\Sigma|^{-n/2}\,\text{etr}\!\left(-\frac{1}{2}\Sigma^{-1}(Y - XB)'(Y - XB)\right)$$

$$= |\Sigma|^{-(n-k)/2}\,\text{etr}\!\left(-\frac{1}{2}\Sigma^{-1}S\right) \times |\Sigma|^{-k/2}\,\text{etr}\!\left(-\frac{1}{2}\Sigma^{-1}(B - \hat B)'X'X(B - \hat B)\right)$$

$$\text{where } S = (Y - X\hat B)'(Y - X\hat B)$$
47
Multivariate regression likelihood
But $\text{tr}(A'B) = \text{vec}(A)'\,\text{vec}(B)$, so

$$\text{tr}\!\left(\Sigma^{-1}(B - \hat B)'X'X(B - \hat B)\right) = \text{vec}(B - \hat B)'\,\text{vec}\!\left(X'X(B - \hat B)\Sigma^{-1}\right)$$

and $\text{vec}(ABC) = (C' \otimes A)\,\text{vec}(B)$, which gives

$$\text{vec}(B - \hat B)'\left(\Sigma^{-1} \otimes X'X\right)\text{vec}(B - \hat B)$$

therefore,

$$p(Y \mid X, B, \Sigma) \propto |\Sigma|^{-(n-k)/2}\,\text{etr}\!\left(-\frac{1}{2}\Sigma^{-1}S\right) \times |\Sigma|^{-k/2}\exp\!\left(-\frac{1}{2}(\beta - \hat\beta)'\left(\Sigma^{-1} \otimes X'X\right)(\beta - \hat\beta)\right)$$

with $\beta = \text{vec}(B)$, $\hat\beta = \text{vec}(\hat B)$.
48
Inverted Wishart distribution
The form of the likelihood suggests that the natural conjugate (convenient) prior for Σ would be of the Inverted Wishart form:

$$p(\Sigma \mid \nu_0, V_0) \propto |\Sigma|^{-(\nu_0 + m + 1)/2}\,\text{etr}\!\left(-\frac{1}{2}\Sigma^{-1}V_0\right)$$

denoted $\Sigma \sim IW(\nu_0, V_0)$.

$$\text{if } \nu_0 > m + 1, \quad E[\Sigma] = (\nu_0 - m - 1)^{-1}V_0$$

$\nu_0$: tightness; $V_0$: location. However, as $\nu_0$ increases, the spread also increases.

Limitations: i. small $\nu_0$ gives thick tails; ii. only one tightness parameter.
49
Wishart distribution (rwishart)
If $\Sigma \sim IW(\nu_0, V_0)$, then $\Sigma^{-1} \sim W(\nu_0, V_0^{-1})$.

$$\text{if } \nu_0 > m + 1, \quad E[\Sigma^{-1}] = \nu_0 V_0^{-1}$$

Generalization of $\chi^2$: let $\varepsilon_i \sim N(0, \Sigma)$, $i = 1, \ldots, \nu$. Then

$$W = \sum_{i=1}^{\nu}\varepsilon_i\varepsilon_i' \sim W(\nu, \Sigma)$$

The diagonals are $\sigma_{ii}\,\chi^2_\nu$ (see the sketch below).
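A direct R sketch of this construction (m, ν, Σ are illustrative):

m = 3; nu = 10
Sigma = diag(m)
eps = matrix(rnorm(nu * m), nrow = nu) %*% chol(Sigma)  # rows eps_i ~ N(0, Sigma)
W = crossprod(eps)   # W = sum_i eps_i eps_i' ~ W(nu, Sigma)
diag(W)              # each diagonal ~ sigma_ii * chisq_nu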
50
Multivariate regression prior and posterior
Prior:

$$p(\Sigma, B) = p(\Sigma)\,p(B \mid \Sigma)$$

$$\Sigma \sim IW(\nu_0, V_0)$$

$$\beta \mid \Sigma \sim N\!\left(\bar\beta,\ \Sigma \otimes A^{-1}\right)$$

Posterior:

$$\Sigma \mid Y, X \sim IW\!\left(\nu_0 + n,\ V_0 + S\right)$$

$$\beta \mid Y, X, \Sigma \sim N\!\left(\tilde\beta,\ \Sigma \otimes (X'X + A)^{-1}\right)$$

$$\tilde\beta = \text{vec}(\tilde B), \qquad \tilde B = (X'X + A)^{-1}(X'X\hat B + A\bar B)$$

$$S = (Y - X\tilde B)'(Y - X\tilde B) + (\tilde B - \bar B)'A(\tilde B - \bar B)$$
51
Drawing from the Posterior: rmultireg

rmultireg = function(Y, X, Bbar, A, nu, V) {
  # note: Y, X, A, Bbar must be matrices!
  # (n, k, m are the dimensions of Y (n x m) and X (n x k))
  RA = chol(A)
  W = rbind(X, RA)
  Z = rbind(Y, RA %*% Bbar)
  IR = backsolve(chol(crossprod(W)), diag(k))
  # W'W = R'R and (W'W)^-1 = IR IR' -- this is the UL decomp!
  Btilde = crossprod(t(IR)) %*% crossprod(W, Z)
  # IR IR'(W'Z) = (X'X + A)^-1 (X'Y + A Bbar)
  S = crossprod(Z - W %*% Btilde)
  #
  # first draw Sigma
  rwout = rwishart(nu + n, chol2inv(chol(V + S)))
  #
  # now draw B given Sigma
  # beta ~ N(vec(Btilde), Sigma (x) Cov), Cov = (X'X + A)^-1 = IR IR'
  # Sigma = CI CI', so cov(beta) = Sigma (x) Cov = (CI (x) IR)(CI (x) IR)'
  # draw: beta = vec(Btilde) + (CI (x) IR) vec(Z_km),
  #   Z_km a k x m matrix of N(0,1) draws;
  # since vec(ABC) = (C' (x) A) vec(B):  B = Btilde + IR Z_km CI'
  B = Btilde + IR %*% matrix(rnorm(m * k), ncol = m) %*% t(rwout$CI)
  list(B = B, Sigma = rwout$IW)  # return the draws, as in bayesm's rmultireg
}
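A usage sketch with simulated data; rwishart and a full version of rmultireg with this same argument list are available in the bayesm package. The data-generating values below are illustrative:

library(bayesm)
set.seed(66)
n = 100; k = 2; m = 2
X = cbind(1, runif(n))
B = matrix(c(1, 2, -1, 3), ncol = m)
Y = X %*% B + matrix(rnorm(n * m), ncol = m)
out = rmultireg(Y, X, Bbar = matrix(0, k, m), A = 0.01*diag(k),
                nu = m + 3, V = diag(m))
out$B; out$Sigma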
52
Conjugacy is Fragile!
SUR:

$$y_i = X_i\beta_i + \varepsilon_i, \qquad i = 1, \ldots, m$$

A set of regressions "related" via correlated errors.

Given Σ, a normal prior on β would be conjugate. Given β, an IW prior on Σ would be conjugate.

BUT, there is no joint conjugate prior!!